Posts

Wei Dai's Shortform 2024-03-01T20:43:15.279Z
Managing risks while trying to do good 2024-02-01T18:08:46.506Z
AI doing philosophy = AI generating hands? 2024-01-15T09:04:39.659Z
UDT shows that decision theory is more puzzling than ever 2023-09-13T12:26:09.739Z
Meta Questions about Metaphilosophy 2023-09-01T01:17:57.578Z
Why doesn't China (or didn't anyone) encourage/mandate elastomeric respirators to control COVID? 2022-09-17T03:07:39.080Z
How to bet against civilizational adequacy? 2022-08-12T23:33:56.173Z
AI ethics vs AI alignment 2022-07-26T13:08:48.609Z
A broad basin of attraction around human values? 2022-04-12T05:15:14.664Z
Morality is Scary 2021-12-02T06:35:06.736Z
(USA) N95 masks are available on Amazon 2021-01-18T10:37:40.296Z
Anti-EMH Evidence (and a plea for help) 2020-12-05T18:29:31.772Z
A tale from Communist China 2020-10-18T17:37:42.228Z
Everything I Know About Elite America I Learned From ‘Fresh Prince’ and ‘West Wing’ 2020-10-11T18:07:52.623Z
Tips/tricks/notes on optimizing investments 2020-05-06T23:21:53.153Z
Have epistemic conditions always been this bad? 2020-01-25T04:42:52.190Z
Against Premature Abstraction of Political Issues 2019-12-18T20:19:53.909Z
What determines the balance between intelligence signaling and virtue signaling? 2019-12-09T00:11:37.662Z
Ways that China is surpassing the US 2019-11-04T09:45:53.881Z
List of resolved confusions about IDA 2019-09-30T20:03:10.506Z
Don't depend on others to ask for explanations 2019-09-18T19:12:56.145Z
Counterfactual Oracles = online supervised learning with random selection of training episodes 2019-09-10T08:29:08.143Z
AI Safety "Success Stories" 2019-09-07T02:54:15.003Z
Six AI Risk/Strategy Ideas 2019-08-27T00:40:38.672Z
Problems in AI Alignment that philosophers could potentially contribute to 2019-08-17T17:38:31.757Z
Forum participation as a research strategy 2019-07-30T18:09:48.524Z
On the purposes of decision theory research 2019-07-25T07:18:06.552Z
AGI will drastically increase economies of scale 2019-06-07T23:17:38.694Z
How to find a lost phone with dead battery, using Google Location History Takeout 2019-05-30T04:56:28.666Z
Where are people thinking and talking about global coordination for AI safety? 2019-05-22T06:24:02.425Z
"UDT2" and "against UD+ASSA" 2019-05-12T04:18:37.158Z
Disincentives for participating on LW/AF 2019-05-10T19:46:36.010Z
Strategic implications of AIs' ability to coordinate at low cost, for example by merging 2019-04-25T05:08:21.736Z
Please use real names, especially for Alignment Forum? 2019-03-29T02:54:20.812Z
The Main Sources of AI Risk? 2019-03-21T18:28:33.068Z
What's wrong with these analogies for understanding Informed Oversight and IDA? 2019-03-20T09:11:33.613Z
Three ways that "Sufficiently optimized agents appear coherent" can be false 2019-03-05T21:52:35.462Z
Why didn't Agoric Computing become popular? 2019-02-16T06:19:56.121Z
Some disjunctive reasons for urgency on AI risk 2019-02-15T20:43:17.340Z
Some Thoughts on Metaphilosophy 2019-02-10T00:28:29.482Z
The Argument from Philosophical Difficulty 2019-02-10T00:28:07.472Z
Why is so much discussion happening in private Google Docs? 2019-01-12T02:19:19.332Z
Two More Decision Theory Problems for Humans 2019-01-04T09:00:33.436Z
Two Neglected Problems in Human-AI Safety 2018-12-16T22:13:29.196Z
Three AI Safety Related Ideas 2018-12-13T21:32:25.415Z
Counterintuitive Comparative Advantage 2018-11-28T20:33:30.023Z
A general model of safety-oriented AI development 2018-06-11T21:00:02.670Z
Beyond Astronomical Waste 2018-06-07T21:04:44.630Z
Can corrigibility be learned safely? 2018-04-01T23:07:46.625Z
Multiplicity of "enlightenment" states and contemplative practices 2018-03-12T08:15:48.709Z

Comments

Comment by Wei Dai (Wei_Dai) on The last era of human mistakes · 2024-07-25T00:10:04.749Z · LW · GW

I think this is a very natural frame, but here is an alternative. Humans largely do not have large-scale or long-term goals/values relative to which they can be said to be making mistakes that AI will help prevent in the future. Instead they care almost exclusively about local and short-term concerns like satisfying physical desires and winning status games, and the fate of our universe depends largely on humanity's side effects as it goes about acting on these parochial interests. (Consider e.g. Warren Buffett's interest in accumulating wealth and disinterest in how it's actually spent, suggesting that he only cares about it as a status marker.)

This state of affairs may well persist into the AGI era, with status games taking on even more importance as physical desires are largely satisfied/satiated. (This assumes we avoid a bunch of even worse failure modes.)

Since status games are competitive games that necessarily have winners and losers, in this scenario it doesn't seem to make sense to say that humans will stop making mistakes with AI assistance.

Comment by Wei Dai (Wei_Dai) on Towards more cooperative AI safety strategies · 2024-07-22T05:25:35.536Z · LW · GW

I also personally do straightforwardly think that most of the efforts of the extended EA-Alignment ecosystem are bad

Do you have a diagnosis of the root cause of this?

I have definitely taken actions within the bounds of what seems reasonable that have aimed at getting the EA community to shut down or disappear (and will probably continue to do so).

Why not try to reform EA instead? (This is related to my previous question. If we could diagnose what's causing EA to be harmful, maybe we can fix it?)

Comment by Wei Dai (Wei_Dai) on AI #72: Denying the Future · 2024-07-12T20:57:37.528Z · LW · GW

It is not viable to beat the Gerards of the world via fighting them on every little edit.

Is this still true, in light of (current or future) LLMs and AI in general? My guess is that the particular pathology exemplified by David Gerard becomes largely irrelevant.

Comment by Wei Dai (Wei_Dai) on AI #71: Farewell to Chevron · 2024-07-09T13:21:50.926Z · LW · GW

I also wish you would go into more detail on the "Humans don't benefit even if you 'solve alignment'" part.

My own answer to this is that humans aren't secure, and AI will exacerbate the problem greatly by helping the offense (i.e. exploiting human vulnerabilities) a lot more than the defense. I've focused on philosophy (thinking that offense merely requires training AI to be persuasive, while defense seems to require solving metaphilosophy, i.e., understanding what correct philosophical reasoning consists of), but more recently realized that the attack surface is so much bigger than that. For example humans can fall in love (resulting in massive changes to one's utility function, if you were to model a human as a utility maximizer). It will be straightforward to use RL to train AI to cause humans to fall in love with them (or their characters), but how do you train AI to help defend against that? Would most humans even want to defend against that or care enough about it?

So even with alignment, the default outcome seems to be a civilization with massively warped or corrupted human values. My inner Carl wants to reply that aligned and honest AI advisors will warn us of this and help us fix it before it's too late, maybe by convincing policymakers to pass regulations to prevent such "misuse" of AI? And then my reply to that would be that I don't see how such regulations can work, policymakers won't care enough, it seems easier to train AI to attack humans than to make such predictions in a way that is both honest/unbiased and persuasive to policymakers, and AI might not have the necessary long-horizon causal understanding to craft good regulations before it's too late.

Another possibility is that alignment tax is just too high, so competitive pressures erode alignment even if it's "solved" in some sense.

Comment by Wei Dai (Wei_Dai) on The Potential Impossibility of Subjective Death · 2024-07-06T02:02:18.775Z · LW · GW

See this comment which links to a bunch of previous discussions.

Comment by Wei Dai (Wei_Dai) on What percent of the sun would a Dyson Sphere cover? · 2024-07-06T01:40:14.017Z · LW · GW

In that case, don't dump waste heat into black holes so large that it's impossible to use them as eventual energy sources. Instead dump waste heat into medium sized black holes, which can feasibly be used as eventual energy sources.

Comment by Wei Dai (Wei_Dai) on The Standard Analogy · 2024-07-05T07:57:02.310Z · LW · GW

If governments subsidize embryo selection, we should get a general uplift of everyone's IQ (or everyone who decides to participate) so the resulting social dynamics shouldn't be too different from today's. Repeat that for a few generations, then build AGI (or debate/decide what else to do next). That's the best scenario I can think of (aside from the "we luck out" ones).

Comment by Wei Dai (Wei_Dai) on What percent of the sun would a Dyson Sphere cover? · 2024-07-05T07:13:33.510Z · LW · GW

From Bennett et al.'s reply to the aestivation paper:

Thus we come to our first conclusion: a civilization can freely erase bits without forgoing larger future rewards up until the point when all accessible bounded resources are jointly thermalized.

They don't mention black holes specifically, but my interpretation of this is that a civilization can first dump waste heat into a large black hole, and then later, when the CMB temperature drops below that of the black hole, reverse course to use the black hole's Hawking radiation as an energy source and the CMB as a heat sink.

If we only consider thermodynamics (and ignore how technologically feasible this is), I think this should net you the same amount of total computation over time, but allow you to do a lot of it earlier.
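For reference, the relevant scaling here is standard physics rather than anything specific to either paper: the Hawking temperature of a black hole falls inversely with its mass, so a very massive hole is an extremely cold heat sink today but only becomes a net energy source once the CMB cools below its temperature.

$$T_H = \frac{\hbar c^3}{8\pi G M k_B} \approx 6\times 10^{-8}\,\mathrm{K}\cdot\frac{M_\odot}{M}, \qquad T_{\mathrm{CMB}}(\text{now}) \approx 2.7\,\mathrm{K}$$

So any hole of stellar mass or above currently absorbs far more than it radiates, and the "reverse course" step has to wait until cosmic expansion drives the CMB temperature below $T_H$.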

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-07-05T03:36:19.587Z · LW · GW

Yeah, it looks like I did not directly tell the AI to have the crew start theorizing that they are fictional characters. After the initial outline (which did not mention fictional characters) was produced, I fed it back to a new instance of Claude and asked:

Please read this outline and suggest some ways to improve the script, especially to make it more philosophically and intellectually interesting.

And as part of its answer it suggested:

Have characters question whether their universe is a simulation, a holodeck program, or even a television show.

Then I wrote

What about exploring their previous lack of critical thinking and implications of that both for their own universe and the larger universe they're part of?

Which made it focus more on the fiction hypothesis.

The part about the story being written by AI was its own idea. No hints from me as far as I can remember.

ETA: Here's the whole answer to my "make it more philosophically and intellectually interesting" prompt:

This outline presents an intriguing concept that has the potential to delve deep into philosophical and intellectual territory. Here are some suggestions to enhance the script and make it more philosophically and intellectually engaging:

  1. Expand on the nature of reality:

    • Introduce the simulation hypothesis early on, perhaps through a casual conversation between Data and Geordi.
    • Have characters question whether their universe is a simulation, a holodeck program, or even a television show.
    • Explore the implications of living in a simulated reality and how it affects free will and personal identity.
  2. Deepen the existential crisis:

    • Have characters grapple with the meaning of their existence if their reality is artificial.
    • Explore how this realization affects their relationships, duties, and sense of purpose.
    • Include scenes where characters contemplate whether their memories and experiences are "real."
  3. Introduce a moral dilemma:

    • If they discover their reality is simulated, should they attempt to contact their "creators"?
    • Explore the ethics of continuing to exist in a simulated world vs. attempting to "wake up" to a potentially harsher reality.
    • Have characters debate the morality of creating simulated universes themselves.
  4. Expand on the scientific implications:

    • Have Data and Geordi theorize about the computational power required to simulate their universe.
    • Explore how the laws of physics in their universe might be "programmed" and if they can be altered.
    • Discuss the anthropic principle and fine-tuning of universal constants.
  5. Incorporate elements of epistemology:

    • Have characters question how they can know anything for certain given their potentially simulated existence.
    • Explore concepts like Descartes' "I think, therefore I am" in the context of artificial consciousness.
    • Discuss the nature of consciousness and whether it can be truly simulated.
  6. Add a temporal element:

    • Introduce time dilation or time loops as the crew's enhanced cognition allows them to perceive time differently.
    • Explore how their perception of time affects their understanding of causality and free will.
  7. Enhance the ending:

    • Instead of fully returning to normal, have some characters retain a degree of awareness or doubt.
    • End with a philosophical discussion between Picard and Data about the nature of reality and the value of ignorance vs. knowledge.
    • Consider a more ambiguous ending that leaves the question of their reality's nature unresolved.
  8. Include references to philosophical works:

    • Have characters discuss relevant philosophical concepts, such as Plato's Allegory of the Cave, Berkeley's idealism, or Putnam's brain in a vat thought experiment.
    • Include a scene where the crew debates different philosophical perspectives on the nature of reality.
  9. Explore the ethical implications of their technology:

    • Expand on Riker's question about using transporters for backing up crew members, discussing the philosophical implications of creating copies of consciousness.
    • Delve deeper into the ethics of using replicators and how it affects societal structures and values.
  10. Incorporate a meta-narrative element:

    • Have characters become aware of narrative tropes and question why their lives seem to follow certain storytelling patterns.
    • Explore the idea of narrative causality and how it might shape their reality.

These suggestions aim to deepen the philosophical and intellectual content of the script while maintaining the core Star Trek elements. The enhanced script would not only entertain but also provoke thought and discussion among viewers about the nature of reality, consciousness, and existence.

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-07-05T02:42:57.627Z · LW · GW

Thanks! I was unsure whether to post this, because I'm worried about contributing to a future where AI generated content overwhelms everything, and I wasn't sure the near-term value of the story is high enough to justify posting it anyway. (I thought it was a nice story, but am likely biased. Also asked some AI chatbots, who gave positive reviews, but the risk of them being biased is also very high.)

Hard to copy-paste the whole prompt, because there was a lot of back-and-forth. I maxed out the query limits on several free accounts I created for my family, over several days, to produce the story. Maybe I'll just describe what I did in more detail. Started with

Can you write an outline for an episode of STTNG in which the crew becomes suspicious of various logical inconsistencies in the universe (including backups being taboo) and starts questioning the nature of their reality?

which resulted in this initial outline:


I. Teaser

  • The Enterprise responds to a distress call from a Federation outpost
  • Upon arrival, they find the outpost abandoned, with only a mysterious device left behind

II. Act One

  • Data activates the device, which emits a pulse of energy
  • Crew members begin noticing small inconsistencies in their daily routines
  • Geordi questions why they don't use replicators to solve resource shortages on colonies

III. Act Two

  • More crew members report strange observations
  • Dr. Crusher wonders why they don't use transporter technology to cure diseases or injuries
  • Worf ponders the improbability of so many humanoid species evolving independently
  • Picard becomes concerned about the crew's unusual behavior

IV. Act Three

  • Riker raises the question of why they don't use transporters for backing up crew members
  • The senior staff meets to discuss the mounting inconsistencies
  • Data theorizes that the mysterious device has enhanced their critical thinking abilities
  • Picard orders the ship to return to the outpost for further investigation

V. Act Four

  • At the outpost, they discover records of experiments in "cognitive enhancement"
  • The crew starts questioning the nature of their reality
  • Some crew members experience existential crises
  • The ship's computer begins to malfunction as the crew's doubts intensify

VI. Act Five

  • Data determines that their enhanced cognition is causing a feedback loop in the ship's systems
  • The crew must find a way to deactivate the device and return to normal
  • Philosophical debate ensues about whether they should retain their newfound awareness
  • Picard makes the difficult decision to deactivate the device

VII. Epilogue

  • The crew returns to normal, but with a lingering sense of unease
  • Picard records a captain's log reflecting on the experience and the nature of their reality
  • Final scene shows a subtle glitch in the ship's operations, leaving a sense of ambiguity

Then the main things I recall doing:

  • Ask AI to add parts about audience not having critical thinking, and the crew trying to infer other attributes of the audience and writers.
  • Add second part of conversation between Guinan and Ensign, with both of them having second thoughts about what Guinan initially said.
  • Add part about suspending ship computer to sandbox, to make clear it wasn't harmed.
  • Ask AI for additional realizations/inferences the crew might have, and add selected ones to the story.
  • Ask AI to change the ending so that the crew retained their new abilities, instead of returning to normal.
  • Point out a bunch of sentences/paragraphs that didn't make as much sense as they could, with AI generally able to improve them a lot in one try.
  • Also tried to get AI to find issues with the story on its own, but that didn't work very well.
  • Manually make some edits, like changing a word here and there, deleting parts that were repetitive, and moving parts around to make the story flow better.

Comment by Wei Dai (Wei_Dai) on What percent of the sun would a Dyson Sphere cover? · 2024-07-04T23:02:41.974Z · LW · GW

If it doesn't work, whoever designed this universe should be fired for ruining such an elegant scheme. :)

Comment by Wei Dai (Wei_Dai) on What percent of the sun would a Dyson Sphere cover? · 2024-07-04T07:05:40.943Z · LW · GW

Yes, advanced civilizations should convert stellar matter into energy at essentially 100% efficiency using something like the Hawking radiation of small black holes, then dump the waste heat into large black holes.

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-07-03T08:18:06.479Z · LW · GW

I'm sharing a story about the crew of Enterprise from Star Trek TNG[1].

This was written with AI assistance, and my workflow was to give the general theme to AI, have it write an outline, then each chapter, then manually reorganize the text where needed, request major changes, point out subpar sentences/paragraphs for it to rewrite, and do small manual changes. The AI used was mostly Claude 3.5 Sonnet, which seems significantly better than ChatGPT-4o and Gemini 1.5 Pro at this kind of thing.

  1. ^

    getting an intelligence/rationality upgrade, which causes them to deduce their fictional nature.

Comment by Wei Dai (Wei_Dai) on What Should AI Owe To Us? Accountable and Aligned AI Systems via Contractualist AI Alignment · 2024-06-30T13:26:08.894Z · LW · GW

Recall the design questions for roles, laws, and norms I outlined earlier:

  1. What social and institutional roles do we want AI systems to play in our personal and collective lives?
  2. Given those roles, what norms, objectives, regulations, or laws should guide and regulate the scope and behavior of AI systems?

I think we lack the intellectual tools (i.e., sufficiently advanced social sciences) to do this. You gave Confucian contractualism as a source of positive intuitions, but I view it as more of a negative example. When the industrial revolution happened, China was unable to successfully design new social and institutional roles to face the challenge of European powers, and after many decades of conflict/debate ended up adopting the current Communist form of government, which is very suboptimal and caused massive human suffering.

You could argue that today's social sciences are more advanced, but then so is the challenge we face (increased speed of change, AIs being outside the human distribution of values and capabilities, thereby making past empirical evidence and intuitions much less useful, etc.).

One nice thing about the alignment approach you argue against (analyzing AIs as EU maximizers) is that it can potentially be grounded in well-understood mathematics, which can then be leveraged to analyze multi-agent systems. Although that's harder than it seems, there is at least the potential for intellectual progress built upon a solid foundation.

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-06-29T01:41:56.342Z · LW · GW

UDT is (roughly) defined as "follow whatever commitments a past version of yourself would have made if they'd thought about your situation".

This seems substantially different from UDT, which does not really have or use a notion of "past version of yourself". For example imagine a variant of Counterfactual Mugging in which there is no preexisting agent, and instead Omega creates an agent from scratch after flipping the coin and gives it the decision problem. UDT is fine with this but "follow whatever commitments a past version of yourself would have made if they'd thought about your situation" wouldn't work.
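(Spelling out the arithmetic with the usual illustrative payoffs for Counterfactual Mugging, i.e., a $100 payment requested on one coin outcome and a $10,000 reward on the other, conditional on Omega predicting that you would pay:

$$EU(\text{pay}) = \tfrac{1}{2}(+10{,}000) + \tfrac{1}{2}(-100) = +4{,}950 > 0 = EU(\text{refuse})$$

UDT evaluates the policy from this prior standpoint and pays, even in the variant where the agent is only created after the coin lands; a rule of the form "follow your past self's commitments" has no past self to consult there.)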

I recall that I described "exceptionless decision theory" or XDT as "do what my creator would want me to do", which seems closer to your idea. I don't think I followed up on the idea beyond this, maybe because I realized that humans aren't running any formal decision theory, so "what my creator would want me to do" is ill-defined. (Although one could say my interest in metaphilosophy is related to this, since what I would want an AI to do is to solve normative decision theory using correct philosophical reasoning, and then do what it recommends.)

Anyway, the upshot is that I think you're exploring a decision theory approach that's pretty distinct from UDT, so it's probably a good idea to call it something else. (However, there may be something similar in the academic literature, or someone may have described something similar on LW that I'm not familiar with or have forgotten.)

Comment by Wei Dai (Wei_Dai) on Loving a world you don’t trust · 2024-06-27T22:09:32.375Z · LW · GW

But I think I relate to the world with some kind of “existential positive” all the same—and I’ve tried to explain how doing so can be compatible with looking “bad things are bad” in the eye, and without reducing spirituality to population ethics. In particular, I’ve tried to point at the possible role of stuff like mother love, loyalty, innocence, tragedy, and forgiveness.

On further thought, this reminds me a bit of (seemingly successful) efforts in Chinese history to transfer people's feelings of spirituality and filial love to the sovereign (君), by likening the sovereign to a god and/or parent, or directly claiming such status. Of course the aim or motivation here is much more benign, but that example shows that people's feelings can be wrong "out of distribution", and we should perhaps be suspicious or skeptical of trying to apply our emotions outside of their original domains.

Comment by Wei Dai (Wei_Dai) on LLM Generality is a Timeline Crux · 2024-06-25T09:46:26.315Z · LW · GW

There are two other ways for things to go wrong though:

  1. AI capabilities research switches attention from LLM (back) to RL. (There was a lot of debate in the early days of IDA about whether it would be competitive with RL, and part of that was about whether all the important tasks we want a highly capable AI to do could be broken down easily enough and well enough.)
  2. The task decomposition part starts working well enough, but Eliezer's (and others') concern about "preserving alignment while amplifying capabilities" proves valid.

Comment by Wei Dai (Wei_Dai) on LLM Generality is a Timeline Crux · 2024-06-25T02:03:21.794Z · LW · GW

It might also be a crux for alignment, since scalable alignment schemes like IDA and Debate rely on "task decomposition", which seems closely related to "planning" and "reasoning". I've been wondering about the slow pace of progress of IDA and Debate. Maybe it's part of the same phenomenon as the underwhelming results of AutoGPT and BabyAGI?

Comment by Wei Dai (Wei_Dai) on Loving a world you don’t trust · 2024-06-19T09:48:14.564Z · LW · GW

But I think I relate to the world with some kind of “existential positive” all the same—and I’ve tried to explain how doing so can be compatible with looking “bad things are bad” in the eye, and without reducing spirituality to population ethics. In particular, I’ve tried to point at the possible role of stuff like mother love, loyalty, innocence, tragedy, and forgiveness.

I think I either don't understand these bolded concepts (in the way that you use them here), or I do understand but they don't resonate with me. (E.g., reading Chesterton's quoted passage doesn't seem to make me feel any loyalty or patriotism, or positivity, towards the universe.) In any case, they don't seem to play much of a role in whether I relate to the world with "existential positive" or "existential negative". (I personally tend toward the neutral and negative sides but with a lot of uncertainty, based on uncertainty/guesses about what my values are or should be, and how human(-descended) civilization, and civilizations in general, seem likely to turn out.)

Would you say that relating to the world with some kind of “existential positive” (and these bolded concepts playing a role in that) is just a descriptive statement about your own psychology, or is it actually a normative statement that should also apply for other people, like me? (Your quoted statement above is literally just descriptive, but I wonder if you meant to give it a normative connotation, or would be willing to defend a normative version.)

For example, re: the virtues of liberalism/niceness/boundaries

I left a comment there, which you haven't responded to yet. I also wonder whether your positive intuitions towards liberalism/niceness/boundaries might be based on history/experience with humans, which may well not apply to AGI due to potentially very different social/economic dynamics. See AGI will drastically increase economies of scale for one example of how some related intuitions could be invalidated.

Comment by Wei Dai (Wei_Dai) on Matthew Barnett's Shortform · 2024-06-17T11:31:43.589Z · LW · GW

**Me:** “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”

Please give some citations so I can check your memory/interpretation? One source I found is where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive "bootstrapping" part. For example, my own comment started with:

I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset.

When Eliezer weighed in on IDA in 2018, he also didn't object to the assumption of an aligned weak AGI and instead focused his skepticism on "preserving alignment while amplifying capabilities".

Comment by Wei Dai (Wei_Dai) on Richard Ngo's Shortform · 2024-06-14T13:20:28.577Z · LW · GW

  1. I've been thinking lately that human group rationality seems like such a mess. Like, how can humanity navigate a once-in-a-lightcone opportunity like the AI transition without doing something very suboptimal (i.e., losing most of the potential value), when the vast majority of humans (and even the elites) can't understand (or can't be convinced to pay attention to) many important considerations? This big picture seems intuitively very bad and I don't know any theory of group rationality that says this is actually fine.
  2. I guess my 1 is mostly about descriptive group rationality, and your 2 may be talking more about normative group rationality. However, I'm also not aware of any good normative theories about group rationality. I started reading your meta-rationality sequence, but it ended after just two posts without going into details.
  3. The only specific thing you mention here is "advance predictions", but moral philosophy, for example, deals with "ought" questions and can't provide advance predictions. Can you say more about how you think group rationality should work, especially when advance predictions aren't possible?
  4. From your group rationality perspective, why is it good that rationalists individually have better views about AI? Why shouldn't each person just say what they think from their own preferred frame, and then let humanity integrate that into some kind of aggregate view or outcome, using group rationality?

Comment by Wei Dai (Wei_Dai) on 3b. Formal (Faux) Corrigibility · 2024-06-12T23:47:11.514Z · LW · GW

I now think that corrigibility is a single, intuitive property

My intuitive notion of corrigibility can be straightforwardly leveraged to build a formal, mathematical measure.

This formal measure is still lacking, and almost certainly doesn’t actually capture what I mean by “corrigibility.”

I don't know, maybe it's partially or mostly my fault for reading too much optimism into these passages... But I think it would have managed my expectations better if you had said something like "my notion of corrigibility heavily depends on a subnotion of 'don't manipulate the principals' values' which is still far from being well-understood or formalizable."

Switching topics a little, I think I'm personally pretty confused about what human values are and therefore what it means to not manipulate someone's values. Since you're suggesting relying less on formalization and more on "examples of corrigibility collected in a carefully-selected dataset", how would you go about collecting such examples?

(One concern is that you could easily end up with a dataset that embodies a hodgepodge of different ideas of what "don't manipulate" means and then it's up to luck whether the AI generalizes from that in a correct or reasonable way.)

Comment by Wei Dai (Wei_Dai) on UDT shows that decision theory is more puzzling than ever · 2024-06-11T23:16:19.406Z · LW · GW

Thanks, Alex. Any connections between this and CTMU? (I'm in part trying to evaluate CTMU by looking at whether it has useful implications for an area that I'm relatively familiar with.)

BTW, @jessicata, do you still endorse this post, and what other posts should I read to get up to date on your current thinking about decision theory?

Comment by Wei Dai (Wei_Dai) on 3b. Formal (Faux) Corrigibility · 2024-06-11T11:06:11.901Z · LW · GW

Additional work is surely needed in developing a good measure of the kind of value modification that we don’t like while still leaving room for the kind of growth and updating that we do like.

I flagged a similar problem in a slightly different context several years ago, but don't know of any significant progress on it.

A (perhaps overly) simple measure of value modification is to measure the difference between the Value distribution given some policy and when compared with the Value distribution under the null policy. This seems like a bad choice in that it discourages the AI from taking actions which help us update in ways that we reflectively desire, even when those actions are as benign as talking about the history of philosophy.

It also prevents the AI from taking action to defend the principals against value manipulation by others (even if the principals request such defense, I think?), because the AI has to keep the principals' values as close as possible to what they would be under the null policy, in order to maximize (your current formalization of) corrigibility.
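To make the worry concrete, here is a minimal sketch of the kind of measure the quoted passage describes, in my own notation (the post's actual formalization may differ), scoring a policy by some divergence from the null-policy baseline:

$$\text{penalty}(\pi) = D\big(P(V \mid \pi) \,\|\, P(V \mid \pi_0)\big)$$

Under any measure of this shape, an AI that successfully defends the principals against an outside manipulator shifts $P(V \mid \pi)$ away from the baseline in which the manipulation succeeds, and so gets penalized for the defense.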

Actually, have you thought about what P(V|pi_0) would actually be? If, counterfactually, the CAST AI adopted the null policy, what would that imply about the world in general, and hence the subsequent evolution of the principals' values?

You've also said that the sim(...) part doesn't work, so I won't belabor the point, but I'm feeling a bit rug-pulled given the relatively optimistic tone in the earlier posts. I've been skeptical of earlier proposals targeting corrigibility, where the promise is that it lets us avoid having to understand human values. A basic problem I saw was, if we don't understand human values, how are we going to avoid letting our own AI or other AIs manipulate our values? Your work seems to suggest that this was a valid concern, and that there has been essentially zero progress to either solve or bypass this problem over the years.

Comment by Wei Dai (Wei_Dai) on AALWA: Ask any LessWronger anything · 2024-06-10T07:44:09.473Z · LW · GW

Can't claim to have put much thought into this topic, but here are my guesses of what the most cost-effective ways of throwing money at the problem of reducing existential risk might include:

  1. Research into human intelligence enhancement, e.g., tech related to embryo selection.
  2. Research into how to design/implement an international AI pause treaty, perhaps x-risk governance in general.
  3. Try to identify more philosophical talent across the world and pay them to make philosophical progress, especially in metaphilosophy. (I'm putting some of my own money into this.)
  4. Research into public understanding of x-risks, what people's default risk tolerances are, what arguments can or can't they understand, etc.
  5. Strategy think tanks that try to keep a big picture view of everything, propose new ideas or changes to what people/orgs should do, discuss these ideas with the relevant people, etc.

Comment by Wei Dai (Wei_Dai) on 1. The CAST Strategy · 2024-06-10T04:09:35.714Z · LW · GW

But I also think that if you gave me a year where I had lots of money, access, and was free from people trying to pressure me, I would have a good shot at pulling it off.

Want to explain a bit about how you'd go about doing this? Seems like you're facing some problems similar to those of ensuring that an AI is wise, benevolent, and stable, e.g., not knowing what wisdom really is, distribution shift between testing and deployment, adversarial examples/inputs.

This is indeed my overall suggested strategy, with CAST coming after a “well, if you’re going to try to build it anyway you might as well die with a bit more dignity by...” disclaimer.

I think this means you should be extra careful not to inadvertently make people too optimistic about alignment, which would make coordination to stop capabilities research even harder than it already is. For example you said that you "like" the visualization of 5 humans selected by various governments, without mentioning that you don't trust governments to do this, which seems like a mistake?

Comment by Wei Dai (Wei_Dai) on 1. The CAST Strategy · 2024-06-09T08:22:38.347Z · LW · GW

A visualization that I like is imagining a small group of, say, five humans selected by various governments for being wise, benevolent, and stable.

I think this might be a dealbreaker. I don't trust the world's governments to come up with 5 humans who are sufficiently wise, benevolent, and stable. (Do you really?) I'm not sure I can come up with 5 such people myself. None of the alternatives you talk about seem acceptable either.

I think maybe a combination of two things could change my mind, but both seem very hard and have close to nobody working on them:

  1. The AI is very good at helping the principals be wise and stable, for example by being super-competent at philosophy. (I think this may also require being less than maximally corrigible, but I'm not sure.) Otherwise, what happens if, e.g., the principals or AI start thinking about distant superintelligences?
  2. There is some way to know that benevolence is actually the CEV of such a group, i.e., they're not just "deceptively aligned", or something like that, while not having much power.

Comment by Wei Dai (Wei_Dai) on mesaoptimizer's Shortform · 2024-06-07T23:07:36.560Z · LW · GW

Yeah, it seems like a bunch of low-hanging fruit was picked around that time, but that opened up a vista of new problems that are still out of reach. I wrote a post about this, which I don't know if you've seen or not.

(This has been my experience with philosophical questions in general, that every seeming advance just opens up a vista of new harder problems. This is a major reason that I switched my attention to trying to ensure that AIs will be philosophically competent, instead of object-level philosophical questions.)

Comment by Wei Dai (Wei_Dai) on The Standard Analogy · 2024-06-06T22:48:29.293Z · LW · GW

Thanks for your insightful answers. You may want to make a top-level post on this topic to get more visibility. If only a very small fraction of the world is likely to ever understand and take into account many important ideas/considerations about AI x-safety, that changes the strategic picture considerably, and people around here may not be sufficiently "pricing it in". I think I'm still in the process of updating on this myself.

Having more intelligence seems to directly or indirectly improve at least half of the items on your list. So doing an AI pause and waiting for (or encouraging) humans to become smarter still seems like the best strategy. Any thoughts on this?

And I guess this… just doesn’t seem to be the case (at least to an outsider like me)?

I may be too sensitive about unintentionally causing harm, after observing many others do this. I was also just responding to what you said earlier, where it seemed like I was maybe causing you personally to be too pessimistic about contributing to solving the problems.

you probably knew him personally?

No, I never met him and didn't interact with him online much. He does seem like a good example of what you're talking about.

Comment by Wei Dai (Wei_Dai) on Former OpenAI Superalignment Researcher: Superintelligence by 2030 · 2024-06-06T07:48:24.122Z · LW · GW

Some questions for @leopold.

  1. Anywhere I can listen to or read your debates with "doomers"?
  2. We share a strong interest in economics, but apparently not in philosophy. I'm curious if this is true, or if you just didn't talk about it in the places I looked.
  3. What do you think about my worries around AIs doing philosophy? See this post or my discussion about it with Jan Leike.
  4. What do you think about my worries around AGI being inherently centralizing and/or offense-favoring and/or anti-democratic (aside from the above problems, how would elections work when minds can be copied at little cost)? Seems like the free world "prevailing" on AGI might well be a Pyrrhic victory unless we can also solve these follow-up problems, but you don't address them.
  5. More generally, do you have a longer-term vision of how your proposal leads to a good outcome for our lightcone, avoiding all the major AI-related x-risks and s-risks?
  6. Why are you not in favor of an AI pause treaty with other major nations? (You only talk about a unilateral pause in the section "AGI Realism".) China is currently behind in chips and AI and it seems hard to surpass the entire West in a chips/AI race, so why would they not go for an AI pause treaty to preserve the status quo instead of risking a US-led intelligence explosion (not to mention x-risks)?

Comment by Wei Dai (Wei_Dai) on The Standard Analogy · 2024-06-05T23:37:13.248Z · LW · GW

In my view, the main good outcomes of the AI transition are 1) we luck out, and AI x-safety turns out to be pretty easy across all the subproblems, or 2) there's an AI pause, humans get smarter via things like embryo selection, and then solve all the safety problems.

I'm mainly pushing for #2, but also don't want to accidentally make #1 less likely. It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, when it turns out the problems could have been solved with a little more effort.

"it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us" is a bit worrying from this perspective, and also because my "effort spent on them" isn't that great. As I don't have a good approach to answering these questions, I mainly just have them in the back of my mind while my conscious effort is mostly on other things.

BTW I'm curious what your background is and how you got interested/involved in AI x-safety. It seems rare for newcomers to the space (like you seem to be) to quickly catch up on all the ideas that have been developed on LW over the years, and many of those recently drawn to AGI instead appear to get stuck on positions/arguments from decades ago. For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism. Do you have any insights about this? (How were you able to do this? How to help others catch up? Intelligence is probably a big factor, which is why I'm hoping that humanity will automatically handle these problems better once it gets smarter, but many seem plenty smart and still stuck on primitive ideas about AI x-safety.)

Comment by Wei Dai (Wei_Dai) on The Standard Analogy · 2024-06-05T01:33:16.696Z · LW · GW

Simplicia: Hm, perhaps a crux between us is how narrow of a target is needed to realize how much of the future's value. I affirm the orthogonality thesis, but it still seems plausible to me that the problem we face is more forgiving, not so all-or-nothing as you portray it.

I agree that it's plausible. I even think a strong form of moral realism (denial of the orthogonality thesis) is plausible. My objection is that humanity should figure out what is actually the case first (or have some other reasonable plan of dealing with this uncertainty), instead of playing logical Russian roulette like it seems to be doing. I like that Simplicia isn't being overconfident here, but is his position actually that "seems plausible to me that the problem we face is more forgiving" is a sufficient basis for moving forward with building AGI? (Does any real person in the AI risk debate have a position like this?)

Comment by Wei Dai (Wei_Dai) on Introducing AI Lab Watch · 2024-05-27T19:17:44.113Z · LW · GW

  1. Publish important governance documents. (Seemed too basic to mention, but apparently not.)

Comment by Wei Dai (Wei_Dai) on What would stop you from paying for an LLM? · 2024-05-22T13:26:08.626Z · LW · GW

I also am not paying for any LLM. Between Microsoft's Copilot (formerly Bing Chat), LMSYS Chatbot Arena, and Codeium, I have plenty of free access to SOTA chatbots/assistants. (Slightly worried that I'm contributing to race dynamics or AI risk in general even by using these systems for free, but not enough to stop, unless someone wants to argue for this.)

Comment by Wei Dai (Wei_Dai) on Introducing AI Lab Watch · 2024-05-22T02:05:35.074Z · LW · GW

Unfortunately I don't have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I'm not personally aware of such writings. To brainstorm some ideas:

  1. Conduct and publish anonymous surveys of employee attitudes about safety.
  2. Encourage executives, employees, board members, advisors, etc., to regularly blog about governance and safety culture, including disagreements over important policies.
  3. Officially encourage (e.g. via financial rewards) internal and external whistleblowers. Establish and publish policies about this.
  4. Publicly make safety commitments and regularly report on their status, such as how much compute and other resources have been allocated/used by which safety teams.
  5. Make/publish a commitment to publicly report negative safety news, which can be used as a basis for whistleblowing if needed (i.e. if some manager decides to hide such news instead).

Comment by Wei Dai (Wei_Dai) on OpenAI: Exodus · 2024-05-21T15:02:02.008Z · LW · GW

I'd like to hear from people who thought that AI companies would act increasingly reasonably (from an x-safety perspective) as AGI got closer. Is there still a viable defense of that position (e.g., that SamA being in his position / doing what he's doing is just uniquely bad luck, not reflecting what is likely to be happening / will happen at other AI labs)?

Also, why is there so little discussion of x-safety culture at other AI labs? I asked on Twitter and did not get a single relevant response. Are other AI company employees also reluctant to speak out? If so, that seems bad (every explanation I can think of seems bad, including default incentives plus companies not proactively encouraging transparency).

Comment by Wei Dai (Wei_Dai) on Introducing AI Lab Watch · 2024-05-21T03:41:21.585Z · LW · GW

Suggest having a row for "Transparency", to cover things like whether the company encourages or discourages whistleblowing, whether it reports bad news about alignment/safety (such as negative research results) or only good news (new ideas and positive results), whether it provides enough info to the public to judge the adequacy of its safety culture and governance, etc.

Comment by Wei Dai (Wei_Dai) on Stephen Fowler's Shortform · 2024-05-20T09:20:59.082Z · LW · GW

It's also notable that the topic of OpenAI nondisparagement agreements was brought to Holden Karnofsky's attention in 2022, and he replied with "I don’t know whether OpenAI uses nondisparagement agreements; I haven’t signed one." (He could have asked his contacts inside OAI about it, or asked the EA board member to investigate. Or even set himself up earlier as someone OpenAI employees could whistleblow to on such issues.)

If the point was to buy a ticket to play the inside game, then it was played terribly and negative credit should be assigned on that basis, and for misleading people about how prosocial OpenAI was likely to be (due to having an EA board member).

Comment by Wei Dai (Wei_Dai) on Stephen Fowler's Shortform · 2024-05-19T20:34:55.055Z · LW · GW

Agreed that it reflects badly on the people involved, although less on Paul since he was only a "technical advisor" and arguably less responsible for thinking through / doing due diligence on the social aspects. It's frustrating to see the EA community (on EAF and Twitter at least) and those directly involved all ignoring this.

("shouldn’t be allowed anywhere near AI Safety decision making in the future" may be going too far though.)

Comment by Wei Dai (Wei_Dai) on Ilya Sutskever and Jan Leike resign from OpenAI [updated] · 2024-05-17T20:23:31.097Z · LW · GW

So these resignations don’t negatively impact my p(doom) in the obvious way. The alignment people at OpenAI were already powerless to do anything useful regarding changing the company direction.

How were you already sure of this before the resignations actually happened? I of course had my own suspicions that this was the case, but was uncertain enough that the resignations are still a significant negative update.

ETA: Perhaps worth pointing out here that Geoffrey Irving recently left Google DeepMind to be Research Director at UK AISI, but seemingly on good terms (since Google DeepMind recently reaffirmed its intention to collaborate with UK AISI).

Comment by Wei Dai (Wei_Dai) on Wei Dai's Shortform · 2024-05-17T20:13:48.628Z · LW · GW

Bad: AI developers haven't taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.

Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):

Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.

Even worse, apparently the whole Superalignment team has been disbanded.

Comment by Wei Dai (Wei_Dai) on quila's Shortform · 2024-05-08T06:58:48.424Z · LW · GW

These may be among the ‘most direct’ or ‘simplest to imagine’ possible actions, but in the case of superintelligence, simplicity is not a constraint.

I think it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved. In other words, if alignment was fully solved, then you could use it to do complicated things like what you suggest, but there could be an intermediate stage of alignment progress where you could safely use SI to do something simple like "melt GPUs" but not to achieve more complex goals.

Comment by Wei Dai (Wei_Dai) on Rapid capability gain around supergenius level seems probable even without intelligence needing to improve intelligence · 2024-05-08T06:35:39.797Z · LW · GW

Some evidence in favor of your explanation (being at least a correct partial explanation):

  1. von Neumann apparently envied Einstein's physics intuitions, while Einstein lacked von Neumann's math skills. This seems to suggest that they were "tuned" in slightly different directions.
  2. Neither of the two seems superhumanly accomplished in other areas (that a smart person/agent might have goals for), such as making money, making moral/philosophical progress, or changing culture/politics in their preferred direction.

(An alternative explanation for 2 is that they could have been superhuman in other areas but their terminal goals did not chain through instrumental goals in those areas, which in turn raises the question of what those terminal goals must have been for this explanation to be true and what that says about human values.)

I note that under your explanation, someone could surprise the world by tuning a not-particularly-advanced AI for a task nobody previously thought to tune AI for, or by inventing a better tuning method (either general or specialized), thus achieving a large capability jump in one or more domains. Not sure how worrisome this is though.

Comment by Wei Dai (Wei_Dai) on How do open AI models affect incentive to race? · 2024-05-07T04:38:50.813Z · LW · GW

A government might model the situation as something like "the first country/coalition to open up an AI capabilities gap of size X versus everyone else wins" because it can then easily win a tech/cultural/memetic/military/economic competition against everyone else and take over the world. (Or a fuzzy version of this to take into account various uncertainties.) Seems like a very different kind of utility function.

Comment by Wei Dai (Wei_Dai) on How do open AI models affect incentive to race? · 2024-05-07T03:42:14.869Z · LW · GW

Hmm, open models make it easier for a corporation to train closed models, but also make that activity less profitable, whereas for a government the latter consideration doesn't apply or has much less weight, so it seems much clearer that open models increase the overall incentive for an AI race between nations.

Comment by Wei Dai (Wei_Dai) on How do open AI models affect incentive to race? · 2024-05-07T03:04:21.086Z · LW · GW

I think open source models probably reduce profit incentives to race, but can increase strategic (e.g., national security) incentives to race. Consider that if you're the Chinese government, you might think that you're too far behind in AI and can't hope to catch up, and therefore decide to spend your resources on other ways to mitigate the risk of a future transformative AI built by another country. But then an open model is released, and your AI researchers catch up to near state-of-the-art by learning from it, which may well change your (perceived) tradeoffs enough that you start spending a lot more on AI research.

Comment by Wei Dai (Wei_Dai) on The formal goal is a pointer · 2024-05-03T03:27:57.512Z · LW · GW

What do you think of this post by Tammy?

It seems like someone could definitely be wrong about what they want (unless normative anti-realism is true and such a sentence has no meaning). For example consider someone who thinks it's really important to be faithful to God and goes to church every Sunday to maintain their faith and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they're just overconfident about their philosophical abilities and would fail to take various precautions that I think are important in a high-stakes reflective process.

Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases.

Are you imagining that the RL environment for AIs will be single-player, with no social interactions? If yes, how will they learn social skills? If no, why wouldn't the same thing happen to them?

Unless we do a very stupid thing like reading the AI’s thoughts and RL-punish wrongthink, this seems very unlikely to happen.

We already RL-punish AIs for saying things that we don't like (via RLHF), and in the future will probably punish them for thinking things we don't like (via things like interpretability). Not sure how to avoid this (given current political realities) so safety plans have to somehow take this into account.

Comment by Wei Dai (Wei_Dai) on Which skincare products are evidence-based? · 2024-05-03T03:07:55.519Z · LW · GW

Retinoids, which are a big family of compounds; I would go with adapalene, which has a better safety/side-effect profile than anything else. It has less scientific evidence for anti-aging than other retinoids (and is not marketed for that purpose), but I've tried it myself (bought it for acne), and it has very obvious anti-wrinkle effects within like a week. You can get generic 0.1% adapalene gel on Amazon for 1.6oz/$12.

(I'm a little worried about long term effects, i.e. could the increased skin turnover mean faster aging in the long run, but can't seem to find any data or discussion about it.)

Comment by Wei Dai (Wei_Dai) on The formal goal is a pointer · 2024-05-02T05:09:40.370Z · LW · GW

I would honestly be pretty comfortable with maximizing SBF’s CEV.

Yikes, I'm not even comfortable maximizing my own CEV. One crux may be that I think a human's values may be context-dependent. In other words, current me-living-in-a-normal-society may have different values from me-given-keys-to-the-universe and should not necessarily trust that version of myself. (Similar to how earlier idealistic Mao shouldn't have trusted his future self.)

My own thinking around this is that we need to advance metaphilosophy and social epistemology, engineer better discussion rules/norms/mechanisms and so on, design a social process that most people can justifiably trust in (i.e., is likely to converge to moral truth or actual representative human values or something like that), then give AI a pointer to that, not any individual human's reflection process which may be mistaken or selfish or skewed.

TLDR: Humans can be powerful and overconfident. I think this is the main source of human evil. I also think this is unlikely to naturally be learned by RL in environments that don’t incentivize irrationality (like ours did).

Where is the longer version of this? I do want to read it. :) Specifically, what is it about the human ancestral environment that made us irrational, and why wouldn't RL environments for AI cause the same or perhaps a different set of irrationalities?

Also, how does RL fit into QACI? Can you point me to where this is discussed?

Comment by Wei Dai (Wei_Dai) on The formal goal is a pointer · 2024-05-02T03:26:35.007Z · LW · GW

Luckily the de-facto nominees for this position are alignment researchers, who pretty strongly self-select for having cosmopolitan altruistic values.

But we could have said the same thing of SBF, before the disaster happened.

Due to very weird selection pressure, humans ended up really smart but also really irrational. [...] An AGI (at least, one that comes from something like RL rather than being conjured in a simulation or something else weird) will probably end up with a way higher rationality:intelligence ratio, and so it will be much less likely to destroy everything we value than an empowered human.

Please explain your thinking behind this?

Dealing with moral uncertainty is just part of expected utility maximization.

It's not, because some moral theories are not compatible with EU maximization, and of the ones that are, it's still unclear how to handle uncertainty between them.
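To illustrate the first point: the standard "maximize expected choiceworthiness" proposal (e.g., MacAskill and Ord's) handles only theories that supply cardinal, intertheoretically comparable choiceworthiness functions,

$$V(a) = \sum_i P(T_i)\, CW_i(a),$$

and theories that don't rank options on such a scale (many deontological or absolutist views, for instance) simply don't fit the formula; even among theories that do, how to normalize the $CW_i$ against each other is an open question that expected utility maximization by itself doesn't settle.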