Finding the Wisdom to Build Safe AI

post by Gordon Seidoh Worley (gworley) · 2024-07-04T19:04:16.089Z · LW · GW · 10 comments

Contents

  Why would wise AI safety researchers matter?
  How did I get wiser?
  What does it mean to be wise?
  How can the right reasons be known?
  What are some wise heuristics?
  How do we find wise heuristics?
  How do we train wise humans?
  How do we know when a human is wise?
  Could we train wise AI the way we train wise humans?
  Can wisdom recognition be automated?
  Can we use Reinforcement Learning to train wisdom?
  How else might we create wise AI?
  Doesn't this plan still risk Goodharting wisdom?
  What's next?

We may soon build superintelligent [? · GW] AI. Such AI poses an existential threat to humanity, and to all life on Earth, if it is not aligned with our flourishing. Aligning superintelligent AI is likely to be difficult because smarts and values are mostly orthogonal [? · GW] and because Goodhart effects [? · GW] are robust [LW · GW], so we can neither rely on AI to naturally decide to be safe on its own nor expect to train it to stay safe. We stand a better chance of creating safe, aligned, superintelligent AI if we create AI that is "wise", in the sense that it knows how to do the right things to achieve desired outcomes and doesn't fall into intellectual or optimization traps.

Unfortunately, I'm not sure how to create wise AI, because I'm not exactly sure what it is to be wise myself. My current, high-level plan for creating wise AI is to first get wiser myself, then help people working on AI safety to get wiser, and finally hope that wise AI safety researchers can create wise, aligned AI that is safe.

For close to a decade now I've been working on getting wiser, and in that time I've figured out a bit of what it is to be wise. I'm starting to work on helping others get wiser by writing a book [? · GW] that explains some useful epistemological insights I picked up while pursuing wisdom and trying to solve a subproblem in AI alignment [LW · GW], and I have vague plans for another book that will focus more directly on the wisdom I've found [? · GW]. Thus far I have only limited ideas about how to create wise AI, but I'll share my current thinking anyway in the hope that it inspires thoughts for others.

Why would wise AI safety researchers matter?

My theory is that it would be hard for someone to know what's needed to build a wise AI without first being wise themselves, or at least having a wiser person to check ideas against. Wisdom clearly isn't sufficient for knowing how to build wise and aligned AI, but it does seem necessary under my assumptions, in the same way that it would be hard to develop a good decision theory for AI if one could not reason for oneself about how to maximize expected value in games.

How did I get wiser?

Mostly by practicing Zen Buddhism, but also by studying philosophy, psychology, mathematics, and game theory to help me think about how to build aligned AI.

I started practicing Zen in 2017. I picked Zen with much reluctance after trying many other things that didn't work for me, or worked for a while and then had nothing else to offer me. Things I tried included Less Wrong style rationality training, therapy, secular meditation, and various positive psychology practices. I even tried other forms of Buddhism, but Zen was the only tradition I felt at home with.

Consequently, my understanding of wisdom is biased by Zen, but I don't think Zen has a monopoly on wisdom, and other traditions might produce theories of wisdom that differ from what I discuss below but are equally useful.

What does it mean to be wise?

I roughly define wisdom as doing the right thing at the right time for the right reasons. This definition puts the word "right" through a strenuous workout, so let's break it down.

The "right thing" is doing that which causes outcomes that we like upon reflection. The "right time" is doing the right thing when it will have the desired impact. And the "right reasons" is having an accurate model of the world that correctly predicts the right thing and time.

How can the right reasons be known?

The straightforward method is to have true beliefs and use correct logic [? · GW]. Alas, we're constantly uncertain about what's true and, in real-world scenarios, the logic becomes intractable to compute, so instead we often rely on heuristics [? · GW] that lead to good outcomes and avoid optimization traps. That we rely on heuristics doesn't mean facts and logic are not useful, only that heuristics are necessary to fill their gaps in most cases. This need for heuristics suggests that the root of wisdom is finding the right heuristics: ones that generate good reasoning, which in turn generates good actions that lead to good outcomes.

What are some wise heuristics?

The two that have been the most important for me are humility and kindness. By humility I mean not deceiving myself into treating my unjustified beliefs as justified, such as by having well-calibrated beliefs [? · GW], not falling for the typical mind fallacy [? · GW], and not seeing myself as separate from reality [? · GW]. By kindness I mean a willingness to take the actions that most benefit myself and others rather than take the actions that optimize for something else, like minimizing the risk of personal suffering or maximizing the chance for personal gain.

I don't have a rigorous argument for these two heuristics, other than that I tried a lot of different ones and these two have so far worked the best to help me find the right things, times, and reasons. Other heuristics might work better for others, or might turn out to be better for me, or might be better for everyone who adopts them. But, for what it's worth, humility and kindness are almost universally recommended by religions and other wisdom traditions, so I suspect that many others agree that these are two very useful wisdom heuristics, even if they are not the maximally useful set.

How do we find wise heuristics?

Existing wisdom traditions, like religions, provide us with a large set of heuristics to choose from. For any particular person looking to become wiser, the problem of finding wise heuristics is mostly one of experimentation. A person can adopt one or more heuristics, then see whether those heuristics help them achieve their reflectively desired outcomes. If not, they can try again with different heuristics until they find ones that work well for them.

We collectively benefit from these individual experiments. For millennia our ancestors ran similar experiments with their lives and have passed on to us the wisdom heuristics that most reliably served them well. Thus the set of heuristics they've provided us with has already been tested and found effective.

There may be other, better wisdom heuristics that cultural evolution could not find, but we should expect no more than marginal improvements over existing heuristics. That's because most wisdom heuristics are shared as simple, fuzzy concepts, so any "new" heuristics are not going to be clearly distinct from existing ones. For example, if someone were to propose I replace my heuristics of humility and kindness with meekness and goodwill, they would have to explain how meekness and goodwill are more than restatements of humility and kindness such that I would have reason to adopt them. Even if these heuristics were different, they would likely only be different on the margin, and would be unlikely to offer Pareto improvements [? · GW] over my existing heuristics.

Therefore I expect that, for the most part, we've already adequately explored the space of wise heuristics for humans, and the challenge of becoming wise is not so much in finding wise heuristics as it is in learning how to effectively apply the ones we already know about.

How do we train wise humans?

I don't know all the ways we might do it, but here's a rough outline of how we train wisdom in Zen.

A student works closely with a teacher over many years. That work includes thousands of hours of meditation, instruction from their teacher in both meditation and ethics, and putting that instruction into practice where the teacher can observe and offer corrections. One purpose of this work is to help the student wake up to the reality of life and free themselves from suffering ("enlightenment"), which requires the cultivation of "wisdom beyond wisdom". The result is that the teacher is said to transmit their wisdom mind-to-mind over the course of this training period.

I'm not going to claim that Zen teachers know how to psychically transmit thoughts directly from one mind to another. Instead, they are slowly and gradually training the student to develop the same generators of thought and action as they have. Thus, rather than training the student to appear wise and enlightened, they are attempting to remake the student into the type of person who is wise and enlightened, and thereby avoid Goodharting wisdom.

As will perhaps be obvious, this is not reliably possible, and yet Zen has managed to transmit its wisdom from one generation to the next without collapsing from runaway Goodharting, so its methods must work to some extent.

How do we know when a human is wise?

Again, I'll answer from my experience with Zen.

It is generally up to a Zen teacher to testify to the wisdom of their students. This is achieved through the rituals of jukai and dharma transmission. In jukai, a student takes vows to behave ethically and follow the wisdom of the teachings, and it comes after a period of training to learn to live those vows. Later, a student may receive dharma transmission and permission to teach if their teacher deems them sufficiently capable and wise, creating a lineage of wisdom certification stretching back hundreds of years.

What's important about these processes is that they place the authority to recognize wisdom in another person. That is, a person is not a reliable judge of their own wisdom, because it is too easy to self-deceive, so we instead rely on another person's judgement. And the people best able to recognize wisdom are those who are themselves wise, having been judged so by others.

Thus we know someone is wise because other people, and wise people in particular, can recognize wisdom in others.

Could we train wise AI the way we train wise humans?

Maybe. It would seem to mostly depend on how similar AI are to us, and to what extent training methods that work on humans would work on AI. Importantly, the answer will hinge on whether or not AI can be trained without Goodharting on wisdom, and whether or not we can tell if an AI has Goodharted wisdom rather than become actually wise.

If we can train AI to be wise at all, that would imply an ability to automate the training, because in theory a wise AI could train other AIs to be wise, the same way wise humans are able to train other humans to be wise. In such a scheme we would only need to train a single wise AI, which could then pass wisdom on to other AIs.

Can wisdom recognition be automated?

I'm not sure. Automation generally requires the use of measurable, legible signals, but in Zen we mostly avoid legibility. Instead we rely on the conservative application of intuitive pattern matching, like a teacher closely observing a student for years. My theory is that this is a culturally evolved defense against Goodharting.

In theory, it might be possible to train an LLM to recognize wisdom in the same way a Zen teacher would, but it would require first finding a way to train this LLM in Zen with a teacher who would be willing to give it dharma transmission. I'm doubtful that we can use training methods that work on humans, but they do offer inspiration. In particular, I suspect Zen's model of mind-to-mind transmission only works because, the typical mind fallacy notwithstanding, some people really do think very similarly, and when a student is similar enough to what the teacher was like prior to their own training, the teacher is able to train the student in the same way they were trained and be largely successful. In short, training succeeds because student and teacher have sufficiently similar minds.

It's always possible that the path to superintelligent AI will pass through designs that closely mimic human minds, but that seems unlikely given we've made tremendous AI progress already with non-human-like designs. Thus it's more likely that, if we were to attempt training wisdom into AIs, we would need to look for ways to do it that would generalize to minds not like ours.

Can we use Reinforcement Learning to train wisdom?

I'm doubtful that we can successfully train wisdom using known RL techniques. The big risk with RL is Goodharting, and I don't see signs that we've found RL methods that are likely to remain sufficiently robust to Goodharting under the extreme optimization pressure of superintelligent AI. At best we might be able to use RL to train wise AI that helps us build wise superintelligent AI, but it would be inadvisable to use RL to directly train a superintelligent AI to be wise.
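
To make the worry concrete, here is a toy illustration in Python of the Goodharting dynamic. This is not real RL training code; the utility function, the proxy reward, and all numbers are invented. The point is that a proxy reward agreeing with the true objective on almost all states gets exploited once enough optimization pressure is applied:

```python
import random

random.seed(0)

def true_utility(x: float) -> float:
    # What we actually care about: peaks at x = 1.
    return -(x - 1.0) ** 2

def proxy_reward(x: float) -> float:
    # A trained or handcrafted stand-in: agrees with true utility almost
    # everywhere, but grossly over-scores a rare sliver of states.
    exploit = 100.0 if abs(x - 7.3) < 0.01 else 0.0
    return true_utility(x) + exploit

def best_of_n(reward_fn, n: int) -> float:
    # More samples = more optimization pressure (best-of-n search).
    return max((random.uniform(-10, 10) for _ in range(n)), key=reward_fn)

for n in (10, 1_000, 100_000):
    x = best_of_n(proxy_reward, n)
    print(f"n={n:>7}  proxy={proxy_reward(x):7.2f}  true={true_utility(x):7.2f}")

# At low n the best candidate sits near x = 1 and both scores agree. With
# enough samples the search almost surely finds the mis-scored sliver:
# proxy reward jumps to ~60 while true utility collapses to ~-40.
```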

How else might we create wise AI?

I don't have a solid answer, but given the ultimate goal is to create superintelligent AI that is aligned with human flourishing, there might be a way to use relatively wise AI to help us bootstrap [LW · GW] a safe, superintelligent AI.

One way this could go is the following:

  1. We train an LLM to be an expert on AI design and wisdom. We might do this by feeding it AI research papers and "wisdom texts", like principled arguments about wise behavior and stories of people behaving wisely, over and above those the base models already have access to, and then fine-tuning it to prioritize giving wise responses (a rough sketch of this step follows the list).
  2. We simultaneously train some AI safety researchers to be wiser.
  3. Our wise AI safety researchers use this LLM as an assistant to help them think through how to design a superintelligent AI that would embody the wisdom necessary to be safe.
  4. Iterate as necessary, using wisdom and understanding developed with the use of less wise AI to train more wise AI.
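
For step 1, here is a minimal, hand-wavy sketch using the Hugging Face transformers and datasets libraries. The base model name, the file wisdom_corpus.txt, and all hyperparameters are placeholders, and it covers only next-token fine-tuning on the corpus; the "prioritize giving wise responses" part would additionally require some form of preference tuning:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder for a capable base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical corpus: AI research papers plus "wisdom texts",
# one document per line.
dataset = load_dataset("text", data_files={"train": "wisdom_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True,
                                  remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wise-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_data,
    # Causal LM objective: predict the next token (mlm=False).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```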

This is an extremely hand-wavy plan, so I offer it only as inspiration. The actual implementation of such a plan will require resolving many difficult questions such as what research papers and wisdom texts the LLM should be trained on, which AI safety researchers are wise enough to succeed in making progress towards safe AI, and when enough progress will have been made that superintelligent AI can safely be created.

Doesn't this plan still risk Goodharting wisdom?

Yep! As with many problems in AI safety, the fundamental problem is preventing Goodharting. The hope I hold on to is that people sometimes manage to avoid Goodharting, such as when a Zen teacher successfully transmits wisdom to their students. Based on such examples of non-Goodharting training regimes, we may find a way to train superintelligent AI that stays safe because it doesn't succumb to Goodhart Curse or other forms of Goodharting [LW · GW].

What's next?

Personally, I'm going to continue to focus on helping myself and others get wiser. I seriously doubt it's the most impactful thing we can do to ensure the creation of safe, aligned, superintelligent AI, but it's the most impactful thing I expect to be able to make progress on right now.

As for you and other readers, I see a few paths forward:

  1. Work on getting wiser yourself.
  2. Share the wisdom you have with others.
  3. Work on training LLMs that not only understand wisdom, but robustly apply it.
  4. Look for ways we might create AIs that could train other AIs to be wise.
  5. Figure out how AI safety research can outpace capabilities progress such that we have the time needed to figure out how to build wise AIs, or more generally how to create AI that is aligned with our flourishing.

Thanks to Owen Cotton-Barratt and Justis Mills for helpful comments on earlier drafts.

10 comments


comment by Chris_Leong · 2024-07-04T23:02:25.541Z · LW(p) · GW(p)

My intuition is that the best way to build wise AI would be to train imitation learning agents on people who we consider to be wise. If we trained imitations of people with a variety of perspectives, we could then simulate discussions between them and try to figure out the best discussion formats between such agents. This could likely get us reasonably far.

The reason I say imitation learning is that it would give us something we could treat as an optimisation target, which is what we require for training ML systems.
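
Concretely, "imitation learning as an optimisation target" could be cashed out as behavioural cloning: supervised learning on (state, action) pairs demonstrated by people judged to be wise. Here is a toy PyTorch sketch, with randomly generated placeholder data standing in for real demonstrations:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 16, 4

# A small policy network mapping states to action logits.
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                       nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder demonstrations; in the proposal these would come from
# records of wise humans acting or answering questions.
states = torch.randn(256, STATE_DIM)
expert_actions = torch.randint(0, N_ACTIONS, (256,))

for _ in range(100):
    optimizer.zero_grad()
    logits = policy(states)
    loss = loss_fn(logits, expert_actions)  # match the demonstrators
    loss.backward()
    optimizer.step()
```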

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2024-07-05T15:36:21.097Z · LW(p) · GW(p)

Seems reasonable. I do still worry quite a bit about Goodharting, but perhaps this could be reasonably addressed with careful oversight by some wise humans to do the wisdom equivalent of red teaming.

Replies from: Chris_Leong
comment by Chris_Leong · 2024-07-06T00:24:58.092Z · LW(p) · GW(p)

You mean it might still Goodhart to what we think they might say? Ideally, the actual people would be involved in the process.

comment by Charbel-Raphaël (charbel-raphael-segerie) · 2024-07-06T08:45:06.992Z · LW(p) · GW(p)

It might not be that impossible to use LLMs to automatically train wisdom:

Look at this: "Researchers have utilized Nvidia’s Eureka platform, a human-level reward design algorithm, to train a quadruped robot to balance and walk on top of a yoga ball."

comment by Anders Lindström (anders-lindstroem) · 2024-07-05T12:50:54.645Z · LW(p) · GW(p)

Are we really sure that we should model AIs in the image of humans? We apparently cannot align people with people, so why would a human replica be any different? If we train an AI to behave like a human, why do we expect the AI to NOT behave like a human? Like it or not, part of what makes us human is lying, stealing, and violence.

"Fifty-two people lost their lives to homicide globally every hour in 2021, says new report from UN Office on Drugs and Crime". https://unis.unvienna.org/unis/en/pressrels/2023/uniscp1165.html

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2024-07-05T15:28:03.694Z · LW(p) · GW(p)

I'm not sure what this comment is replying to. I don't think it's likely that AI will be very human-like, nor do I have special reason to advocate for human-like AI designs. I do note that some aspects of training wise AI may be easier if AI were more like humans, but that's contingent on what I consider to be the unlikely possibility of human-like AI.

comment by Jonas Hallgren · 2024-07-05T09:09:11.913Z · LW(p) · GW(p)

I resonated with the post and I think it's a great direction to draw inspiration from!

A big problem with Goodharting in RL is that you're handcrafting a utility function. In the wisdom traditions, we're encouraged to explore and gain insights into different ideas to form our utility function over time.

Therefore, I feel that setting up the right training environment together with some wisdom principles might be enough to create wise AI.

We, of course, run into all of the annoying inner alignment and deception-during-training style problems, yet still, it seems the direction to go in. I don't think the orthogonality thesis is fully true or false; it is more dependent on your environment, and if we can craft the right one, I think we can have wise AI that wants to create the most loving and kind future imaginable.

comment by alex.herwix · 2024-07-05T08:45:39.478Z · LW(p) · GW(p)
  • We train an LLM to be an expert on AI design and wisdom. We might do this by feeding it AI research papers and "wisdom texts", like principled arguments about wise behavior and stories of people behaving wisely, over and above those the base models already have access to, and then fine-tuning it to prioritize giving wise responses.
  • We simultaneously train some AI safety researchers to be wiser.
  • Our wise AI safety researchers use this LLM as an assistant to help them think through how to design a superintelligent AI that would embody the wisdom necessary to be safe.
  • Iterate as necessary, using wisdom and understanding developed with the use of less wise AI to train more wise AI.

First, I wanted to dismiss this as not addressing the problem at all but on second thought, I think a key insight here may be that adding a focus on improving the wisdom of relevant parties involved in AI development could help to bootstrap more trustworthy "alignment verification" capacities. 

However, I am not sure that something like this would fly in our economically oriented societies, since I would expect that wiser people would decline to develop super-intelligent AI for the foreseeable future and rather urge us to look inward for solutions to most of our problems (almost all of our problems are man-made after all). Having said this, if we were to get a regime in place that could reliably ensure that "wisdom" plays a key role in decision making around AI development, this seems as good a bet as any to help us deal with our predicament.

comment by alex.herwix · 2024-07-05T08:21:35.994Z · LW(p) · GW(p)

If we can train AI to be wise at all, that would imply an ability to automate the training, because in theory a wise AI could train other AIs to be wise, the same way wise humans are able to train other humans to be wise. In such a scheme we would only need to train a single wise AI, which could then pass wisdom on to other AIs.

I think this is way too optimistic. Having trained a wise person or AI once does not mean that we have fully understood what we did to get there, which limits our ability to reproduce it. One can maybe make the argument that, in the context of fully reproducible AI training pipelines, recreation may be possible, or that a wise AI could be copied, but we shouldn't simply assume this. The world is super complex and always in motion. Nothing is permanent. What has worked in one context may not always work in another context. Agents which were considered wise at some point may not be at another, and agents which have actually been wise in hindsight may not have been recognized as such at the time.

In addition, producing one wise AI does not necessarily imply that this wise AI can effectively pass on wisdom at the required scale. It may have a better chance than non-wise AIs but we shouldn't take success as a given, if all we have managed is to produce one wise AI. There are many forces at play here that could subvert or overcome such efforts, in particular in race situations.

My gut feeling is that transmission of wisdom is somewhat of a coordination game that depends on enclaves of relatively wise minds cross checking, challenging, and supporting each other (i.e., Thich Nhat Hanh's “the next Buddha will be a Sangha”). Following this line of logic, the unit of analysis should be the collective or even ecology of minds and practices rather than the "single" wise AI. I acknowledge that this is more of an epistemic rather than ontological distinction (e.g., one could also think of a complex mind as a collective as in IFS) but I think it's key to unpack the structure of wisdom and how it comes about rather than thinking of it as "simply" a nebulous trait that can and needs to be copied.

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2024-07-05T15:34:22.665Z · LW(p) · GW(p)

This is a place where my Zen bias is showing through. When I wrote this I was implicitly thinking about the way we have a system of dharma transmission that, at least as we practice Zen in the west, also grants teaching authorization, so my assumption was that if we feel confident certifying an AI as wise, this would imply also believing it to be wise and skilled enough to teach what it knows. But you're right, these two aspects, wisdom and teaching skill, can be separated, and in fact in Japan this is the case: dharma transmission generally comes years before teaching certification is granted, and many more people receive transmission than are granted the right to teach.