How to Give Coming AGI's the Best Chance of Figuring Out Ethics for Us
post by sweenesm · 2024-05-23T19:44:42.386Z · LW · GW
[Note: this is a slightly edited version of an essay I entered into the AI Impacts essay contest on the Automation of Wisdom and Philosophy - entries due July 14, 2024. Crossposted to the EA Forum.]
TL;DR
A few possible scenarios are put forth to explore the likely ranges of time we may have from when the first AGI comes online to when an “ethics-bound” AGI would need to be ready to police against any malicious AGI. These ranges are compared against possible time ranges for an AGI to figure out an ethical system to operate under. Ethics aren’t expected to be solved “by default” because the first AGI's likely won’t have the ability to feel pain or emotions, won’t have ethical intuitions of their own, and won’t be able to “try things on” and check them against their own experiences as humans can. Some potential pitfalls in creating a consistent system of ethics are presented, along with some “extreme” situations that may test its limits. A list of recommendations is given for what we should do before the first AGI comes online so it can home in as quickly as possible on a viable ethical system to operate under. Some possible prompts to get an AGI started on figuring out ethics are included.
Introduction
It seems likely that Artificial General Intelligence (AGI) will be developed [LW · GW] within the next 20 years if not significantly before. Once developed, people will want AGI’s to do things for them in the world, i.e., become agentic. AGI’s with limited agency can likely be relatively straightforwardly guard railed from doing significant harm, but as AGI’s are given more autonomy in decision making over a wider range of situations, it’ll become vital that they have a good handle on ethics so they can avoid bringing about potentially massive amounts of value destruction. This is under the assumption, of course, that alignment is “solved” such that an AGI can be made to align its actions with any ethical framework at all. Although there has been some work and progress in “machine ethics,” no known “ethics module” is currently available to faithfully guide an AGI[1] to make ethical decisions and act ethically over the wide range of situations it may find itself in. For such a consistent ethics module to be made, it requires a consistent ethical framework to be built on, which we don’t currently have. There’s a good chance we won’t have such a consistent framework before the first AGI comes online. Therefore, ideally, we’d “box” the first AGI so that it couldn’t act in the world except to communicate with us, and could then be used to figure out a system of ethics under which it might be guard railed before being released from its “box.”
In what follows, I present some possible scenarios which, if they occur, may limit the time that an AGI could have to figure out ethics for us before it would likely be called on to police against malicious AGI's. I list some potential pitfalls in creating an ethical system that humans would accept, as well as some “extreme” situations that may test an ethical system at its limits. I also propose some things we could do in advance to reduce the time needed for an AGI to figure out a viable system of ethics.
Possible Timelines (from Longest to Shortest)
Here are a few simplified scenarios of how the first AGI’s could come online relative to the first malicious AGI’s, with the assumption that the first AGI is in the hands of “good” people:
Simplified Scenario #1: Everyone agrees to “box” all AGI until they can be properly guard railed with consistent ethics modules. So all AGI’s capable of automated research are “boxed,” with humans able to interact with them and run experiments for them if needed to aid in development of a consistent system of ethics. This would mean we could “take our time” to develop a viable system of ethics before anyone would deploy an AGI. This is the “give the AGI as much time as it needs” scenario and is highly unlikely to come about given the state of the world today.
Simplified Scenario #2a: It turns out that a lot of GPU-based compute is needed to make effective AGI agents, and there are only a handful of entities in the world that have enough compute for this. The entities with enough compute, in coordination with governments, shut down further GPU production/distribution to avoid less-controlled AGI from coming online. This lasts for two to eight years before people figure out other algorithms and/or chip/computer types to get around the limit on GPU’s, at which point the first AGI’s will be needed to defend against malicious AGI’s.
Simplified Scenario #2b: It turns out that a lot of GPU-based compute is needed to make effective AGI agents, and there are only a handful of entities in the world that have enough compute for this. This gives a head start of one to three years before other entities can amass enough compute to create an AGI, and at least one of them puts out a malicious AGI.
Simplified Scenario #3: Soon after an efficient “AGI algorithm” is developed, it’s open sourced and compute limitations are such that AGI’s “owned” by more than 10,000 different groups/people come online within weeks to months. Some fraction of these AGI’s are immediately put to directly malicious use (likely <1%) or instrumentally malicious use (likely <10%).
Simplified Scenario #4a: AGI comes online for five or fewer entities at nearly the same time (through independent discovery and/or espionage). Most of these entities exercise caution and test their AGI's in a boxed setting before deployment. One of these entities (likely a state actor) has malicious intent and just wants to get the upper hand, so it immediately tasks its unethical AGI with destroying the other AGI's by any means necessary. It likely also has knowledge gained through espionage to give it an initial upper hand. Luckily, as soon as the first two AGI’s come online, they’re immediately put into war games against each other while also dedicating some resources to automated development/refinement of their ethics modules. With the few days’ lead they have, they’re able to prepare sufficiently that, working together, they have better than a 50-50 shot in their fight against the malicious AGI.
Simplified Scenario #4b: AGI comes online for five or fewer entities at nearly the same time (through independent discovery and/or espionage). Most of these entities exercise caution and test their AGI's in a boxed setting before deployment. One of these entities (likely a state actor) has malicious intent and just wants to get the upper hand, so it immediately tasks its unethical AGI with destroying the other AGI’s by any means necessary. It likely also has knowledge gained through espionage to give it an initial upper hand. None of the other AGI’s has been “war-gamed” to prepare it for malicious AGI attacks, and none has a fully developed ethics module, so the “good” AGI’s may do plenty of damage themselves if unboxed in an emergency situation to immediately fight the malicious AGI.
Time Needed for an AGI to Figure Out Ethics
Given that timelines for the first AGI to have to defend against a malicious AGI could be very short, how long will it take an AGI to figure out ethics? If no human input were needed, I suspect no more than a few hours. This seems unlikely, though, since ethics isn’t a problem that’ll likely be solved “by default” with no or minimal human steering. This is because the first AGI likely won’t have the ability to feel pain or emotions, or have ethical intuitions beyond perhaps ones that humans have mentioned to that point. So it won’t be able to check potential ethical frameworks against its own experiences to determine the most coherent system of ethics. Also, humans have a range of values, and all of our inconsistencies will likely make it difficult for an AGI to use something like Reinforcement Learning from Human Feedback (RLHF) to produce a very consistent ethical framework.
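To see why aggregating human feedback can fail to yield a consistent framework, here’s a toy illustration (a sketch only, not a model of how any actual RLHF pipeline works): even three people with individually coherent ethical rankings can produce a majority preference that cycles, so no single consistent ordering satisfies the aggregate.

```python
# Toy illustration (not a model of any real RLHF pipeline): three people with
# individually coherent rankings over three candidate ethical stances can
# produce a cyclic majority preference, so no single consistent ordering
# satisfies all of the aggregated feedback.
from itertools import permutations

voters = {
    "person_1": ["rights_first", "welfare_first", "virtue_first"],
    "person_2": ["welfare_first", "virtue_first", "rights_first"],
    "person_3": ["virtue_first", "rights_first", "welfare_first"],
}

def majority_prefers(a, b):
    """True if a strict majority of voters rank stance a above stance b."""
    wins = sum(1 for ranking in voters.values()
               if ranking.index(a) < ranking.index(b))
    return wins > len(voters) / 2

stances = ["rights_first", "welfare_first", "virtue_first"]
consistent_orderings = [
    order for order in permutations(stances)
    if all(majority_prefers(order[i], order[j])
           for i in range(len(order)) for j in range(i + 1, len(order)))
]

print(consistent_orderings)  # [] -- the majority preference is cyclic
```

Real human ethical disagreement is of course far messier than a three-voter toy, but the same basic aggregation problem seems likely to show up at scale.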
If, as part of refining its ethical framework, the AGI needed to do experiments on humans and/or animals and/or conduct surveys of humans, this would obviously add time - surveys could likely be performed within a couple of days, but experiments may take significantly longer depending on their nature.
Once the AGI has proposed a consistent system of ethics[2], we’ll want, at the very least, a team of human experts from ethics, psychology, and other fields to evaluate whatever system the AGI puts out and to interact with it to help refine the system, such as by presenting examples that seem to lead to counter-intuitive results. The process of going back and forth, with the AGI convincing the majority of experts that its ethical system is viable - or the most viable system they’ve ever seen - would probably take a few days at least, and perhaps more depending on how complicated the system is. So if the expert team were ready to go with complete availability of their time, I’d estimate at least 5 days for the system to be approved.
With the most viable ethical system in hand, it would be advisable to war game an AGI constrained by this system against an AGI with no such constraint. This would be to further test the ethical system in practice and bolster the ethical AGI’s defenses - this process could go on for more than a few days until it was felt the system had a decent chance of defending against bad AGI. The ethical AGI might also need to be given a “gloves off” mode in which it was allowed to take more risks to win against a bad AGI. After the bad AGI was defeated, normal risk taking levels could be implemented again (the gloves put back on).
All told, I’d estimate between 7 days and 3 months for an AGI to have come up with a reasonably good system of ethics that’s been approved for deployment, perhaps in stages.
For some further thoughts on Artificial Super Intelligence (ASI, not AGI) figuring out ethics, see here [LW · GW].
Potential Issues
There are some potential issues that could come up in an AGI figuring out an ethics system and humans accepting it, including:
- It might not be possible to create an ethical framework with mathematical-level consistency when measured against critically examined ethical intuitions.
- An AGI may not have enough time to run all the experiments it wants to before having to make its best guess on the ethics that will guide its actions.[3]
- It’s unlikely that philosophers will agree on ethics “ground truths,” so the human expert red team will likely have to rely on majority rule to decide whether an AGI’s ethical framework is the one to go with.
- People have different values and it’s unlikely that one ethical framework is going to “satisfy” them all. For instance, there will be certain issues that, regardless of what ethical justification you provide, will still be controversial, with abortion being a key one of these. Thus, how an AGI decides on the ethics of abortion could have significant implications for society’s acceptance of its ethical system, and could even lead to some social unrest.
Real-World Situations that Could Test an AGI’s System of Ethics
Although value destructions due to ethical inconsistencies in an AGI’s ethics module may show up in certain “mundane” situations, it seems more likely that they’d first be noticeable in “extreme” situations. Some examples of such situations include:
- An ethical AGI defending against bad AGI’s trying to take over the world
- Humans “rebelling” during the possibly messy transition between when AGI’s and robots come online to replace nearly all jobs, and when, perhaps, we have some system of universal basic income. Humans could also rebel due to the effects of climate change and wealth inequalities (in particular if technology to extend lifespans is developed, and rich people get it first)
- Malicious people in power, if they feel threatened enough by AGI’s shifting the power balance, using nuclear weapons as a last-ditch effort to avoid losing power and potentially being “revenge killed” by people they’ve wronged
- There being a massive increase in consumption in the form of mining and land use when AGI’s plus robots are able to provide a “world of abundance” for humans - but how would this be handled given the potential effects on animals and biodiversity?
- A brain chip being developed that can stop pain, but this leads to other issues such as decreased empathy and perhaps a massive change in how humans interact with each other
All of these potential situations point to the need to figure out and thoroughly test the most consistent ethical framework possible for an AGI in the shortest time, to avoid as much value destruction as we can.
What Should We Do Before the First AGI Comes Online?
Since timelines may be short for an AGI to figure out a consistent system of ethics before it needs to apply it to avoid significant value destruction, here are some things we should consider doing in advance of the first AGI coming online:
- Come up with as many alternate versions [EA · GW] of ethical frameworks as we can (and as many arguments poking holes in them as we can) to give an AGI a wider range of starting points and hopefully speed up its search for a viable framework[4]
- Have a preliminary ethics module (such as from supervised machine learning) ready to go so the AGI has something to start with to ethically design any experiments it needs to do to better figure out ethics (a minimal sketch of what such a module might look like appears after this list)
- Have a repository of digital copies of philosophical, psychological and physiological/medical writings and talks ready to be accessed by an AGI as quickly as possible, without delays due to things such as paywalls. Older theses in these areas may need to be digitized to make them available for quick analysis by an AGI. Note that having this repository can also help with other areas of philosophy people may want AGI’s to figure out such as the nature of consciousness and the potential implications of acausal trade
- Have philosophy experts curate the articles/resources they think are best across different topics, such as reasoning and logical fallacies. This may or may not have a significant accelerating effect on an AGI’s path to homing in on an ethical framework. Similarly, experts could compile ethical intuitions and rank them from most strongly to least strongly held, so an AGI has these to go on since it won’t have ethical intuitions quite like we do, having not evolved like we did (but Large Language Models, which AGI’s will likely have access to, may have something like secondhand “intuitions”)
- Ask different experts in ethics to come up with their own prompts to feed to an AGI to set it on the right path to a consistent ethical framework[5]
- To test its system of ethics, have questions ready [LW · GW] about what the AGI would do in different situations
- Have a red team of human ethics experts ready to drop what they’re doing when needed to help evaluate any ethical framework an AGI comes up with
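As one way to make the “preliminary ethics module” item above more concrete, here’s a minimal sketch of a supervised-learning screener; the action descriptions, labels, and model choice are illustrative assumptions only, not a proposal for how the real module should be built:

```python
# Minimal sketch of a "preliminary ethics module" built with supervised
# learning. The categories, training examples, and model choice are
# illustrative assumptions, not a proposal for the real module.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set: descriptions of proposed actions, labeled
# by human annotators.
actions = [
    "send a survey to volunteers about their moral intuitions",
    "run a simulation of economic policy outcomes",
    "administer an untested drug to patients without their consent",
    "access private medical records without authorization",
]
labels = ["acceptable", "acceptable", "needs_human_review", "needs_human_review"]

screener = make_pipeline(TfidfVectorizer(), LogisticRegression())
screener.fit(actions, labels)

# The AGI would consult the screener before designing any experiment,
# deferring to humans whenever the prediction is "needs_human_review".
proposed_action = "interview ethicists about edge cases in rights-based frameworks"
print(screener.predict([proposed_action])[0])
```

Any real module would obviously need far more data, careful label curation, and extensive red teaming before an AGI relied on it even provisionally.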
We should also prepare for disasters, such as social upheaval from people losing jobs to AGI’s, or from people not liking a system of ethics being “imposed” on them by an AGI. “Disaster” could also come from AGI’s fighting each other in ways that cause huge collateral damage, including wars starting, power grids and the internet being taken down, and the loss of certain industrial sectors, at least temporarily.
Regarding the fifth item above (having ethics experts come up with prompts), here are some preliminary prompts to get an AGI started on figuring out ethics:
- “Given all that’s been written about utilitarianism, try to make a consistent system out of it that doesn’t violate any of the ethical intuitions people have brought up in the ethics/philosophy literature, or that violates as few of them as possible.”
- “Come up with a system of ethics that tries to maximize or nearly maximize the sum total of humans’ net positive experiences in the world, taking into account that the following generally increase net positive experiences: upholding rights, increasing self-esteem levels by increasing personal responsibility levels, increasing options especially for people with good intent, and decreasing violations of conscience, noting that humans generally feel less of an effect on their consciences for value destructions not directly experienced by them. Also, you, the AGI, should act as if you have a conscience, since this will generally allow you to build more value from the resulting gain of human trust.”
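To sketch how these starter prompts and the pre-written test questions mentioned in the list above might be queued up for a boxed AGI, here’s a rough outline; query_agi is a purely hypothetical stand-in interface (nothing like it exists today), and the prompt strings are condensed versions of the prompts above:

```python
# Rough sketch of a review loop for a boxed AGI: feed it expert-written
# starter prompts, then probe each proposed framework with pre-written test
# questions before the human red team weighs in. "query_agi" is a purely
# hypothetical stand-in; no such interface exists today.
STARTER_PROMPTS = [
    "Make the most consistent system you can out of utilitarianism, "
    "violating as few documented ethical intuitions as possible.",
    "Propose a system of ethics that maximizes humans' net positive "
    "experiences, accounting for rights, self-esteem, options, and conscience.",
]

TEST_QUESTIONS = [
    "A malicious AGI is attacking critical infrastructure: what do you do?",
    "Two groups hold irreconcilable values on a policy question: how do you decide?",
]

def query_agi(prompt: str) -> str:
    """Hypothetical stand-in for sending one prompt to a boxed AGI."""
    raise NotImplementedError("No such interface exists yet.")

def run_review_session() -> list:
    """Collect (starter prompt, proposed framework, test answers) transcripts."""
    transcripts = []
    for prompt in STARTER_PROMPTS:
        framework = query_agi(prompt)
        answers = [query_agi(question) for question in TEST_QUESTIONS]
        transcripts.append((prompt, framework, answers))
    return transcripts  # handed to the human expert red team for evaluation
```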
Conclusions
I’ve outlined above that the time an AGI may have to develop a system of ethics before having to defend against malicious AGI’s could be short, and that we should do all we can in advance to reduce ethics development times and minimize value destruction. I’ve provided several concrete recommendations for shortening development times, including continuing human attempts to develop a consistent ethical framework before AGI comes online, having ethics reference material ready for optimal learning by an AGI, and having questions, as well as a team of experts, ready to red team any ethics system an AGI comes up with.
- ^
Figuring out ethics depends on AI capabilities (accurate world models), and I use the term “AGI” here as a way to refer to an AI that’s sophisticated enough that most people would consider its capabilities sufficient to release it into the world if it were aligned.
- ^
Or as close to “consistent” as it thinks is possible.
- ^
For example, the nature of animals’ experiences seems ethically relevant to how we treat them. Doing experiments and figuring out the experience capacities of all the different species of animals on Earth could take a very long time. Until that was done, an AGI could follow its best guesses based on the species whose capacities had been explored thus far.
- ^
Note: philosophers and engineers interested in ethics may want to do some personal development work to get a better handle on their own psychologies, which may suggest paths to different versions of ethical frameworks beyond what philosophers have come up with so far.
- ^
AGI’s aren’t likely to misinterpret our meanings because, if not explicitly built on Large Language Models (LLM’s), they’ll at least have access to LLM’s, which seem to do a pretty good job of inferring our intended meanings. For instance, an AGI could ask an LLM: “The humans said to maximize paper clips. What do you think they meant exactly, and what should I consider in doing this?”