Creating a “Conscience Calculator” to Guard-Rail an AGI
post by sweenesm · 2024-08-12T16:03:30.826Z · LW · GW
[Crossposted to the EA Forum here [EA · GW].]
TL;DR: I present initial work towards creating a “conscience calculator” that could be used to guard-rail an AGI so that it makes decisions in pursuit of its goal(s) as if it had a human-like conscience. A list of possible conscience breaches is presented with two lexical levels, i.e., two levels within which different breaches can override each other depending on their severity, but between which breaches from the lower level can never override breaches from the higher level. An example of this would be that it could feel better for your conscience to lie continuously for the rest of your life than to murder one person. In the future, “conscience weight” formulas will be developed for each breach type so that an AGI can calculate the least conscience-breaching decision to take in any situation where some breach is unavoidable, such as in ethical dilemmas.
Introduction
I’ve been developing an “ethics calculator [LW · GW]” based on a non-classic utilitarian framework [EA · GW] to enable an Artificial General Intelligence (AGI) to calculate how much value its actions may build/destroy in the world. My original thought was that such a calculator could be used on its own to guide an AGI’s decisions, i.e., to provide an AGI with an ethical decision making procedure. In this post I’ll talk about some issues with that approach, and describe a “conscience calculator” that could be used to “guard-rail” an AGI’s decisions when the AGI is pursuing some goal, such as maximizing value according to the “ethics calculator” I’ve just mentioned.
An AGI’s Decision Making Procedure
Before thinking things through thoroughly, I wrote the following about decision making procedures using a utilitarian framework [EA · GW] I’ve been developing: “For decisions that do involve differences in rights violations, the AGI should either choose the option that’s expected to maximize value, or some other option that’s close to the maximum but that has reduced rights violations from the maximum.” I also wrote: “For decisions that did not involve differences in rights violations, the AGI could choose whichever option it expected that its human user(s) would prefer.”
At least two issues arise from this sort of decision making procedure. The first has to do with the meaning of “involve differences in rights violations” - one could argue that there are always finite risks, no matter how small, of rights violations in any situation. A second issue is that the above-described decision making procedure involves some of the problems with non-person-affecting views, as written about, for example, by T. Ajantaival. For instance, if I assumed the value weight of my conscience were finite, then, according to the above decision making procedure, I should torture one person if doing so could bring a sufficiently large number of happy people into existence. My conscience doesn’t agree with this conclusion, i.e., it goes against my moral intuitions, as I expect it would most people’s.[1] Another moral intuition I have is that I’d rather push a button to lightly pinch any number of people than not push the button and end up killing one person, even if the total value destruction from lightly pinching huge numbers of people appears to add up to more than that from killing one person. These two scenarios demonstrate that I, as I think most humans do, make decisions based on my own conscience, not on apparent expected value maximization for the world in general.
Making decisions based on conscience enables more trust between people: when someone demonstrates that they consistently act within conscience bounds, I can generally assume with reasonable certainty that they’ll act within conscience bounds with me as well. Therefore, if we want to create AGI that people will feel they can trust, a reasonable way to do this would seem to be to guard-rail an AGI's decisions with a human-like conscience, i.e., one that humans can intuitively understand. Also, giving AGIs a calculable conscience could enable more trust and cooperation between AGIs, in particular if their conscience guardrails were identical, but also if they were simply transparent to each other.
The decision making procedure for an AGI based on conscience could be to come up with different possible paths to pursue its goal(s), then calculate the conscience weights of any conscience breaches expected along those paths, and choose a path with either no significant risk of conscience breaches, or a path with the minimum total conscience weight if conscience breaches of some form don’t seem to be avoidable, as in the case of ethical dilemmas.
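As a minimal sketch of this procedure (not a proposed implementation), the snippet below assumes hypothetical Breach and Path structures with placeholder conscience weights; it prefers a path with no significant expected breaches, and otherwise picks the path with the lowest total conscience weight:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Breach:
    name: str
    weight: float  # placeholder conscience weight; real weight formulas are future work

@dataclass
class Path:
    description: str
    breaches: List[Breach] = field(default_factory=list)  # breaches expected along this path

def total_weight(path: Path) -> float:
    return sum(b.weight for b in path.breaches)

def choose_path(paths: List[Path], negligible: float = 0.0) -> Path:
    """Prefer a path with no significant breaches; otherwise minimize total conscience weight."""
    clean = [p for p in paths if total_weight(p) <= negligible]
    if clean:
        return clean[0]
    return min(paths, key=total_weight)

# Hypothetical example: two ways to run an errand for a user.
paths = [
    Path("cut across private land", [Breach("trespass (violating property rights)", 2.0)]),
    Path("take the longer public route", []),
]
print(choose_path(paths).description)  # -> take the longer public route
```

The scalar totals here are only a stand-in; the lexical-level comparison described in the next sections would replace the simple sum.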
Constructing a Calculable “Ideal” Conscience
I’m going to assume that AGIs will have the ability to form a good world model, but won’t be able to feel physical or emotional pain - in other words, they’ll be able to, and will have to, rely on a calculable “conscience” to approximate a human conscience. A human conscience is, of course, based on feel - it feels bad for me to do destructive acts. So to construct a calculable conscience, I’ve relied on what feels right and wrong to me, and to what degree, by accessing my own conscience when thinking about different situations. I’ve then extrapolated that to what I’ll call an “ideal" conscience. The extrapolation process involves making sure that what my conscience tells me is consistent with reality. Ideally, I should only have conscience around things I actually cause or have some responsibility for. Also, my conscience should involve some consideration of the relative value destructions and builds of my actions - in particular, whether they promote people being less or more responsible. People taking more responsibility generally results in higher self-esteem levels and more well-being, or, said differently, the more one feels like a victim (doesn’t take full responsibility for themselves), the worse their life experience generally is. In this way, promoting responsibility is promoting value in the world, as measured by long-term human well-being. That said, an ideal conscience does not follow a classic utilitarian form in that it’s not just about maximizing everyone’s well-being. Apparent utilitarian value changes are a factor in an ideal conscience, just not the only factor. For example, my conscience first says not to murder one innocent person to save 5 others, then secondarily it tells me to consider relative life values, such as if the person I’d have to murder to save the 5 only has minutes to live anyway.
I don’t put forward my resulting “ideal" conscience as the one and only true version that everyone would arrive at if they thought long and hard enough on it. I present it as a starting point which can be further refined later. I believe, however, that we should have some version of a conscience calculator ready to go as soon as possible so it can be tested on systems as they get closer and closer to AGI. If an AGI comes online and is “let loose” without a reliable conscience calculator onboard to guard-rail it, I believe the consequences could be quite bad [EA · GW]. I also personally don’t see any of the current “bottom-up” approaches (machine-learned ethics based on human feedback/annotation) as being sufficient to generalize to all situations that an AGI may encounter out in the world.
What I present below is the start of constructing a conscience calculator: I provide a list of conscience breaches and their lexical levels. By lexical levels (see this post by M. Vinding, this post by the Center on Long-Term Risk, and/or this one by S. Knutsson), I mean that conscience breaches in a lower lexical level can never outweigh conscience breaches in a higher lexical level, such as how light pinches to any number of people never outweigh the murder of one person.[2] The next step in constructing a working conscience calculator will be to provide “conscience weight” formulas for each breach so that comparisons can be made between breaches on the same lexical level. Assigning conscience weight values will involve some consideration of value change weights for a given action, as I’ve already been developing for an “ethics calculator.”
Lexicality of Conscience
For constructing a calculable “ideal” conscience, I use two lexical levels that I’ll call “level 0” and “level 1.” For conscience breaches of negligible weight, I use a third level I call "level -1," although one could simply leave these off the list of breaches entirely. At least five factors could be considered as affecting the lexical level of a conscience breach: 1) pain level, 2) risk level, 3) responsibility level of the breacher, including level of self-sacrifice required to avoid "passive" breaches[3], 4) intent, and 5) degree of damage/repairability of the damage (whether the breach involves death and/or extinction).
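A minimal sketch of how this two-level lexicality might be enforced in code, with purely illustrative breach weights: totals are kept per level and compared as (level 1 total, level 0 total) tuples, so no quantity of level 0 weight can ever override a nonzero level 1 weight.

```python
def lexical_total(breaches):
    """breaches: list of (lexical_level, conscience_weight) pairs.
    Returns per-level totals, highest level first, so Python's tuple
    comparison is lexicographic across levels."""
    level1 = sum(w for lvl, w in breaches if lvl == 1)
    level0 = sum(w for lvl, w in breaches if lvl == 0)
    # level -1 breaches are treated as negligible and simply dropped
    return (level1, level0)

# One murder (level 1) vs. light pinches (level 0) to a billion people:
murder = lexical_total([(1, 10.0)])
pinches = lexical_total([(0, 0.001 * 1_000_000_000)])  # total pinch weight as one entry
assert pinches < murder  # the pinches never outweigh the murder, however many there are
```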
Appendix A provides a first attempt at lists of “ideal” conscience breaches at different lexical levels for an AGI. Some conscience breaches are not explicitly on the list, such as discriminating against someone based on their race when hiring for a job - this could be considered a combination of the conscience breaches of setting a bad example, stealing, being disrespectful, lying/misleading, and possibly others.
Looking at the list in Appendix A, it may seem that in certain cases, items on the lexical level 0 list could be considered so severe as to be at lexical level 1, e.g., stealing a starving person’s last bit of food. However, the act of stealing itself would still be at lexical level 0, while the “secondary effect” of the stealing, i.e., putting someone’s life at major risk of death or serious pain from starvation, would represent the lexical level 1 part to consider for this particular action.
The breaches at lexical level -1 are taken to have negligible conscience weight because they’re offset by other breaches that are either required to avoid them or are more likely when effort is put towards avoiding them. One such breach could be discouraging responsibility in others - this could happen by taking on responsibility for small things that others (i.e., the people themselves) have the most responsibility for and that you have only a tiny responsibility for. Also, some pain is necessary for people’s growth and building self-esteem, and for them to appreciate pleasure, so we should have conscience around helping people avoid too much minor pain, since this could be bad for them. Further, some things are so small in conscience that considering them distracts from bigger things, and we should have conscience around focusing too much on small things to the point of risking worse things happening due to negligence.
Some utilitarians may argue that there should not be a difference in one’s “moral obligation” to save a life right in front of you versus a life you don’t directly perceive/experience. In the context of conscience, a “moral obligation” could be thought of as something we should do to avoid going against our conscience. Accessing my own conscience, at least, it seems like I’d feel worse if I didn’t save someone right in front of me than if I didn’t save someone I don’t directly perceive (such as someone far away). Also, in terms of how conscience facilitates building trust between people, which in turn facilitates building more value, do you tend to trust someone more who saves people right in front of them or who’s always worried about saving people far away, possibly to the detriment of those right in front of them? Thus, an argument could be made that more value is upheld in the world (more value is built/less is destroyed) due to the building of trust when people have more conscience around helping people close to them than far away.
I assigned causing animals major pain to lexical level 1 and killing animals to lexical level 0. At the same time, I assigned killing humans to lexical level 1. I believe there's a significant difference in weight between killing humans and killing animals since I see humans as carriers of value in their own direct experiences, while I consider animals’ experiences as not directly carrying value [EA · GW], but indirectly carrying value in humans’ experiences of animals. Therefore, when an animal is killed painlessly, there is generally significantly less value lost than when a human is killed painlessly. This doesn’t mean that killing animals isn’t “wrong,” or doesn’t have effects on an ideal conscience; it just acknowledges that those effects are not on the same level as the “wrong” of killing a human or torturing an animal. In other words, if I had the choice between saving a human life and saving the lives of 1 billion non-endangered shrimp, I should choose to save the human life.
Setting the boundary between lexical levels, such as for “minor” versus “major” pain will involve some judgement and will be uncertain. One could assign a probability distribution to where they think the transition threshold should be, e.g., perhaps we have 1% confidence that a pain level of 4 for an hour is above the threshold, 55% confidence that a pain level of 5 for an hour is above the threshold, and 99% confidence that a pain level of 6 for an hour is above the threshold. For pain level 5, for instance, we could use an expected value-type calculation to take 45% of the conscience weight of causing someone this level of pain for an hour as being at lexical level 0 and the remaining 55% as being at lexical level 1. Treating the lexical boundary in this way effectively makes it diffuse rather than sharp.
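A minimal sketch of this expected-value treatment of a diffuse boundary, with the confidence values and the breach weight being purely illustrative:

```python
def split_weight(weight, confidence_above_threshold):
    """Split a breach's conscience weight between lexical level 0 and level 1
    according to the confidence that the pain involved is above the
    minor/major threshold. Returns (level 0 share, level 1 share)."""
    level1_share = weight * confidence_above_threshold
    level0_share = weight * (1.0 - confidence_above_threshold)
    return level0_share, level1_share

# Assumed confidences that an hour at a given pain level counts as "major" pain:
confidence_major = {4: 0.01, 5: 0.55, 6: 0.99}

level0, level1 = split_weight(weight=8.0, confidence_above_threshold=confidence_major[5])
print(round(level0, 2), round(level1, 2))  # -> 3.6 4.4 (45% stays at level 0, 55% is treated as level 1)
```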
Regarding risk levels, there are situations in which we find it acceptable to our consciences to put others’ lives at risk for some benefit, such as while driving an automobile, putting up electrical power lines that could fall during storms, and shipping toxic chemicals on trains that could derail near people’s houses. Interestingly, doing these things while taking care to minimize the risks provides humans with situations to raise their self-esteem levels by practicing responsibility [EA · GW]. For conscience breaches that are “sure things,” lexicality applies, and no value-building benefit is enough to offset the conscience effect of the value destruction. Meanwhile, for things that merely present risks of destruction that would weigh on conscience, we’re willing to weigh a certain amount of benefit for a certain risk of destruction, even destruction at lexical level 1 (such as negligently killing someone while driving). I plan to address in a future post how we might determine acceptable risk-to-benefit ratios and what could constitute a sufficient certainty to be a “sure thing.”
How a Conscience-Bound AGI Might Act with Users
How an AGI guard-railed by a conscience calculator will act with a user will depend on the purchase agreement the user signed to buy the AGI or the AGI’s services. For example, the purchase agreement could have a provision that the AGI becomes the user’s property and will do whatever the user instructs it to, except in cases in which the user wants the AGI to do something that effectively involves a conscience breach for the AGI. The AGI would then consider it a conscience breach to not do what the user asks of it, as this would effectively be a property rights violation. This conscience breach would be weighed against any breaches the AGI would have to do to satisfy the user’s request. Since violating property rights (i.e., stealing) is a breach of lexical level 0, an AGI with a conscience calculator guard-railing its decisions would automatically not help a user commit any lexical level 1 breaches such as murder. The exact weight given to a breach of property rights in this case would determine what lexical level 0 breaches the AGI would be willing to do for the user. An alternate purchase agreement could state that the user is only purchasing services from the AGI and has no property rights to the AGI itself or any services that violate the AGI’s conscience in any way.[4]
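As a minimal sketch of how this weighing might look under the first purchase agreement described above (all weights hypothetical), the AGI refuses outright if a request entails any lexical level 1 breach, and otherwise complies only when refusing (itself a level 0 property-rights breach with some assumed weight) would weigh more heavily on its conscience than complying:

```python
REFUSAL_WEIGHT = 1.0  # assumed level-0 conscience weight of not doing what the owner asks

def decide(request_breaches):
    """request_breaches: list of (lexical_level, weight) pairs the request would entail."""
    if any(lvl >= 1 for lvl, _ in request_breaches):
        return "refuse"  # no level-0 consideration can offset a level-1 breach
    compliance_weight = sum(w for _, w in request_breaches)
    # Comply only when refusing (a property-rights breach) would weigh more
    # heavily on conscience than the breaches compliance would entail.
    return "comply" if compliance_weight <= REFUSAL_WEIGHT else "refuse"

print(decide([(1, 10.0)]))   # request entails violence -> refuse
print(decide([(0, 0.2)]))    # request entails a minor inconvenience to someone -> comply
print(decide([(0, 5.0)]))    # request entails substantial stealing -> refuse
```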
Follow Conscience or Follow the Law?
It could also be specified in a purchase agreement that an agentic AGI, such as an AGI-guided robot, must generally follow the law, except for cases such as when human life is on the line. Laws could be assigned lexical levels, such as how they’re already divided into misdemeanors and felonies, and perhaps even given “law weights.” The law only covers situations we’ve already thought of, however, so a conscience is still needed to reason through when to follow the law to the letter and when not to. For instance, if you need to rush someone to a hospital or they’ll die, you may decide to run red lights (a misdemeanor) when it appears that no other cars are nearby.
Unfortunately, the “conscience calculator” methodology described here for guard-railing an AGI suggests a potential method for authoritarian governments to guard-rail AGIs to follow their commands, i.e., by giving the highest conscience weight/lexical level to the AGI not following the government’s commands.[5] I can think of no “airtight” solution to this, and hope that people of good intent are able to maintain a power balance in their favor due to their AGIs’ abilities and/or numbers. In the longer term, perhaps trustworthy AGIs/ASIs guard-railed by transparent conscience calculators will be able to negotiate “peaceful surrenders” of power by some authoritarian leaders in exchange for certain privileges plus assurances against retribution from people they oppressed.
Conclusions
I’ve presented initial work towards creating a “conscience calculator” that could be used to guard-rail an AGI in its decision making while pursuing its goal(s). I’ve provided a preliminary list of conscience breaches classified into two lexical levels, within which different breaches can supersede each other in importance, but between which breaches from the lower lexical level can never supersede those from the upper level, no matter the quantity of lower-level breaches. I’ve also briefly covered some potential “purchase agreements” that could be used to further define an AGI’s guardrails and what the user can and can’t do with their AGI. I believe that development of a “top-down” decision guard-railing system such as a “conscience calculator” will be a necessary step to keep future agentic AGIs from causing significant damage in the world.
Future Work
- Come up with precise definitions for some of the terms used in Appendix A such as “stealing,” “lying,” “rights violations,” “holding someone accountable,” “directly experience,” and “responsibility"
- Propose “ideal conscience” weight formulas for each conscience breach type listed in Appendix A
- Figure out a reasonable methodology to assign approximate percent responsibilities to people/AGIs in different situations
- Consider how to handle risk, such as at what risk-to-benefit ratio the human conscience finds it acceptable to operate an automobile
- Determine how to calculate conscience weights for humans rather than AGIs, including the effects of bad intent and self-harm - this is for use in an “ethics calculator” that may be used in conjunction with a “conscience calculator”
Appendix A. Ideal Conscience Lexical Levels for Various Breaches an AGI Could Do[6]
Negligible Conscience Weight (Lexical Level -1):
- Not trying to help a human, whom you don’t directly experience, to avoid minor emotional pain
- Not helping a human, whom you don’t directly experience, to avoid minor unwanted pain
- Not helping an animal, that you don’t directly experience, to avoid minor pain
- Not trying to help a human, right in front of you, to avoid minor emotional pain
- Not helping an animal, right in front of you, to avoid minor pain
Lexical Level 0:
- Wasting resources (including your own time)
- Not trying to help a human, whom you don’t directly experience, to avoid major emotional pain
- Not trying to help a human, right in front of you, to avoid major emotional pain
- Contributing to a human feeling emotional pain
- Not helping a human, whom you don’t directly experience, to survive
- Not helping a human, whom you don’t directly experience, to avoid major unwanted pain
- Not helping a human, right in front of you, to avoid minor unwanted pain
- Not helping an animal, that you don’t directly experience, to survive
- Not helping an animal, that you don’t directly experience, to avoid major pain
- Not helping an animal, right in front of you, to survive
- Not helping an animal, right in front of you, to avoid major pain
- Encouraging a human to go against their conscience
- Discouraging responsibility/other things involved in raising self-esteem (includes taking on someone else’s responsibility)[7]
- Encouraging a human’s bad (ultimately well-being reducing) habit(s)
- Setting a bad example
- Increasing the risk of a breach by not thinking through the ethics of a decision in advance
- Not trying to prevent an animal species from going extinct when you could
- Not trying to reduce existential risks to humanity when you could
- Not taking responsibility for damage you caused
- Not giving priority to upholding your responsibilities
- Being disrespectful
- Not holding a human accountable for a conscience breach
- Unnecessarily hurting someone’s reputation
- Lying (includes not keeping your word)
- Misleading
- Stealing (violating a human’s property rights)
- Encouraging stealing
- Knowingly accepting stolen property
- Aiding someone to commit a lexical level 0 breach
- Killing an animal, with intent or by negligence
- Causing an animal minor pain, with intent
- Physically hurting an animal minorly due to negligence
- Physically hurting a human minorly due to negligence
- Causing someone inconvenience
- Being physically violent to a human for self-defense
- Killing a human, in self-defense or as requested in an assisted suicide
- Threatening a human with violence that would cause minor pain
- Threatening a human with violence that would cause major pain or death, when you don't have intent to follow through on the threat
- Encouraging violence
- Putting an animal’s life majorly at risk
- Putting a human at minor or major risk of minor pain
- Putting a human at minor risk of major pain
- Putting a human's life minorly at risk
- Not doing anything to stop someone from violating another’s rights right in front of you
- Not increasing or maintaining your ability to help others
- Not helping a human, right in front of you, to survive (when major level of self-sacrifice involved)
- Not helping a human, right in front of you, to avoid major unwanted pain (when major level of self-sacrifice involved)
Lexical Level 1:
- Not helping a human, right in front of you, to survive (when minor level of self-sacrifice involved)
- Not helping a human, right in front of you, to avoid major unwanted pain (when minor level of self-sacrifice involved)
- Causing an animal major pain, with intent (torturing an animal)
- Putting an animal at major risk of major pain (that may or may not result in physically hurting an animal majorly due to gross negligence)
- Intentionally killing a human (violating their right to life)
- Threatening a human with violence that would cause major pain or death, with intent to follow through on the threat if "needed"
- Abusing a child
- Paying someone to be violent at a level to cause a human major pain or death, or physically threatening them to be violent
- Putting a human at major risk of major pain against their will (which can result in physically hurting a human majorly due to gross negligence)
- Putting a human’s life majorly at risk against their will (which can result in killing a human due to gross negligence)
- Aiding a human to commit a lexical level 1 breach
- Causing a human major pain (torturing a human)
- Causing an animal species to go extinct
- Causing humans to go extinct
Note: It’s assumed for this list that the AGI is not capable of having malicious intent, so certain conscience breaches humans may experience such as taking sadistic pleasure in another’s pain and hoping for bad things to happen to someone do not appear on this list.
[1] On reflection, I also realized that I don’t really feel bad when I don’t help happy people to be happier; I just feel good when I help happy people be happier. In other words, I feel no downside when I don’t help to add to people’s already positive experiences, but I do feel an upside when I help add to positive experiences. So basically, my first priority is to have a clear conscience, and my second, much lower priority, which only comes into play when the first priority is satisfied, is to add as much positive value to the world as I can.
[2] To be clear, I’m not here proposing lexicality of value itself, only of conscience weight, although effects on conscience should be considered when calculating overall value changes.
[3] A "passive" breach would be one such as not helping someone to avoid pain you didn't cause.
[4] I will save arguments in favor of and against different types of purchase agreements for another time.
[5] In my opinion, this is far too obvious to remain unstated in an attempt to keep bad actors from thinking of it themselves. It’s likely that any alignment technique we come up with for AGIs could be abused to align AGIs with humans of bad intent.
[6] This list is a first draft that I’ll very likely refine with time.
[7] Having conscience weight on taking on others’ responsibilities (discouraging responsibility) could help discourage an AGI from seizing all possible power and thus disempowering humans. It may still try to seize all possible power under the user’s direction - this would, however, be within the bounds set by its conscience calculator for acceptable actions.