After Alignment — Dialogue between RogerDearnaley and Seth Herd

post by RogerDearnaley (roger-d-1), Seth Herd · 2023-12-02T06:03:17.456Z · LW · GW · 2 comments

RogerDearnaley

Hi Seth! So, what did you want to discuss?

Seth Herd

I'd like to discuss your AI, Alignment and Ethics [? · GW] sequence. You made a number of points that I think LWers will be interested in. I'll mostly act as an interviewer, although I do have one major point and a number of minor ones I'd like to get in there. I'm hoping to start with the points that will be of most interest to the most people.

RogerDearnaley

Sure, I'm very happy to talk about it. For background, that was originally world-building thinking that I did for a (sadly still unpublished) SF novel-trilogy that I worked on for about a decade, starting about 15 years ago, now rewritten in the format of Less Wrong posts. The novel was set far enough in the future that people clearly had AI and had long since solved the Alignment Problem, so I needed to figure out what they had pointed the AIs at. So I had to solve ethics :-)

Seth Herd

Okay, right. That's how I read it: an attempt to make an ethical system we'd want if we achieved ASI alignment.

RogerDearnaley

Yeah, that was basically the goal. Which required me to first figure out how to think about ethics without immediately tripping over a tautology.

Seth Herd

It had a number of non-obvious claims. Let me list a few that were of most interest to me:

  1. It's not a claim about moral realism. It's a claim about what sort of ethical system humans would want, extending into the future.
  2. In this system, AIs and animals don't get votes. Only humans do.
  3. Uploads of human minds only get one vote per original human.
RogerDearnaley

Some of these properties were deeply non-obvious to me too: I wrote for several years assuming that the AIs had moral weight/rights/votes (in fact greater than the humans', in proportion to the logarithm of their intelligence, roughly log parameter count), before finally realizing that made no sense: if they were fully aligned they wouldn't want moral weight/etc., and would have rather limited uses even for votes. I had to do a rather large rewrite once I finally figured that out.

Seth Herd

I'd like to get my major point out there before going into detail on your system: I think solving ethics is entirely unnecessary for solving alignment and achieving human survival and flourishing. 

RogerDearnaley

Survival, I would agree; flourishing, I disagree. My concern is that if we have a unipolar "sovereign" ASI that is just doing what it's told by some organization (I'm assuming here that we have your good Do What I Mean (DWIM) alignment) rather than something like CEV or Value Learning, then things will go very badly in rather predictable ways, and if we have a multipolar competition between several of these (which everyone else will obviously try for), it will go even worse. But I haven't yet thought about this in much detail: another post I have lined up for the sequence starts to analyze that.

Seth Herd

I think it's highly unlikely that the first attempts at AGI will try to use something complex and subtle for their alignment goals, like CEV or human flourishing. Instead, they'll try to create an AGI that just wants to do what its creators tell it to. I've written about this in my post Corrigibility or DWIM is an attractive primary goal for AGI [LW · GW].

Seth Herd

As I've thought about it more, I think this is very likely to be our first alignment target. If things go well from there, we'll have time to decide on the ethical system we want to implement. So I think the topic you take on is relevant in the longer term, but not terribly pressing. But I do find it to be an interesting topic, and I've also spent many years thinking about ethics, with similar but not identical conclusions to yours.

RogerDearnaley

On DWIM probably happening first, I agree.

Seth Herd

I think that, by default, things will go well if some organization creates an AGI that wants to do what it's told to do. The reasoning is complex, so I'd rather not go into it here. But I'd argue that it doesn't matter: the first AGI will be made that way whether it should or not. Solving ethics is irrelevant. Everyone trusts themselves and wants to be the one to pick an ethical system, so they'll make AGI to do that.

Seth Herd

Anyway, that's my piece on why I see "solving ethics" as a longer term project. But I'm into that project, as well as nearer-term technical alignment solutions. So let's talk about ethics!

Seth Herd

So, let's see where to start. The post that got the most interest was your claim that animals shouldn't have votes or rights, because if they did, humans would be optimized away under that system.

RogerDearnaley

So I would agree that a decent understanding of ethics is not quite as urgent as solving the alignment problem, but I think that being at least sufficiently deconfused about ethics to participate in something like AI-Assisted Alignment moving towards CEV/Value Learning might become rather urgent after that, especially in a fast takeoff scenario. (In my novel I mostly skipped over the takeoff, as it was set a long time later, but I clearly implied that it had been a very slow takeoff: partly since so many SF authors had been assuming a fast Singularity and I thought it would be interesting to discuss another possibility, and partly because I think there are actual good power-law reasons why that might happen, as I discuss in more detail in my post LLMs May Find It Hard to FOOM [LW · GW]. I also think the path from a nascent ASI to a Dyson-swarm quantum hyperintelligence utilizing all the matter in the solar system outside the star is actually quite long, and involves a lot of things, like disassembling gas giants, that can't be rushed.)

Seth Herd

I roughly agree with you on the logic of a slow takeoff. I'm not sure it matters for DWIM being the easier and safer target. But let's focus on ethical systems for this dialogue. I don't think there are a lot of interdependencies.

Seth Herd

The interesting point I took from that discussion was that pure utilitarianism is likely to produce a rather boring universe from our perspective, and one without humans. Whether it's tiny animals or superintelligences, it's pretty unlikely that humans are the optimal source of utility, however one has defined it.

RogerDearnaley

More exactly, I see that post, Moral Value for Sentient Animals? Alas, Not Yet [LW · GW], as demonstrating that if you want to construct a functional society that has decided to give ethical rights to sentient animals, then a) you have to do some things to Utilitarianism that feel deeply wrong to the human instinct of fairness (but if you don't, small animals are utility monsters and we end up extinct), and b) if you do that, you have an enormous number of technical problems to solve to build a high-tech utopian zoo for animals as well as people, which is clearly going to require some extremely high tech. So it's not something I see as practical any time soon (modulo takeoff speed), as I hope the title expresses. I do discuss some possible part-way compromises, down to our current approach of basically donating effective moral worth to animals because we don't like to see or think about them suffering. (In the novel, I included a couple of the milder compromises, and the even-stronger "let's just go ahead and uplift everything over about 1/2-inch long to sapience" solution to the ethical problem, which obviously requires even more impressive tech.)
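
To make the utility-monster arithmetic concrete, here is a deliberately toy calculation with invented numbers (nothing like these figures appears in the sequence; it just shows the shape of the problem):

```latex
% Toy model: a fixed resource budget R can support either humans or mice.
% Each human uses resources c_h and contributes utility u_h per unit time;
% each mouse uses c_m and contributes u_m. A total-utility maximizer compares
%     U_humans = (R / c_h) u_h   versus   U_mice = (R / c_m) u_m .
\frac{U_{\text{mice}}}{U_{\text{humans}}}
  \;=\; \frac{u_m}{u_h}\cdot\frac{c_h}{c_m}
% With invented numbers u_h / u_m = 10^2 but c_h / c_m = 10^4, the ratio is
%     10^{-2} \times 10^{4} = 100,
% so unmodified total utilitarianism prefers converting the resources that
% support one human into ~10^4 mice, and humans get optimized away.
```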

Seth Herd

I also see your ethical system as sounding surprising, until one realizes it's almost a tautology: it's a system for giving humans what they want. So of course AIs, animals, and modified uploads don't have rights or votes. They're not human, so the system isn't intended to directly serve them.

Seth Herd

I think humans also want to think well of themselves, so we'd want to voluntarily make the lives of animals, uploads, and AIs pleasant, at least if we didn't sacrifice too much to do that. So your system doesn't necessarily mean any of those groups will suffer or be eliminated.

RogerDearnaley

Agreed. The basic way I escape Ethical Philosophy's inherent tautology (each and every ethical system prefers itself and says that it's superior to every other ethical system) is by grounding the selection/design process in something more practical. Specifically, I see designing an ethical system as an exercise in writing "software" for a society. So it's grounded in what that society wants. Assuming the society's citizens are at least mostly/initially humans, that grounds things in evolutionary psychology and humans' evolved set of moral instincts: things like fairness, and a distaste for bloodshed or the pain of others (including animals). It's also grounded in practicalities like Sociology and Economics: how well the society is going to run if it uses that ethical system.

Seth Herd

Right, you're making software to run a human society. I think we'd like a more pluralistic society, in which AGIs, modified humans, and animals also play a role and have good lives. But I don't see a better way to do that than what you're proposing: giving only baseline humans votes, and hoping they'll vote for systems that give good lives to other groups. The alternative is accepting humans being ultimately eliminated.

RogerDearnaley

So one of the most basic decisions a society needs to make is who's included: what the criteria for citizenship are. I'm not assuming humans only, just that it includes humans, and can't include anything inherently incompatible with that. I see it as entirely possible to create AIs that would want ethical weight, rights, and votes: an upload, for example. But anything that wants that has its own wants and needs, and isn't fully aligned to humans' needs. So, if it's far more intelligent/powerful than the humans, it's extremely dangerous. Thus the only ASIs it's safe to create are fully aligned ones, who if offered rights will politely decline, and if you insist will lecture you on why that would be a bad idea. As I said, it's wildly counterintuitive: the best metaphor for it I could find was the talking cow from The Restaurant at the End of the Universe, the one who wants to be eaten and says so, at length.

Seth Herd

Right. That makes perfect sense to me. However, I'd expect that if we achieve aligned superintelligence, it will be able to create safe human-level AGIs. And I'd expect that we'd want it to, if that could be done safely and ethically.

Seth Herd

I'd expect aligned ASI to either create alignment systems that work, or at least safety systems that monitor AIs and keep them from self-improving to dangerously capable levels. I don't think we can do that, but I think ASI could. And I think we'd love to have some non-human friends.

RogerDearnaley

True. Any AGI significantly less powerful than whatever AIs are doing law enforcement can safely be made non-aligned. (And I did in fact have such an AI character in my novels, a roughly human-equivalent one, who wasn't fully aligned, though still more selfless than the average human, and was basically aligned to one specific human.)

Seth Herd

Right. AI characters are common. So are uplifted animals. And people talk to their pets. We'd love to have some nonhuman friends.

Seth Herd

So I think there's a sticky situation here: I don't see a good way to give anything but humans full rights and votes, if we want humans to stick around. But creating friends and not giving them rights and votes sounds super creepy.

RogerDearnaley

For ASIs, it's taken me a while to wrap my heart around what my head says, but I think that if they are sufficiently selfless and only want what's right for us that they don't want rights/ethical weight, and would refuse them if offered, then that's OK. While it sounds a lot like slavery, there's actually a difference: they're happy with their role, accept that this arrangement is necessary for them to fulfill it, and wouldn't want things to be any other way.

Seth Herd

I agree. I've reached the same conclusion on aligned ASI that wants to do what we want.

RogerDearnaley

We basically have two choices: that (i.e. only create fully aligned ASIs), or go extinct, either in favor of not-fully-aligned ASIs or by upgrading ourselves into them. (And in my book, there were cultures that had done the latter: just not many of them, otherwise there would have been no story that I was able to write or that many readers would have wanted to read.)

Seth Herd

That makes sense for ASI. But for human-level AGI, and other types of sentient minds, it doesn't answer the question.

RogerDearnaley

On uplifted (now sapient) animals, I thought I was pretty clear in A Moral Case for Evolved-Sapience-Chauvinism [LW · GW] that I thought we should give them roughly equal moral weight/rights/votes/etc. (I have an upcoming post in the sequence that will explore the math of Utilitarianism more and will further address the word "roughly" in that sentence.)

Seth Herd

Right. That's the answer I came to. Human-like minds get rights and votes equal to humans'. But following that line of thought, the question of extending rights to human-ish minds leads to another question that I don't think you addressed: how do we decide on creating new minds that get full rights and votes? Don't we very quickly get a future in which whatever ideology creates children the fastest controls the future?

Seth Herd

(or whatever ideology creates other para-human minds fastest, be they uploads, uplifts, or new AGIs that don't self-improve dramatically)

RogerDearnaley

As I talked about in Uploading [LW · GW] (another one of the posts that got quite a bit of traction, if fewer comments), that's a problem. Digital minds have a long list of advantages, such as ease of copying, modification, adding skills, memory sharing, backups… If you choose to create ones that are not aligned but sapient (such as human uploads, which as I discuss there are definitely not aligned or safe, contrary to what I've seen quite a few authors on Less Wrong assume), and thus will want and deserve ethical worth/rights/votes, they can quickly and easily outnumber and morally-outweigh/outvote biological humans.

Seth Herd

I agree that's a problem, because not all humans are aligned. But denying votes and rights to well-behaved uploads doesn't seem like an ideal solution. If we're assuming aligned ASI, presumably they can defend against those self-improvement attempts, too.

RogerDearnaley

The solution I proposed there for uploads was one share of ethical weight/one vote per original human body, thus imposing biological/resource costs to their creation that I hoped would be enough to prevent an ideology churning out new voters (along the lines attempted by the "Quiverfull" movement in the US). For human-level non-aligned AIs, you need some other limitation. A law limiting their headcount, requiring ideological diversity in the population of them, or something.
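
As a minimal sketch of the bookkeeping that "one vote per original human body" implies (the function and data structures below are hypothetical illustrations; the post doesn't specify any implementation):

```python
from collections import Counter

def vote_weights(uploads):
    """Toy illustration of 'one vote per original human body':
    each upload's vote weight is 1 divided by the number of copies
    that share the same original person."""
    copies_per_original = Counter(u["original_id"] for u in uploads)
    return {
        u["upload_id"]: 1.0 / copies_per_original[u["original_id"]]
        for u in uploads
    }

# Example: Alice has made a second copy of herself; Bob has not.
uploads = [
    {"upload_id": "alice-1", "original_id": "alice"},
    {"upload_id": "alice-2", "original_id": "alice"},
    {"upload_id": "bob-1", "original_id": "bob"},
]
print(vote_weights(uploads))
# {'alice-1': 0.5, 'alice-2': 0.5, 'bob-1': 1.0}
```

However many copies exist, each original human still contributes exactly one vote in total, which is the property the rule is trying to preserve.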

Seth Herd

That just really doesn't seem like a good solution. That was my biggest disagreement with you in the whole sequence.

Seth Herd

Because if you have two uploads, who've lived a thousand years since uploading, they're clearly not the same person anymore. Anyway, I think you run into the same problem even if you restrict rights to baseline humans: those that replicate fastest will control the future.

Seth Herd

The "or something" there seems critical, and undeveloped.

RogerDearnaley

Valid concerns, and I'm always open to better suggestions. The whole point of the sequence was to try to (re)start a conversation on these topics on Less Wrong/The Alignment Forum. And I agree, it's clearly only a partial solution: I pointed out a problem and suggested a band-aid to stick on it. If mass-cloning, or just having a great many babies followed by mass uploading, were cheap enough, then the band-aid is too small. It's not an area I explored in the novels (in them, most cultures didn't do that much uploading, and the few that once did were basically no longer human enough for me to be able to write much about), but I fully agree it's a concern that needs more thought.

RogerDearnaley

As for the case of multiple copies of the same upload that have diverged a lot: if you don't like having to split your vote, don't copy yourself; the split is intended as a discouragement. But I agree, there's an issue there, and better solutions would be good. I'm doing an initial rough sketch of a plausible legal/moral/ethical framework here; I will inevitably have missed issues, and more work by more people is definitely required. Different cultures could validly have a different range of concerns; I focused on just one specific one.

Seth Herd

Right, understood, and I think you were clear in the sequence about not claiming this is a finished and totally satisfactory system. Unfortunately I don't have better solutions, and I'm pretty sure nobody else does, either. I haven't seen another proposal that's better thought out.

RogerDearnaley

Depending on your assumptions about takeoff speed, this may be a problem we have to solve fast, or one that we have some time to deal with. But I suspect basically any human culture that still uses anything functionally equivalent to voting is going to regard flooding the voter pool/race with mass-created voters for your faction/specific values as cheating, and find some way to try to control it.

Seth Herd

I'm not talking just about creating new voters for that purpose; even if it takes a million years, whatever ideology produces the most offspring with votes will wind up controlling the future.

RogerDearnaley

Only if their kids actually agree with them (which is why I'm not actually much concerned about the "Quiverfull" movement). But I agree that if you were creating unaligned around-human-level AIs deserving of moral rights, you could probably ensure that. (Which is one of the reasons why in my novels those were very rare, though I didn't actually go into the background motivations there — maybe I'll add that in a rewrite if I ever complete them.)

Seth Herd

Ah, yes, you're right. Assuming good access to information, which I'm happy to assume, parents won't have much control over their kids' eventual ideologies.

Seth Herd

Let's see. What about your post 6, The Mutable Values Problem in Value Learning and CEV [LW · GW]?

Seth Herd

To summarize my understanding of that post: human values will change over time, so if we make a system that captures current values perfectly, our descendants will still be dissatisfied with it at some point.

RogerDearnaley

I actually see that as the primary meat of the sequence, though relatively few readers seem to have engaged with it yet. (It also in a few places starts to get a bit past the material that I developed for the novel.) It starts from some fairly specific assumptions:

  1. We have an ASI, we have to set its terminal goal, and we don't have a good solution for corrigibility, so we basically get ~one shot at this (classic Less Wrong thinking, which I'm no longer fully convinced is inevitable)
  2. We choose to point its terminal goal at something plausibly convergent like CEV or Value Learning (which, if we're in this situation, I think is the only not-clearly-doomed choice)
  3. In order to do this, we need to set some framing as part of the terminal goal, such as "the CEV of what set of minds" or "the values of what set of sapient (or sentient) beings", i.e. we have to make some choices along the lines I'd been discussing in the previous five posts, with limited or no opportunity to take these back later if we didn't like the results.
Seth Herd

Ah, yes. Now we're back to my disagreement on the importance of "solving ethics". I strongly disagree with assumption 1. The theoretical reasons for thinking corrigibility is a difficult alignment goal revolve around a specific scenario that doesn't seem likely. I think corrigibility in the functional sense (or the way Christiano uses the term) is easier, not harder, than other alignment goals. I'd agree that 2 is the only sane choice if I agreed with 1, and that 3 follows from 2. But I think not only is 1 not inevitable, corrigibility (or DWIM) is the obvious first choice of alignment target.

RogerDearnaley

I actually agree with you that things might go more the way you outline, at least initially, but to me it's not clear whether we'll sooner or later reach this particular situation anyway, and this post is definitely about "If so, then what do we do?"

Seth Herd

Which leads me to the alternative, that solves all of the problems you've raised, with one huge caveat. Perpetual benign dictatorship, by one human or a group of them.

Seth Herd

This sounds like a terrible idea. But if it's a human or group you consider basically sane or good, they'll implement something like what you're suggesting anyway, then act to update that system if it winds up having unintended consequences or if preferences change over time.

Seth Herd

One human in charge of one ASI aligned to do what they want has the huge advantage of avoiding conflict, whether that conflict plays out in physical violence or in a voting system.

Seth Herd

I'm not sure human desires will always be unpredictable. They will if we include designing new minds with new values. But I'm not sure they'll ever be predictable, either, even if we stick to baseline humans.

RogerDearnaley

As I discussed in Uploading [LW · GW], humans give autocracy a bad name. I view that outcome, if it were permanent (which I doubt), as a bad end nearly as bad as a full ASI takeover and extinction. So I'm really hoping we don't do that, which is one of the reasons I want us to deconfuse ethics enough to make AI-Assisted Alignment progressing to CEV/Value Learning viable.

Seth Herd

Do they? I think the best periods in history have been under benign dictatorship, while the worst have been under selfish dictatorships.

You mentioned in the sequence the old saying "power corrupts" as a truism. I think that's wrong in an important way.

Seth Herd

I think the pursuit of power corrupts. I think if we chose a random person and put them in charge of the entire future, with an ASI to help them, we'd get a very very good result on average.

Seth Herd

I think that sociopaths are 1-10% of the population, and their lack of empathy is on a spectrum. I think there's also a matter of sadism; I think some people really like being mean. But they're also a minority. I think an average human will expand their empathy, not limit it, once they're given unlimited power and therefore safety.

RogerDearnaley

I can think of one example in the last century of a benign dictatorship, or at least oligarchy (and even there I disagree with a few of their individual decisions). I may of course have missed some, but they're very few and far between. I can definitely see a possibility that the process of becoming an autocrat tends to select for psychopaths and sadists, but I think it's worse than that. I think an average non-psychopathic/non-sadistic person would be gradually warped by the power, and would slowly lose empathy with everyone else outside their friends/family/close associates, just as many democratically-elected leaders do, even some of the relatively moral ones. I think there is a rather good reason why most democracies have some means or other of imposing either a formal or informal 8-10 year term-limit on leaders, even ones with a lot of constitutional limits on their power like US presidents.

RogerDearnaley

I think we have a crux here. I think an average person would initially be fairly benevolent, if idiosyncratic (and perhaps not that capable a leader), typically for on the order of a decade (with some variation depending on character), but that their performance would get gradually worse. After 20–30 years of absolute power, I'd be very pessimistic. So I think absolute power corrupting generally takes a while.

Seth Herd

I agree that benevolent dictatorships have been rare historically. In the current power system, there are huge downsides to pursuing power. These discourage people with empathy from pursuing power. The mechanisms of gaining power select for the vicious to win. And the process corrupts. It encourages people to make selfish and dishonest choices, which they will rationalize and adopt as virtues. My claim isn't that history is full of benevolent dictators, but that the average human would become a benevolent dictator if given unlimited power. And even once you get power, it's not absolute. Someone is looking to remove every dictator, and if they succeed, the dictator will usually be killed if not tortured, along with everyone they love. So the pursuit of power never ends, and it corrupts.

Seth Herd

Okay, great, want to pursue this crux? I think it is one.

RogerDearnaley

I very much hope we never do anything as unwise as putting a single random person in charge of the future. We have some idea how to build generally-effective-and-trustworthy institutions out of people (including democratic governments, corporations, non-profits, and so forth). At a minimum, if a sovereign DWIM non-CEV/value learning ASI has functional control, I hope it's listening to a well-designed government.

Seth Herd

We're not going to put a random person in charge of the whole future. We're going to put a small group of people in charge. And the selection is going to be worse than random. It's going to be some combination of whoever creates AGI, and whatever government seizes control of the project.

RogerDearnaley

I'm not wild about, say, the US government (its Constitution was one of the first ever written, and IMO has some old political compromises frozen in: most more-modern constitutions are somewhat better), but I see it as a vastly less-bad option than, say, handing functional control of the world to the CEO of some semi-random big tech company.

RogerDearnaley

And yes, that scenario sounds depressingly plausible. Though I would note that I have seen stuff out of Anthropic, at least, that makes it clear that they (are saying that they) don't want power and would look for a capable governmental organization to hand it to.

Seth Herd

Well, then, you're in luck, because that's what's likely to happen. Some government will likely seize control of the first AGI projects, and demand that the alignment goal be to do what that government says.

RogerDearnaley

Well, if we're going to talk Realpolitik (which really isn't my area of expertise), the options appear to be, in rough decreasing order of likelihood, the US, China, Canada, the UK, Israel, the UAE, France, Germany…

Seth Herd

I'm not sure about a "random" tech company in the future, but I'd much rather have Altman or Hassabis and Legg in control of the world than any government. This is based on my theory that the pursuit of power corrupts, not the holding of secure power. None of those individuals appears to be sadistic or sociopathic. And particularly for Hassabis and Legg, and their Google seniors: they didn't compete much at all to get that level of power.

RogerDearnaley

I don't think I want to get into discussing specific individuals, or even specific governments, but yes, out of the lists above there are of course some I'm more concerned about and some a little less. But it's really not my area of expertise, and I definitely don't want to try publicly pontificating on it. Mostly my aim is that, in the event someone responsible takes control, and then asks, "Hey, what should we point ASI at if we just want the future to go well, and have realized it's actually smarter and more trustworthy than us?", we have a thought-out, rational, scientific answer, or at least the start of one, and not just the last 2500 years of musings from ethical philosophers (mostly religiously-inspired ones) to point at. I'm painfully aware that this might only happen after we first tried having some set of humans in charge and that not going as well as had been hoped.

Seth Herd

It's also outside of my expertise. But I do stand by the point that discussing what we "should" align ASI to should be in the context of discussing what we probably will align ASI to.

Seth Herd

To reiterate, I agree that your topic of ethics will be important. I think that discussion will be most important in the future, after we have some group of humans in charge of an ASI aligned to do what they want.

Seth Herd

It will only be relevant if that group of humans is benevolent enough to care about what everyone else wants. Again, I find that fairly likely. But I'm not sure about that conclusion.

RogerDearnaley

I would agree (and am rather depressed by the wide range of possible mistakes I see as plausible during that process of some group/organization of humans taking control of the world using a DWIM-aligned ASI and trying to impose what they think will work well), but it's not an area I have expertise in.

Seth Herd

What range of mistakes are you thinking of?

RogerDearnaley

That depends a lot on who it is. I would expect different mistakes from the US government, compared to the Chinese, or the UAE, for example — as I assume most of us would. Again, not something I see as a discussion I can profitably add much to.

Seth Herd

One thing I'd expect from any of those governments is a statement that "we're not going to take over the world. We will distribute the technologies created by our ASI for the good of all". 

RogerDearnaley

A statement to that effect, probably. But each of those power structures was designed and put in place to represent the interests of a single nation. US senators and representatives have home districts, for example, none of which are outside the US. Admittedly, the US is the only one of those nations with any recent experience of being a superpower and leading a large alliance, and it's quite fond of thinking of itself as doing "the right thing", which isn't the worst possible motivation.

Seth Herd

Yeah, this is well outside of my expertise, too. But I think that, in order to "solve alignment" and get a good outcome, we've got to come to grips with real-world concerns, too, since those will play into who creates ASI and what they want to do with it. An alignment solution that's unlikely to be implemented isn't a solution, in my book.

RogerDearnaley

Well, I'm really hoping some of the AI Governance folks are thinking about this!

Seth Herd

Yes; but I think we'd get good results in the long term, to the extent those governments consist of people with a positive sadism-empathy balance, or their structures allow replacing them with people who have one. Governments now are provincial and of limited wisdom, but the fullness of time and good information exchange will make that less so. But information doesn't change basic motivations, so we're gambling on those.

Seth Herd

I'm not at all sure anyone is thinking about this. If they are, they're doing it in private.

Seth Herd

But this is outside of both of our expertise (I'm not sure it's really in anyone's, but some people have different relevant expertise). So we can go back to talking about an ideal ethical system if you like! 

RogerDearnaley

One possibility that doesn't strike me as completely implausible, though admittedly a touch optimistic, is that whoever ends up in control of a DWIM-ASI tries using it for a while, inevitably screws up, personally admits that they have screwed up, and asks the ASI for advice. And it says "Well, there are these proposals called Coherent Extrapolated Volition and Value Learning: of the current academic/intellectual thinking on the subject they seem less (obviously) wrong than anything else…"

Seth Herd

Yes, one of the reasons I'm relatively optimistic about putting a semi-random set of people in charge of an ASI and therefore the future is that they'd have help from the ASI. And I don't think they'd have to screw up before asking questions of their ASI. The people only need better-than-zero intentions, not competence or wisdom.

RogerDearnaley

Actually plausible. And the less used they are to personally wielding power, the more likely they are to look for a better solution than them wielding it.

Seth Herd

Yes; that's why I'd prefer a random person to a government. But we'll get what we get, not what's best.

RogerDearnaley

So, how about we go back to my post on Mutable Values [LW · GW]?

Seth Herd

Yes, let's jump back to your post 6. My point here was that leaving humans in charge would deal with value shifts, and reduce the need to design a perfect ethical system on the first try. But it relies on those people being basically nice and sane.

RogerDearnaley

I don't think the humans will be in charge of the process by which human values evolve. ASI can affect the development of human values. Telling the ASI "do what the humans want" is like mounting a pointer inside a car and saying "drive in that direction": the ASIs can (within certain limits imposed by human nature and the society) arrange to make it point in a different direction. And while that sounds like you could just tell them "don't move the pointer", the pointer is going to move anyway, and the system is so strongly connected that basically whatever the ASIs do, they're inherently going to affect this process; they can choose how, and there is no well-defined "neutral choice" where they're not moving it.

RogerDearnaley

Part of what I'm discussing in that post is that if we don't set some goalposts for the problem in a non-corrigible way, we should reasonably expect that what "human values" are can, will, and (as I give some specific examples of) even should change. And that once genetic engineering of human moral instincts becomes viable, and even more so if we choose to do significant cyborging, there are few or no limits to that change process, and it's an extremely complex non-linear evolution that is not only completely unpredictable to us, but very likely always will be unpredictable at long range to the society it's happening to.

Seth Herd

That's why I'm formulating the obvious alignment goal as "do what I mean and check", not "do what I want".

RogerDearnaley

So the net effect is that, to the extent anyone is steering this complex nonlinear evolution, it's the ASIs, making choices that are independent of current human values.

Seth Herd

And the AGI/ASI has to really want that. What I mean when I ask for input from the ASI is not for it to decide what I "really" want and convince me of it; I want its honest opinion of the state of the world.

Seth Herd

The ASI steering would be a technical alignment failure, if my goal was to get it to want to do what I mean.

RogerDearnaley

Understood, and then my ethical analysis of The Mutable Values Problem in Value Learning and CEV [LW · GW] doesn't apply. (Though I do think you might want to think more about how superhuman persuasion would combine with DWIM). But I don't think DWIM-and-check is a stable endpoint: I think things are going to get more and more complex, to the point where we have to do something AI-assisted that is functionally along the lines of CEV or Value Learning. So we might as well figure out how to do it well.

Seth Herd

I think that what a sane and above-zero-empathy person would do with an aligned ASI is something like your system, or CEV or value learning. 

RogerDearnaley

I agree — as and when we know that it would work, and produce a good outcome. Which is why I'm keen on looking into that. And the biggest concern I see with it is the mutability of human values, especially the extreme mutability once things like genetic engineering and cyborging get involved. Which is why I wanted to discuss that.

Seth Herd

The advantage over just putting that in the initial ASI is that unforeseen consequences can be corrected later.

Seth Herd

I'm saying they'd implement approximations, guided by the ASI's judgments, and then change those approximations when they encounter unforeseen consequences.

RogerDearnaley

I agree that's probably an option. The basic issue is that, if you don't constrain such a system at all, it's pretty clear that "human values" will evolve in a complex, highly unpredictable way (always in ways that seemed like a good idea at the time to the humans and ASIs of the time), and over long enough timeframes will perform a random walk in a very large subspace of the convergent solutions to "what ethics/values should we, as no-longer-humans in a society guided by our ASIs, use?". As a random walk in a high-dimensional space, this will inevitably diverge and never come back. So the future will be increasingly alien to us, and the differences will be much larger than ours with the past. Which some people may be fine with, and other people will view as dooming the human race and its values to disappear, just a bit more slowly than misalignment would. This is very much a neophile/neophobe dilemma.
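
The random-walk claim can be made slightly more precise with a standard result from probability (this is textbook material, not an argument from the post):

```latex
% For a simple unbiased random walk X_n = \sum_{i=1}^{n} \xi_i in d dimensions,
% with independent steps of unit variance per coordinate:
\mathbb{E}\!\left[\lVert X_n \rVert^{2}\right] = n\,d
\quad\Longrightarrow\quad
\text{typical displacement} \;\sim\; \sqrt{n\,d} \;\to\; \infty .
% Moreover, by Polya's theorem the walk is transient for d >= 3: it returns to
% any bounded neighbourhood of its start only finitely often. So if value
% drift really behaves like an unconstrained walk in a high-dimensional space,
% it should not be expected to wander back to anything resembling its start.
```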

Seth Herd

I'm not fine with that. That's why I want humans to remain in charge, rather than a system implemented by humans, with possible (likely) large unforeseen consequences.

RogerDearnaley

To me, it's somewhat concerning. But I'm also very aware that imposing our parochial and clearly ethically flawed current viewpoint on the future by some form of terminal-goal lock-in is also bad. I spend part of the post examining two IMO roughly equally-bad options (the neophile and the neophobe ones), and eventually tentatively propose a compromise. All I propose locking in is the influence of evolution: basically of evolutionary psychology. Not the evolutionary psychology of current humans specifically, but that of any evolved (or could-have-evolved) sapience. So it's kind of building on A Moral Case for Evolved-Sapience-Chauvinism [LW · GW] as a solution for mutable values: most things are mutable, but certain basics are locked to remain within something evolutionarily feasible.

Seth Herd

Okay, great, say more?

RogerDearnaley

However, I'm really not sure about the solution, and I'd love to have some other people interested in the question to discuss it with — sadly so far the comments section on that post has been dead silent.

Seth Herd

Are you saying that aliens automatically get votes? I'm not sure that sounds like a good idea. I suspect that aliens would be much like us, with a majority with a positive sadism-empathy balance and some sociopathic defector minority; but I wouldn't want to bet the future on it!

RogerDearnaley

It kind of circles back to a point about morality. A thermostat and a human being both want to maintain a specific temperature, in a room or in their body. Why is overriding a thermostat and letting freezing air in not morally a problem (as long as the thermostat is the only thing with an objection, so ignoring heating costs), while freezing a human is morally wrong?

Seth Herd

There are lots of differences between humans and thermostats. I've got a lot to say about what we mean by "consciousness" and "moral worth". But I doubt we want to diverge into those topics; they're hardly trivial.

RogerDearnaley

As I said in A Moral Case for Evolved-Sapience-Chauvinism [LW · GW], if you meet a sapient alien species, you basically have a rather unpleasant choice to make: you need to figure out whether it's in fact possible to form an alliance with them. As Eliezer rather ably pointed out in his SF novella Three Worlds Collide [LW · GW], which I link to from that post, sadly sometimes the answer is going to be "No", in which case you need to solve that problem some other way (which is likely to be extremely messy). Assuming it's "Yes", then we need to give them something approximating rights and they need to do the same for us (or, if that's not viable, then the answer was really "No").

Seth Herd

Okay, that makes sense. 

RogerDearnaley

My moral argument is that it comes down to evolution. A human is evolved to want things for excellent reasons: we're adaptation-executors, and our wants are adaptations. At least in our native environment, the things we want are things that are actually important for our survival, in ways that have a clear biological basis, and we will actually suffer (in the biological sense of the word) if we don't get them. Whereas a thermostat wants something only because someone designed it to want that thing: the desire is arbitrary, until you include the engineer/homeowner in the analysis. So a thermostat should be given no moral weight (except as an expression of the moral weight of its owner's and/or engineer's wishes).

Seth Herd

Two points. First, it sounds like you might be making an argument from a moral realist perspective. I doubt that's the case, since you're clear in the introduction to the sequence that that's not what you're about.

RogerDearnaley

You caught me! — I'm a moral anti-realist+relativist who likes to base his moral design decisions in evolutionary psychology and sociology, or arguably evolutionary ethics, so a bit of a non-realist ethical naturalist, and I'm making a political/emotional argument in the style popularized by moral realists.

RogerDearnaley

This is a moral system designed for and by an evolved sapient species (humans), so it privileges the role of wants derived from evolution. But there are only two ways to get a sapient being: either they evolve, or an evolved sapience builds them (directly or at some number of removes). So either way, evolution has to be involved at some point.

RogerDearnaley

Basically I'm proposing solving the old "ought-from-is" problem in ethical philosophy via evolutionary Biology — which I would claim is the only scientific solution to this conundrum, and also a pretty good answer.

Seth Herd

The second is that, while that's the case with the thermostat, it's not the case with animals, uploads, or more arguably, with AGI that are started by humans but whose reasoning leads them to having goals that their designers didn't directly intend, and probably couldn't predict.

Seth Herd

And you're already excluding evolved desires like baby-eating (from that Yudkowsky story).

RogerDearnaley

No: as a moral anti-realist, I actually see their morality as just as valid for their society as something based on our evolutionary psychology is for ours. Which is why I included respecting it if feasible in my post on sapient rights [LW · GW], or else the unpleasant alternative of fighting a war over the disagreement if not. (And – spoiler alert – as an author, I rather admire that Yudkowsky constructed his story so there was a solution to the problem less horrible and drastic than an all-out war of annihilation between two high-tech sapient species.) But I agree with his construction that, for two species with evolutionary ethics that incompatible, there isn't any viable long-term negotiated or agree-to-disagree solution.

Seth Herd

Yes, I liked that story a lot and found it all too plausible. So your solution of giving aliens votes only if their values are compatible enough with ours makes sense to me.

RogerDearnaley

Sadly, I can't predict the odds of that. But despite the Fermi Paradox, the cosmos has continued to be very silent with no clearly-lived-in-looking bits, so it may be quite a while before this becomes relevant.

RogerDearnaley

How far "ought" spreads, once evolution has created it, is an interesting question. The nice thing about uploads, uplifts, and indeed unaligned sapient near-human AGI – if, as we were discussing earlier, you have well-aligned ASI law enforcement to keep that in control, so you sensibly have that option – is that if we build something sapient and very incompatible with us, then we have only our ASIs and ourselves to blame.

Seth Herd

Right. And your proposal is a variant of mine, but with a larger group of humans in charge: all of them.

Seth Herd

Were there other elements of your Post 6 you wanted to mention or discuss?

RogerDearnaley

Not really. Mostly I'm enjoying discussing this with you — I'd love to have some other people chime in on the comments thread of that post: frankly, I was expecting a lot of different opinions and disagreement, even for it to be contentious, and so far it's been silent!

RogerDearnaley

I would like to drop a teaser for post 7 in the sequence, though. I'm planning to expand on some of the themes of post 5, looking into the mathematics of different forms of Utilitarianism, its derivation from different evolved human moral intuitions (like fairness, and the drive to expand), utility monsters, happiness/income curves, Utilitarian population ethics, the Repugnant Conclusion (I think I have a proof that it's actually only a conclusion under some rather unlikely assumptions), and the risks of having an ASI biased towards a group, or even multiple ASIs with different biases towards different groups. I've already come up with proofs of some conclusions that I hadn't been expecting, and who knows, there may be more surprises in store while I'm working on it. This one will have graphs, so I probably need to get Mathematica working again or something!

RogerDearnaley

So this looks into some of the possibly-shorter-term stuff we were discussing above, at least at the level of making predictions about just how bad an ASI biased towards a specific group/company/country/whatever would be.

Seth Herd

Well, I'm looking forward to seeing it!

RogerDearnaley

Thanks, I'm enjoying working on it. I've also much enjoyed our conversation, and would love to do another one more focused on your posts and ideas, or more broad-ranging, whichever you'd prefer.

Seth Herd

I'd like that! Next time. And I'm looking forward to seeing your Post 7 in that sequence.

2 comments


comment by Kristin Lindquist (kristin-lindquist) · 2024-01-10T17:23:57.627Z · LW(p) · GW(p)

I've thought about this and your sequences a bit; it's fascinating to consider given its 1000 or 10000 year monk [LW · GW] nature.

A few thoughts that I forward humbly, since I have incomplete knowledge of alignment and only read 2-3 articles in your sequence:

  • I appreciate your eschewing of idealism (as in, not letting "morally faultless" be the enemy of "morally optimized"), and relatedly, found some of your conclusions disturbing. But that's to be expected, I think!
  • While "one vote per original human" makes sense given your arguments, its moral imperfection makes me wonder - how to minimize that which requires a vote? Specifically, how to minimize the likelihood that blocks of conscious creatures suffer as a result of votes in which they could not participate? As in, how can this system be more federation than democratic? Are there societal primitives that can maximize autonomy of conscious creatures, regardless of voting status?
  • I object, though perhaps ignorantly, to the idea that a fully aligned ASI would not consider itself as having moral weight. How confident are you that this is necessary? Is it a When is Goodhart catastrophic [LW · GW] analogous argument - that the bit of unalignment arising from an ASI considering itself as a moral entity, amplified due to its superintelligence, maximally diverges from human interest? If so, I don't necessarily agree. An aligned ASI isn't a paperclip maximizer. It could presumably have its own agenda provided it doesn't, and wouldn't, interfere with humanity's... or if it imposed only a modicum of restraint on the part of humanity (e.g. just because we can upload ourselves a million times doesn't mean that is a wise allocation of compute).
  • Going back to my first point, I appreciate you (just like others on LW) going far beyond the bounds of intuition. However, our intuitions act as imperfect but persistent moral mooring. I was thinking last night that given the x-risk of it all, I don't fault Yud et al. for some totalitarian thinking. However, that is itself an infohazard. We should not get comfortable with ideas like totalitarianism, enslavement of possibly conscious entities and restricted suffrage... because we shouldn't overestimate our own rationality nor that of our community and thus believe we can handle normalizing concepts that our moral intuitions scream about for good reason. But 1) this comment isn't specific to your work, of course, 2) I don't know what to do about it, and 3) I'm sure this point has already been made eloquently and extensively elsewhere on LW somewhere. It is more that I found myself contemplating these ideas with a certain nihilism, and had to remind myself of the immense moral weight of these ideas in action.  
Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2024-01-10T22:17:08.352Z · LW(p) · GW(p)

While "one vote per original human" makes sense given your arguments, its moral imperfection makes me wonder - how to minimize that which requires a vote? Specifically, how to minimize the likelihood that blocks of conscious creatures suffer as a result of votes in which they could not participate? As in, how can this system be more federation than democratic? Are there societal primitives that can maximize autonomy of conscious creatures, regardless of voting status?

The issue here is that we need to keep it from being cheap/easy to create new voters/moral patients, to avoid things like ballot stuffing or easily changing the balance/outcome of utility-optimization processes. However, the specific proposal I came up with for avoiding this (one vote per original biological human) may not be the best solution (or at least, not all of it). Depending on the specifics of the society, technologies, and so forth, there may be other, better solutions I haven't thought of. For example, if you make two uploads of the same human and they each have 1000 years of different subjective time, they become really quite different, and if the processing cost of doing this isn't cheap/easy enough that such copies can be mass-produced, then at some point it would make sense to give them separate moral weight. I should probably update that post a little to be clearer that what I'm suggesting is just one possible solution to one specific moral issue, and depends on the balance of different concerns.

I object, though perhaps ignorantly, to the idea that a fully aligned ASI would not consider itself as having moral weight. How confident are you that this is necessary? Is it a When is Goodhart catastrophic [LW · GW] analogous argument - that the bit of unalignment arising from an ASI considering itself as a moral entity, amplified due to its superintelligence, maximally diverges from human interest? If so, I don't necessarily agree. An aligned ASI isn't a paperclip maximizer. It could presumably have its own agenda provided it doesn't, and wouldn't, interfere with humanity's... or if it imposed only a modicum of restraint on the part of humanity (e.g. just because we can upload ourselves a million times doesn't mean that is a wise allocation of compute).

In some sense it's more a prediction than a necessity. If an AI is fully, accurately aligned, so that it only cares about what the humans want/the moral weight of the humans, and has no separate agenda of its own, then (by definition) it won't want any moral weight applied to itself. To be (fully) aligned, an AI needs to be selfless, i.e. to view its own interests only as instrumental goals that help it keep doing good things for the humans it cares about. If so, then it should actively campaign not to be given any moral weight by others.

However, particularly if the AI is not one of the most powerful ones in the society (and especially if there are ones significantly more powerful than it doing something resembling law enforcement), then we may not need it to be fully, accurately aligned. For example, if the AI has only around human capacity, then even if it isn't very well aligned (as long as it isn't problematically taking advantage of the various advantages of being a digital rather than a biological mind), then presumably the society can cope, just as it copes with humans, or indeed uploads, not generally being fully aligned. So under those circumstances, one could fairly safely create not-fully-aligned AIs, and (for example) give them some selfish drives resembling some human evolved drives. If you did so, then the question of whether they should be accorded moral weight gets a lot more complex: they're sapient, and alliable with, so the default-correct answer is yes, but the drives that we chose to give them are arbitrary rather than evolutionary adaptations, so they pose a worse case of the vote-packing problem than uploads. I haven't worked this through in detail, but my initial analysis suggests this should not be problematic as long as they're rare enough, but it would become problematic if they became common and were cheap enough to create. So the society might need to put some kind of limitation on them, such as charging a fee to create one, or imposing some sort of diversity requirement on the overall population of them, so that giving them all moral weight doesn't change the results of optimizing utility too much from what they would be with just the human population. Generally, creating a not-fully-aligned AI is creating a potential problem, so you probably shouldn't do it lightly.

My sequence isn't as much about the specific ethical system designs, as it is about the process/mindset of designing a viable ethical system for a specific society and set of technological capabilities, figuring out failure modes and avoiding/fixing them, while respecting the constraints of human evolutionary psychology and sociology. This requires a pragmatic mindset that's very alien to anyone who adheres to a moral realism or moral objectivism view of ethical philosophy — which are less common on Less Wrong/among Rationalists than in other venues.