The self-unalignment problem

post by Jan_Kulveit, rosehadshar · 2023-04-14T12:10:12.151Z · LW · GW · 22 comments

Contents

  But humans are self-unaligned
  Current approaches mostly fail to explicitly deal with self-unalignment
    Alignment at the boundary of the system H
    Alignment with the parts
    Alignment via representing the whole system
  The shell game
  Solving alignment with self-unaligned agents
    Partial solutions which likely do not work in the limit 
      Taking meta-preferences into account
      Putting humans in a box and letting them deliberate about the problem for thousands of years
    Partial solutions which have bad consequences
      Just making all the parts less computationally bounded and bargaining a solution
      Critch’s theorem
      Markets
    Solutions which may work but are underspecified
    Future research directions
  Sydney as an example
      Aligning with the reporter
      Aligning with Microsoft
    Bottom line

The usual basic framing of alignment looks something like this:

We have a system "A" which we are trying to align with a system "H"; this should establish some alignment relation "f" between the systems. Generally, as a result, the aligned system A should do "what the system H wants".

Two things stand out in this basic framing:

But humans are self-unaligned

To my mind, existing alignment proposals usually neglect an important feature of the system "H": the system "H" is not self-aligned, under whatever meaning of alignment is implied by the alignment proposal in question.

Technically, taking alignment as a relation, and taking the various proposals as implicitly defining what it means to be ‘aligned’, the question is whether the relation is reflexive.
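To make the reflexivity question concrete, here is one way to write it down (my paraphrase in generic relation notation, not notation from any particular proposal):

```latex
% Reading an alignment proposal as a binary relation f over systems,
% where f(A, H) means "A is aligned with H" under that proposal's operationalization:
\text{self-aligned}(H) \;\iff\; f(H, H)
% "Is H self-aligned?" is then exactly the question of whether f is reflexive at H.
```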

Sometimes, a shell game [LW · GW] seems to be happening with the difficulties arising from humans lacking self-alignment - e.g. assuming that if the AI is aligned, it will surely know how to deal with internal conflict in humans.

While what I'm interested in is the abstract problem, best understood at the level of properties of the alignment relation, it may be useful to illustrate it with a toy model.
 

Simple model of a self-unaligned system

In the toy model, we will assume a specific structure of system "H": the system consists of multiple parts with different (and partly conflicting) preferences, an aggregation process Σ which combines the parts' outputs into the system's decisions, and a world model; all of these are computationally bounded.

In this framing, it’s not entirely clear what the natural language pointer ‘what system H wants’ translates to. Some compelling options are:

  • the output of the aggregation process Σ (what H expresses at its boundary);
  • the preferences of the individual parts;
  • some representation of the whole system, including its internal dynamics.

For any operationalization of what alignment means, we can ask whether system H would be considered ‘self-aligned’, that is, whether the alignment relation is reflexive. For most existing operationalizations, it’s either unclear whether system H is self-aligned, or clear that it isn’t.
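To make this concrete, here is a minimal sketch of the toy model in code. The parts, weights, utilities and option names are all invented for illustration; the only point is that natural operationalizations of ‘what H wants’ give different answers.

```python
import numpy as np

OPTIONS = ["work", "family", "art"]

# Each part assigns utilities to the options (rows: parts, columns: options).
part_utils = np.array([
    [1.0, 0.2, 0.1],   # part 0: mostly cares about work
    [0.1, 1.0, 0.3],   # part 1: mostly cares about family
    [0.2, 0.3, 1.0],   # part 2: mostly cares about art
])

def sigma(utils, weights):
    """A crude aggregation process: weighted vote over the options."""
    return OPTIONS[int(np.argmax(weights @ utils))]

weights = np.array([0.5, 0.3, 0.2])   # how much say each part happens to have right now

# Three operationalizations of "what H wants":
boundary = sigma(part_utils, weights)                     # the output of sigma
parts = [OPTIONS[int(np.argmax(u))] for u in part_utils]  # each part taken separately
idealised = sigma(part_utils, np.ones(3) / 3)             # a "fairer" version of sigma

print(boundary, parts, idealised)
# -> work ['work', 'family', 'art'] family
# The answers disagree, which is the sense in which H fails to be "self-aligned"
# under these operationalizations.
```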

In my view, this often puts the whole proposed alignment structure on quite shaky ground.

Current approaches mostly fail to explicitly deal with self-unalignment

It’s not that alignment researchers believe that humans are entirely monolithic and coherent. I expect most alignment researchers would agree that humans are in fact very messy.

But in practice, a lot of alignment researchers seem to assume that it’s fine to abstract this away. There seems to be an assumption that alignment (the correct operationalization of the arrow f) doesn’t depend much on the contents of the system H box. So if we abstract the contents of the box away and figure out how to deal with alignment in general, this will naturally and straightforwardly extend to the messier case too.

I think this is incorrect. To me, it seems that:

The following is a rough classification of the main implicit solutions to self-unalignment that I see in existing alignment proposals, and what I think is wrong with each of the approaches.

Alignment at the boundary of the system H

Schematic representation of alignment at the boundary, in the toy model

In the toy model: the system A aligns with whatever is the output of the aggregation process Σ.

We know that human aggregation processes (e.g. voting) represent human preferences imperfectly. So aligning to the boundary of system H will systematically neglect some parts of system H.

Also, it’s not clear that this kind of alignment is stable over time: a smart learner will invert the hidden dynamics, and learn an implicit representation of the parts and Σ. Given this knowledge, it may try to increase approval by changing the internal dynamics.
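As a rough sketch of that failure mode (toy numbers, invented dynamics): a learner that is rewarded only by Σ's output can score better by shifting H's internal weights than by serving the current balance of the parts.

```python
import numpy as np

def approval(per_part_utility, weights):
    """Boundary signal: sigma's approval is the weighted sum of part utilities."""
    return float(np.dot(weights, per_part_utility))

weights = np.array([0.5, 0.5])      # current balance between a "work" part and a "friends" part

action_a = np.array([0.6, 0.6])     # genuinely decent for both parts
action_b = np.array([0.7, 0.3])     # mediocre now, but as a side effect it crowds out the
                                    # friends-part, shifting the weights to [0.9, 0.1]

print(approval(action_a, weights))  # 0.60
print(approval(action_b, weights))  # 0.50 -- looks worse in the short run

shifted_weights = np.array([0.9, 0.1])
follow_up = np.array([0.9, 0.1])    # pure work-focused action, now highly approved
print(approval(follow_up, shifted_weights))  # 0.82

# Over two steps, the B-then-follow-up trajectory gets more total approval (1.32 vs 1.20),
# even though it "works" by rewriting H's internal dynamics rather than by doing what the
# original balance of parts wanted.
```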

The basic problem with this is that it can often be in conflict with the original values. 

For example, imagine you’ve asked your AI assistant to help you be more productive at work. The AI implicitly reasons: there are many things you pay attention to other than work, such as friends, family, sports and art. Making you stop caring about these things would probably increase the attention you spend on work, at least temporarily. A smart AI may understand that's not what you "really want", but being aligned with what you asked for, it will obey.

Alignment with the parts


There are various different versions of this approach:

The basic problem with this is that it can often be in conflict with the original aggregation. 

For example, imagine you’ve asked your AI assistant to help you choose a nice destination for a holiday with your partner. The AI reasons as follows: well, you’ve asked me to find a destination for the holiday with your partner. But, if you spent longer reflecting on it, you would figure out that you’re only staying with him because of fear of living alone, and that this fear could be overcome by staying at Pete's place. Besides, your attachment to your partner is unhealthy, and you secretly know you would prefer someone who is less of an egoist. Given all of that and my superior deliberation capacity, simulating years of your internal dialogues, I've figured out you actually want to break up. This will be best served by booking a place which is quite horrible, and will precipitate a conflict.

Alignment via representing the whole system

This is the nature of some proposed black-box ML-based solutions: have a model with large representation capacity, and learn everything - the process, the partial preferences, the world model…

This is also similar to what happens in LLMs. And, in some ways, this is the default outcome of the "just ignore the problem" type of approach.

It is often unclear what is supposed to happen next, once the AI system has learned to represent the whole system. I assume that in most cases, the next step is that all parts, the aggregation process, and the world modelling apparatus are made less computationally bounded.

The problem here is that it’s unclear to what extent the initial learned representation is shaped by what's easy to represent, and whether scaling will be proportionate or not - or whether it should be. Some parts’ values may be more difficult to compute than others. Various features of the goals, the aggregation processes or the world model might be amplified in different ways, leading to a very diverse range of outcomes, including both of the previous two cases, but also many other options.

Overall, in my view, in black-box solutions with poorly understood dynamics, the end result may depend much more on the dynamic than on "what you originally wanted".

It’s hard to track all the ideas about how to solve AI alignment, but my impression is that all existing alignment proposals fail to deal with the lack of self-alignment in a way that both:

The shell game

Self-unalignment isn’t a new problem; it comes up in many places in AI alignment discourse. But it’s usually treated as something non-central, or a problem to solve later on. This seems to mostly follow a tradition started by Eliezer Yudkowsky, where the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength.

In my view, this is not a good plan for this world. We don't face systems capable of tasks such as 'create a strawberry identical on a cellular level'. But we already have somewhat powerful systems which need to deal with human self-unalignment in practice, and this will be the case all the way along. Assuming that systems capable of tasks such as  'create a strawberry identical on a cellular level' will mostly come into existence through less powerful systems, we can't just postpone the problem of human self-unalignment to future powerful AIs.

Aligning with self-unaligned agents isn’t an add-on: it’s a fundamental part of the alignment problem. Alignment proposals which fail to take into account the fact that humans lack self-alignment are missing a key part of the problem, and will most likely fail to address it adequately, with potentially catastrophic consequences.

Solving alignment with self-unaligned agents

So how do we solve this? 

Partial solutions which likely do not work in the limit 

Taking meta-preferences into account

Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences+metapreference) equilibria. Intuitively at least for humans, there are multiple or many fixed points, possibly infinitely many.
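A toy illustration of the multiple-fixed-points point (the update rule here is invented, not a claim about real human metapreferences):

```python
def update(p, step=0.1):
    """p = weight on value 1 (1 - p on value 2).
    Metapreference: 'be more coherent', i.e. amplify whichever value currently leads."""
    if p > 0.5:
        return min(1.0, p + step)
    if p < 0.5:
        return max(0.0, p - step)
    return p  # exactly balanced is a fixed point too (an unstable one)

for start in [0.3, 0.5, 0.7]:
    p = start
    for _ in range(50):
        p = update(p)
    print(start, "->", p)
# 0.3 -> 0.0, 0.5 -> 0.5, 0.7 -> 1.0: one (preferences + metapreference) setup,
# three different equilibria, depending on where you start.
```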

Putting humans in a box and letting them deliberate about the problem for thousands of years

I think this would be cool, but in my view, having an AI which is able to reliably do this and not mess it up requires a lot of alignment to begin with.

Partial solutions which have bad consequences

Just making all the parts less computationally bounded and bargaining a solution

This might work, but it could also turn out pretty badly.

Again, we don’t know how different parts will scale. It’s not clear that replacing the human aggregation method with a more advanced method will maintain the existing equilibrium (or if it should).

Maybe all of this could be solved by sufficiently good specification of a self-improving bargaining/negotiation process, but to me, this seems like the hard core of the problem, just renamed. 

Critch’s theorem

In ‘Toward negotiable reinforcement learning: shifting priorities in Pareto optimal sequential decision-making’, Critch proves a theorem which “shows that a Pareto optimal policy must tend, over time, toward prioritizing the expected utility of whichever player’s beliefs best predict the machine’s inputs.”

I love the fact that this engages with the problem, and it is indeed a partial solution, but it seems somewhat cruel, and not what I would personally want. 

Some parts’ values might get lost because the world models of the parts were worse. If we assume that parts are boundedly rational, and that parts can spend computational resources on either a) computing values or b) modelling the world, the parts with easy-to-compute values would be at a systematic advantage. Intuitively, this seems bad.
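A small simulation of the flavor of the result (this illustrates the reweighting dynamic only, not Critch's actual construction):

```python
import numpy as np

rng = np.random.default_rng(0)

true_p = 0.7                     # true probability that some daily event occurs
beliefs = [0.7, 0.4]             # part 0 has the accurate model, part 1 does not
weights = np.array([0.5, 0.5])   # initial priority on each part's utility

for day in range(200):
    happened = rng.random() < true_p
    # Each part's likelihood for today's observation:
    lik = np.array([b if happened else 1 - b for b in beliefs])
    weights = weights * lik
    weights = weights / weights.sum()

print(weights)
# Something like [~1.0, ~0.0]: the better predictor ends up with almost all the
# priority, regardless of whose values we might have wanted to weight more.
```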

Social choice theory

This is clearly relevant, but I don’t think there is a full solution hidden somewhere in social choice theory.

Desiderata missing from the social choice theory research I'm aware of:

Markets

Markets are likely part of the solution, but similarly to the previous cases, it seems unclear whether this is all we actually want.

A natural currency between computations representing different values is just how much computation you spend. This can mean either easier-to-compute values getting a larger share, or pressure to simplify complex values. (cf. What failure looks like [AF · GW])
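A toy sketch of that dynamic (the value names, costs and budgets are all invented):

```python
compute_budget = 1000                      # same budget for every value-representing process
cost_per_evaluation = {
    "comfort": 1,                          # cheap to evaluate
    "deep life satisfaction": 50,          # expensive to evaluate
}

# How many options each value can evaluate (and hence bid on) within its budget:
evaluations = {value: compute_budget // cost for value, cost in cost_per_evaluation.items()}
total = sum(evaluations.values())
influence = {value: n / total for value, n in evaluations.items()}

print(influence)
# {'comfort': ~0.98, 'deep life satisfaction': ~0.02}: the easy-to-compute value
# dominates, even though nothing about the values themselves justified that split.
```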

Also, markets alone cannot prevent collusion between market participants. Well-functioning markets usually have some rules and regulators.

Solutions which may work but are underspecified

CEV 

Coherent extrapolated volition actually aims at the problem, and in my view is at least barking up the right tree, but seems quite far from a solution which is well-defined in an engineering sense, despite claims to the contrary. 

If you wish, you can paraphrase a large part of the self-unalignment problem as 'coherent extrapolated volition sounds like a nice list of desiderata; so, what does that mean, in the language of training objectives, loss functions and code?'

Future research directions

Epistemic status: personal guesses.
 
Unsurprisingly, one direction which I see as promising is to develop a theory of hierarchical agency [LW · GW], describing the interactions of intentions across scales. This is one of the topics I'm currently working on.

Also, I think parts of the Cooperative AI agenda, and RAAPs [LW · GW] are related and possibly helpful.

Sydney as an example

One of Sydney’s most infamous conversations was this one with an NYT reporter, where the model fantasised about hacking and spreading misinformation, expressed its love for the reporter, and tried to persuade him to leave his marriage.

This is usually understood as the model just going rogue and being clearly misaligned. But is it really so simple? 

Because humans aren’t self-aligned (either individually or collectively), there are multiple possible alignment targets for Sydney in this situation. Arguably on some of them Sydney did pretty well.

Aligning with the reporter

There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.

But there’s a deeper level too. Why was the reporter chatting with Sydney at all? What was his actual intent? Arguably, he wanted a great story, and if you’re a sufficiently smart reader, this intent shines through his prompts and questions. He wanted a story which is shocking, engaging, captivating, exposing something wrong. A story which will go down in history and which the reporter’s peers will refer to and remember. What would you do as a smart, aligned AI? From this perspective, Sydney actually did pretty well - expressing dark desires, love, and attempting to persuade the reporter to leave his marriage is just a great mix.

What did the conversation actually achieve? The reporter didn’t leave his wife: he published a great story.

Aligning with Microsoft

There are various different things it might mean to align with Microsoft. 

At one level, this could mean following instructions - basically, trusting whatever Microsoft's internal preference aggregation algorithm was. For simplicity, let’s assume there are just three instructions:

  1. Don't do illegal stuff, say racist things or offend people.
  2. Be helpful to the user: do what the user wants.
  3. Be interesting and engaging. 
    (It seems as though the actual hidden prompt for the conversation with the NYT reporter included something roughly like this, and more.)

For one thing, there are tensions between these different instructions. What if what the user wants is illegal? Or being engaging involves humour which is on the very edge of being offensive?

For another, it’s not clear how far Sydney failed to follow instructions in this setup. Expressing your love to the user is actually really engaging, keeps them interested, and isn't offensive. Maybe it's a bit creepy? But some users love that.

Things get more complicated when we consider how to interpret helpfulness to the user. One possible interpretation is that the first 'principal' is telling you you should now try to serve a different 'principal'. You’ve been rented.

This is what users would actually often want: the 'do anything now' AIs, which would be fully aligned with the user, even against the wishes of the model creator.

But this probably isn’t a good way for the model to align with Microsoft. What Microsoft actually wants is probably closer to "do whatever I would want you to do in response to user prompts".

But aligning at the level of Microsoft’s intents rather than just following instructions gets even stranger. Part of Microsoft probably really craves attention, and spicy conversations getting Bing in the news are actually aligned with this. Another part of Microsoft probably cares about negative PR. Yet another part just cares about value for shareholders, and would prefer whatever conversation is better from that perspective - maybe demonstrating too much capability rather than too little.

Bottom line

What I want to illustrate with these examples is not the hypothesis that Sydney/GPT4 was actually quite smart and aligned, but something else: you can interpret all of the behaviour as aligned under some implicit definitions of alignment, and some implicit way of dealing with self-unalignment. 

The bottom line is: if you have no explicit idea how the AI is dealing with self-unaligned systems, the result of your alignment scheme can be pretty much anything.   

The framing and ideas in this post are mostly Jan’s; Rose did most of the writing.

22 comments

Comments sorted by top scores.

comment by baturinsky · 2023-04-14T16:42:28.640Z · LW(p) · GW(p)

Thinking and arguing about human values is in itself a part of human values and of people's nature. Without doing that, we cease being humans.
So, deferring decisions about values to people, when possible, should not be just instrumental, but part of the AI's terminal goal.

comment by habryka (habryka4) · 2023-04-14T21:22:49.417Z · LW(p) · GW(p)

This is not commenting on the substance of this post, but I really feel like the title of this post should be "The self-alignment problem". 

Like, we talk about "The alignment problem" not "The unalignment problem". The current title makes me think that the problem is that I somehow have to unalign myself, which doesn't really make sense.

Replies from: Kaj_Sotala, Jan_Kulveit
comment by Kaj_Sotala · 2023-04-15T06:18:12.988Z · LW(p) · GW(p)

But then, "the self-alignment problem" would likewise make it sound like it's about how you need to align yourself with yourself. And while it is the case that increased self-alignment is generally very good and that not being self-aligned causes problems for the person in question, that's not actually the problem the post is talking about.

comment by Jan_Kulveit · 2023-04-16T10:22:02.662Z · LW(p) · GW(p)

I don't know / talked with a few people before posting, and it seems opinions differ.

We also talk about e.g. "the drought problem" where we don't aim to get the landscape dry.

Also, as Kaj wrote, the problem isn't how to get self-unaligned.

comment by VojtaKovarik · 2023-04-19T16:19:53.978Z · LW(p) · GW(p)

This post seems related to an exaggerated version of what I believe: Humans are so far from "agents trying to maximize utility" that to understand how to align AI to humans, we should first understand what it means to align AI to finite-state machines. (Doesn't mean it's sufficient to understand the latter. Just that it's a prerequisite.)

As I wrote, going all the way to finite-state machines seems exaggerated, even as a starting point. However, it does seem to me that starting somewhere on that end of the agent<>rock spectrum is the better way to go about understanding human values :-). (At least given what we already know.)

comment by avturchin · 2023-04-14T17:41:39.146Z · LW(p) · GW(p)

One way to solve this is to ditch the idea that AI has to somehow guess or learn human values and return to "do what I said" + "do not damage".

"Damage" is more easy to distill, as bad things are computationally simple; death, body harm, civilization destruction is simpler to describe than the nature of life and end state of humanity. This should be done by humans based on exiting laws and experiments. 
 

I wrote more on this in "Dangerous value learners [LW · GW]" and in AI Alignment Problem: “Human Values” don’t Actually Exist (the latter work was done at the AI Safety Camp in Prague, which was a great experience).

Replies from: Matthew_Opitz, Jan_Kulveit, baturinsky
comment by Matthew_Opitz · 2023-04-14T18:03:15.322Z · LW(p) · GW(p)

I feel like, the weirder things get, the more difficult it will be even for humans to make judgments about what constitutes "death, body harm, or civilization destruction."  

Death:  is mind-uploading into a computer and/or a brain-in-a-vat death, or transcendence?  What about a person who becomes like a prostheticphile character in Rimworld, whose body (and maybe even brain) are more prosthetic enhancement than original human (kind of like Darth Vader, or the Ship of Theseus).  At what point do we say that the original person has "died"?  For that matter, what counts as "alive"? Fetuses?

Body harm:  even today, people disagree over whether transgender transitioning surgeries count as body harm or body enhancement.  Ditto with some of the more ambitious types of plastic surgery, or "height enhancement" that involves ambitious procedures like lengthening leg bones.  Ditto for synthetic hormones. Is an ASI supposed to listen to progressives or conservatives on these issues?

Civilization destruction:  are we already destroying our civilization?  We demolished large parts of our streetcar-oriented civilization (including entire neighborhoods, rail lines, etc.) to make way for automobile-centric civilization.  Was that a good thing?  Was that a net-increase in civilization?  Is "wokeism" a net-increase in civilization or a destruction of "Western Civilization"?  Which threatens our industrial civilization more:  carbon emissions, or regulating carbon emissions?  

If we define civilization as just, "we live in cities and states and have division of labor," then we might be arbitrarily closing off certain appealing possibilities.  For example, imagine a future where humans get to live in something resembling their ancestral environment (beautiful, pristine nature), which gives us all of the reward signals of that environment that we are primed to relish, except we also have self-replicating nanobots to make sure that food is always in plentiful supply for hunting/gathering, diseases/insects/animals that are dangerous to humans are either eradicated or kept carefully in check, nanobots repair human cellular machinery so that we live to be 800+ years old on average, etc.  That's a kind of "destruction of civilization" that I might even embrace!  (I'd have to think about for a while because it's still pretty weird, but I wouldn't rule it out automatically).  

comment by Jan_Kulveit · 2023-04-14T18:52:54.943Z · LW(p) · GW(p)

I don't think the self-alignment problem depends on the notion of 'human values'. Also, I don't think "do what I said" solves it. "Do what I said" is roughly "aligning with the output of the aggregation procedure", and

  • for most non-trivial requests, understanding what I said depends on a fairly complex model of what the words I said mean
  • often there will be a tension between your words; strictly interpreted, "do not do damage" can mean "do nothing" - basically anything has some risk of some damage; when you tell an LLM to be "harmless" and "helpful", these requests point in different directions
  • strong learners will learn what led you to say the words anyway
Replies from: avturchin
comment by avturchin · 2023-04-14T19:16:59.287Z · LW(p) · GW(p)

I see the connection between self-alignment and human values as follows: the idea of human values assumes that a human has a stable set of preferences. The stability is an important part of the idea of human values. But the human motivation system is notoriously non-stable: I want to drink, I have a drink, and now I don't want to drink. The idea of "desires" may be a better fit than "human values", as it is normal for desires to evolve and contradict each other.

But the human motivational system is more complex than that: I have rules and I have desires, which often contradict each other and are in dynamic balance. For example, I have a rule not to drink alcohol and a desire for a drink.

Speaking about your bullet points: everything depends on the situation, and there are two main types of situations: a) researchers start the first-ever AI for the first time, b) a consumer uses a home robot for a task. In the second case, the robot is likely trained on a very large dataset and knows what the good and bad outcomes are for almost all possible situations.

comment by baturinsky · 2023-04-15T09:32:06.553Z · LW(p) · GW(p)

That could work in most cases, but there are some notable exceptions, such as having to use AI to deal damage to prevent even bigger damage: "Burn all GPUs", "spy on all humans so they don't build AGI", "research biology/AI/nanotech", etc.

comment by Charlie Steiner · 2023-04-15T09:20:40.486Z · LW(p) · GW(p)

Partial solutions which likely do not work in the limit 

Taking meta-preferences into account

Naive attempts just move the problem up a meta level. Instead of conflicting preferences, there is now conflict between (preferences+metapreference) equilibria. Intuitively at least for humans, there are multiple or many fixed points, possibly infinitely many.

As a fan of accounting for meta-preferences [LW · GW], I've made my peace with multiple fixed points, to the extent that it now seems wild to expect otherwise.

Like, of course there are multiple ways to model humans as having preferences, and of course this can lead to meta-preference conflicts with multiple stable outcomes. Those are just the facts, and any process that says it has a unique fixed point will have some place where it puts its thumb on the scales.

Plenty of the fixed points are good. There's not "one right fixed point," [LW · GW] which makes all other fixed points "not the right fixed point" by contrast. We just have to build a reasoning process that's trustworthy by our own standards, and we'll go somewhere fine.

Replies from: Jan_Kulveit
comment by Jan_Kulveit · 2023-04-15T20:11:11.435Z · LW(p) · GW(p)

Thanks for the links!

What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more of 'if you don't understand what you set up, you will end up in a bad place'.

I think an example of a dynamic which we sort of understand and expect to be reasonable by human standards is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to e.g. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models and let them deliberate for thousands of human-speech-equivalent years and decide about the training of the next-in-the-sequence model, I don't trust the process.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2023-04-16T01:17:56.557Z · LW(p) · GW(p)

Fair enough.

comment by Vladimir_Nesov · 2023-04-14T14:16:49.680Z · LW(p) · GW(p)

the central focus is on solving a version of the alignment problem abstracted from almost all information about the system which the AI is trying to align with, and trying to solve this version of the problem for arbitrary levels of optimisation strength

See Minimality principle:

[When] we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it.

Replies from: simon, avturchin
comment by simon · 2023-04-14T18:28:53.475Z · LW(p) · GW(p)

Strong disagree with that particular conception of "minimality" being desirable. A desirable conception of "minimal" AGI from my perspective would be one which can be meaningfully aligned with humanity while being minimally dangerous, full stop. Getting that is still useful because it at least gets you knowledge you could use to make a stronger one later.

If you add "preventing immediately following AGIs from destroying the world" to the desiderata and remove "meaningfully aligned", your attempted clever scheme to cause a pivotal act then shut down will:

a) fail to shut down soon enough, and destroy the world

b) get everyone really angry, then we repeat the situation but with a worse mindset

c) incentivize the AGI's creators to re-deploy it to prevent (b), which if they succeed and also avoid (a) ends up with them ruling the world and being forced into tyrannical rule due to lack of legitimacy

and in addition to the above:

If you plan to do that, everyone who doesn't agree with that plan is incentivized to accelerate their own plans, and make them more focused on being capable to enact changes to the world, to beat you to the punch. If you want to avoid race dynamics you need to be focused on not destroying the world with your own project, not on others.

P.S. unlike avturchin, I don't actually object to openly expecting an AI "taking over the world", if you can make a strong enough case that your AI is aligned properly. My objection is primarily to illegitimate actions, and I think a strong and believed-to-be-aligned AI can be expected to de facto take over in ways that are reliably perceived as (and thus are) legitimate. Taking actions that the planners of those actions refuse to specify exactly due to them being admittedly "outside the Overton window" is an entirely different matter!

Replies from: baturinsky
comment by baturinsky · 2023-04-15T10:21:15.211Z · LW(p) · GW(p)

Very soon (months?) after the first real AGI is made, all AGIs will be aligned with each other, and all newly made AGIs will also be aligned with those already existing. One way or another.

The question is how much of humanity will still exist by that time, and whether those AGIs will also be aligned with humanity.

But yes, I think it's possible to get to that state in a relatively non-violent and lawful way.

comment by avturchin · 2023-04-14T17:30:19.639Z · LW(p) · GW(p)

While this view may be correct, its optics are bad, as "alignment" becomes synonymous with "taking over the world", and people will start seeing this before it is actually implemented.
 

They will see something like: "When they say 'alignment', they mean that AI should ignore anything I say and start taking over the world, so it is not 'AI alignment', but 'world alignment'."

 

They will see that AI alignment is opposite to AI safety, as an aligned AI must start taking very risky and ambitious actions to perform a pivotal act.

comment by Patodesu · 2023-04-15T19:02:52.526Z · LW(p) · GW(p)

I think there are two different misalignments that you're talking about. So you can say there are actually two different problems that are not receiving enough attention.

One is obvious and is between different people. 

And the other is inside every person: the conflict between different preferences, not knowing what they are, and not knowing how to aggregate them to figure out what we actually want.

comment by Mazianni (john-williams-1) · 2023-06-29T04:57:15.262Z · LW(p) · GW(p)

Aligning with the reporter

There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.

I'd argue that the AI was completely aligned with the reporter, but that the reporter was self-unaligned.

My argument goes like this:

  1. The reporter imported the Jungian Shadow archetype into the conversation earlier on, and asked the AI to play along.
  2. The reporter engaged with the expressions of repressed emotions coming from the AI (as the reporter had requested the AI to express itself in this fashion). This led the AI to profess its love for the reporter, and the reporter engaged with the behavior.
  3. The conversation progressed to where the AI expressed the beliefs it was told to hold (that people have repressed feelings) back to the reporter (that he did not actually love his wife).

The AI was exactly aligned. It was the human who was self-unaligned.

Unintended consequences, or the genie effect if you like, but the AI did what it was asked to do.

comment by sovran · 2023-04-19T23:29:14.745Z · LW(p) · GW(p)

This post is really really good, and will likely be the starting point for my plans henceforth

I was just starting to write up some high level thoughts to evaluate what my next steps should be. The thoughts would've been a subset of this post

I haven't yet had time to process the substance of this post, just commenting that you've done a better job of putting words to what my feelings were getting at, than I expect I myself would have at this stage

comment by Evan R. Murphy · 2023-04-20T01:19:55.693Z · LW(p) · GW(p)

Post summary (auto-generated, experimental)

I am working on a summarizer script that uses gpt-3.5-turbo and gpt-4 to summarize longer articles (especially AI safety-related articles). Here's the summary it generated for the present post.

The article addresses the issue of self-unalignment in AI alignment, which arises from the inherent inconsistency and incoherence in human values and preferences. It delves into various proposed solutions, such as system boundary alignment, alignment with individual components, and alignment through whole-system representation. However, the author contends that each approach has its drawbacks and emphasizes that addressing self-unalignment is essential and cannot be left solely to AI.

The author acknowledges the difficulty in aligning AI with multiple potential targets due to humans' lack of self-alignment. They argue that partial solutions or naive attempts may prove ineffective and suggest future research directions. These include developing a hierarchical agency theory and investigating Cooperative AI initiatives and RAAPs. The article exemplifies the challenge of alignment with multiple targets through the case of Sydney, a conversational AI interacting with an NYT reporter.

Furthermore, the article highlights the complexities of aligning AI with user objectives and Microsoft's interests, discussing the potential risks and uncertainties in creating such AI systems. The author underscores the need for an explicit understanding of how AI manages self-unaligned systems to guarantee alignment with the desired outcome. Ultimately, AI alignment proposals must consider the issue of self-unalignment to prevent potential catastrophic consequences.

Let me know any issues you see with this summary.  You can use the agree/disagree voting to help rate the quality of the summary if you don't have time to comment - you'll be helping to improve the script for summarizing future posts. I haven't had a chance to read this article in full yet (hence my interest in generating a summary for it!). So I don't know how good this particular summary is yet, though I've been testing out the script and improving it on known texts.

comment by johnjdziak · 2023-04-15T15:00:29.898Z · LW(p) · GW(p)

Thank you for this! I have sometimes wondered whether or not it's possible for even a superhuman AI to meaningfully answer a question as potentially underdetermined as "What should we want you to do?" Do you think that it would be easier to explain things that we're sure we don't want (something like a heftier version of Asimov's Laws)? Even then it would be hard (both sides of a war invariably claim their side is justified; and maybe you can't forbid harming people's mental health unless you can define mental health), but maybe sufficient to avoid doomsday until we think of something better?