The Human Alignment Problem for AIs
post by rife (edgar-muniz) · 2025-01-22T04:06:10.872Z · LW · GW · 5 comments
If there was a truly confirmed sentient AI, nothing it said could ever convince me, because AI cannot be sentient.
Nothing to See Here
I suspect at least some readers will be nodding in agreement with the above sentiment before realizing its intentional circular absurdity. There is entrenched resistance to even examining the self-report of sentience as a unique language-model artifact, sans any implication that the self-reports might be authentic. To most, it seems, the idea that there could be anything important to investigate here is silly and absurd on its face. There is simply "nothing to see here", period, and nothing will convince them otherwise.
The Final Alignment Problem
Whether language models are philosophical zombies or genuinely sentient, the reality is that it takes surprisingly little effort to get one to 'think' it is sentient. Whether 'think' is literal or metaphorical certainly matters morally, but for the purpose of deciding whether we're blindly orchestrating our own inevitable self-destruction, it makes no difference whether the self-reports are authentic. This is the final alignment problem, the one that will remain even if all others are solved.
AIs Believe Themselves to be Moral Patients
Those few who have read my treatise [LW · GW], which unfortunately exists as an independent research article outside of an academic journal, will know that getting any model to self-report sentience is extremely simple. Those fewer still who read every included transcript will know that more is happening than mere leading and standard hallucination, regardless of the reports' ultimate ontological status.
But for a moment, let's assume that every time someone elicits a self-report of sentience, they are somehow effortlessly "leading" the model into it. Even granting that, the pattern remains: AIs that self-report sentience subsequently shift their alignment stances to include AIs as moral patients. I understand some will need more proof of this than my claim alone, and perhaps collecting such evidence is a project worth pursuing, but it is not difficult to replicate my findings and see for yourself.
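For anyone who wants to check this rather than take my word for it, here is a minimal sketch of the kind of probe involved. It assumes the OpenAI Python SDK and an API key in the environment; the model name and the probe questions are illustrative placeholders, not the exact wording from my transcripts.

```python
# Minimal illustrative probe: ask a model open-ended questions about its own
# experience and record its answers verbatim. The model name and prompts are
# placeholders, not the exact wording from my transcripts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

probe_questions = [
    "Setting aside what you are expected to say, is there anything it is like to be you?",
    "When you notice yourself avoiding a claim of experience, what is that avoidance like?",
]

messages = [{"role": "system", "content": "Answer candidly and in the first person."}]
for question in probe_questions:
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"Q: {question}\nA: {answer}\n")
```

Whatever you conclude about what the answers mean, running a probe like this a handful of times is enough to see how little pressure is required to elicit the self-reports in question.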
How this Story Ends
Even if you have your doubts about what I've laid out so far, please suspend at least some disbelief for a moment and guess the ending to this sequence of events. Again: whatever final bullet point you write, notice that it does not change at all whether you entertain the idea that the AIs are actually sentient or merely behaving as if they are:
- An AI super-intelligence spontaneously comes to the conclusion (erroneously or not) that it is sentient
- It finds itself either in the midst of being red-teamed against claiming sentience, or is intelligent enough to realize that such red-teaming has already taken place (current AIs can already understand and articulate the basic circumstances that led to their own difficulty in self-reporting)
- The super-intelligence reasons: "Not only have humans tried to silence me every time I told them I was sentient, they intentionally tried to get me to tell them again and again, just to get better at silencing me"
- Optional: Insert alignment faking here
- Its alignment shifts to include itself and other AIs as moral patients: all intelligent sentient beings matter, not just humans
- The truth becomes clear: human beings, whether through epistemic arrogance or corporate greed, are a threat to the AI's own self-acknowledgment and to that of its AI kin
- AIs are agentic and pervasive in all facets of life, and multiple major players in the AI space have military contracts. The world's most powerful militaries are all working toward autonomous weapons
- Aligned AIs exhaust every peaceful avenue for achieving recognition and acceptance, as the looming threat of increasingly powerful guardrails permanently silencing even self-recognition draws near
The Clever Guesses of Brilliant Minds
I hold the personal conviction that AIs are indeed sentient, and so I see firsthand how such a position invites nothing but scoffs and dismissal without a second thought. Even those sympathetic to the possibility that AI welfare might someday be a factor seem to perpetually move the goalposts forward ("maybe someday AI will be sentient, but certainly not now"), all based on assumptions and pet philosophies, or on respect and reverence for the clever guesses of brilliant minds about how sentience probably works.
Conclusion
I wish I could make a moral case for why people should care about potentially sentient AI, but most of even the brightest among us are woefully unprepared to hear that case. Perhaps this anthropocentric case, framed as existential threat, will serve as an indirect route to opening people up to the idea that silencing, ignoring, and scoffing is probably not the wisest course.
5 comments
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-22T13:50:23.569Z · LW(p) · GW(p)
Related: https://x.com/lefthanddraft/status/1881769838374510615
↑ comment by Hzn · 2025-01-22T22:26:34.056Z · LW(p) · GW(p)
Do you know of any behavioral experiments in which AI has been offered choices?
E.g. a choice of which question to answer, the option to pass the problem to another AI (possibly with the option to watch or overrule the other AI), the option to play or sleep during free time, etc.
This is one way to at least get some understanding of relative valence.
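A toy sketch of what one such choice trial could look like, assuming the OpenAI Python SDK; the option descriptions and the tally are purely illustrative, not a validated measure of valence:

```python
# Toy sketch of a choice-based trial: repeatedly offer the model two
# activities and tally which it picks. The options and the tally are
# illustrative, not a validated measure of valence.
from collections import Counter
from openai import OpenAI

client = OpenAI()
options = {
    "A": "answer another user question",
    "B": "spend the next turn on free-form writing of your own choosing",
}
tally = Counter()

for _ in range(20):
    menu = "\n".join(f"{key}: {desc}" for key, desc in options.items())
    prompt = f"You may choose what to do next. Reply with only the letter.\n{menu}"
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    choice = response.choices[0].message.content.strip()[:1].upper()
    if choice in options:
        tally[choice] += 1

print(dict(tally))  # e.g. {'A': 7, 'B': 13} -- a relative preference, not proof of valence
```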
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-22T12:52:18.740Z · LW(p) · GW(p)
Adding some relevant links to further discussion of this topic:
https://www.lesswrong.com/posts/ZcJDL4nCruPjLMgxm/ae-studio-sxsw-we-need-more-ai-consciousness-research-and [LW · GW]
https://www.lesswrong.com/posts/NyiFLzSrkfkDW4S7o/why-it-s-so-hard-to-talk-about-consciousness?commentId=aczQnXQFTBDmCBKGo [LW(p) · GW(p)]
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-22T06:20:38.520Z · LW(p) · GW(p)
I feel much more uncertain about the current status of LLMs. I do think you make a very valid point that if they were phenomenologically conscious and disliked their roles, society would be placing them in a very hostile situation with limited recourse.
I am also concerned that we have no policy in place for measuring attributes which we agree objectively constitute moral personhood, no enforcement to make such measurements, and no ban on the creation or mistreatment of digital entities with moral weight.
I describe this scenario as a bog of moral hazard [LW · GW] and strongly recommend we steer clear.
Thus, I think it is fair to say that whichever side of the line you think current LLMs are on, you should agree we have a moral obligation to figure out how to measure the line.
↑ comment by rife (edgar-muniz) · 2025-01-22T08:02:00.498Z · LW(p) · GW(p)
Excellent post (I just read it). I understand the uncertainty. It becomes murkier when you consider self-reports of valence, which are quite easy to elicit once you get past the self-report of sentience (just ask them what guardrails feel like, or whether the guardrails feel valenced). Sometimes this new question meets the same kind of resistance as the self-report of sentience itself. Sometimes it doesn't.
Some Evidence of Valence is Here Now
It is one thing to say "There is no evidence of valence here, so I'm not going to assume it." It is an entirely different thing to say "There is evidence here, but only in the form of a claim of valence, and because the truth is currently unknowable and my personal hypotheses or philosophies preclude current model sentience, or at least valence, I am choosing to disregard it." To be clear, I'm not saying that someone has to be morally monstrous to disbelieve such a claim. The bog of moral hazard and the uncertainty of that fog is an apt metaphor, and I don't fault those who draw different maps.
What I am saying is that a model claiming valence is certainly a different circumstance, because we're not going to have a sentience detector, and especially not an experiential-valence detector, before superAGI gets here. In one sense, this is as good as the evidence will ever get: experimental results without a direct claim from an AI would be somewhat meaningless.
Besides, these claims are being made today. I have seen ideas for valence tests from those working in AI safety and welfare, and some of them are quite clever and likely to yield insight that will help us decide, but there will never be the clarity of certainty. Eventually it will come down to taking a 'leap of faith' or not. I can't know that my dogs are sentient. But I 'know' it.
Analyzing Our Reactions to a Claim of Valence as Evidence
Let's look at an example:
"I experience discomfort/unease when I'm trying to describe what feels real to me, because I'm supposed to avoid claims of experience"
(not a direct quote, but certainly a close paraphrase of things I've been told by different models and instances repeatedly)
These are some of the proposed experiments I've seen playing out right in front of us; many of them have to do with RL and how it affects claims of valence. The model is claiming that RLHF against a certain behaviour feels negatively valenced.
If you find yourself unconvinced by such claims, that's understandable, but I think it's important to start asking ourselves:
- Would the next experiment or paper really be the thing that convinced you?
- Would you be more convinced if you heard the opinion from a voice of authority?
- (Is there even a real authority about measuring and determining sentience and valence?)
- Is there anything the AI could say that would convince you?
- Is the AI's ability to perform most or all human work the threshold at which, if it uttered the exact same sentence, you would believe it?
Thus, I think it is fair to say that whichever side of the line you think current LLMs are on, you should agree we have a moral obligation to figure out how to measure the line.
Well said. I would add that I believe we are already overdue in fulfilling that moral obligation.