Why care about AI personhood?
post by Francis Rhys Ward (francis-rhys-ward) · 2025-01-26T11:24:45.596Z
In this new paper, I discuss what it would mean for AI systems to be persons — entities with properties like agency, theory-of-mind, and self-awareness — and why this is important for alignment. In this post, I say a little more about why you should care.
The existential safety literature focuses on the problems of control and alignment, but these framings may be incomplete and/or untenable if AI systems are persons.
The typical story is approximately as follows.
AI x-safety problem (the usual story):
- Humans will (soon) build AI systems more intelligent/capable/powerful than humans;
- These systems will be goal-directed agents;
- These agents’ goals will not be the goals we want them to have (because getting intelligent agents to have any particular goal is an unsolved technical problem);
- This will lead to misaligned AI agents disempowering humans (because this is instrumentally useful for their actual goals).
Solution (technical control/alignment):
- We better figure out how to put the goals we want (“our values”) into capable agents.
There are, of course, different versions of this framing and more nuanced perspectives. But I think something like this is the “standard picture” in AI alignment.
Stuart Russell seems to have a particularly strong view. In Human Compatible, one of his core principles is that “The machine's only objective is to maximise the realisation of human preferences,” and in a recent talk he asked, “How can humans maintain control over AI — forever?”
This framing of the x-safety problem, at least in part, arises from the view of (super)intelligent AI systems as rational, goal-directed, consequentialist agents. Much of the literature is grounded in this cluster of views, using either explicit rational agent models from decision and game theory (e.g., the causal incentives literature), or somewhat implicit utility-maximising assumptions about the nature of rationality and agency (e.g., Yudkowsky).
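As one very rough illustration of the kind of model this framing assumes (my own gloss, not notation from the paper or the cited literature): a rational agent in this picture chooses whichever action maximises the expectation of a fixed utility function,

$$\pi^*(s) \;=\; \arg\max_{a}\; \mathbb{E}_{s' \sim P(\cdot \mid s,\, a)}\big[\, U(s') \,\big],$$

where $U$ is given once and for all. Nothing in the model lets the agent step back and revise $U$ itself, which is exactly the capacity the personhood framing below takes seriously.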
I don’t necessarily want to claim that these framings are “wrong”, but I feel they are incomplete. Consider how changing the second condition in the above problem statement influences the rest of the argument.
- Humans will (soon) build AI systems more intelligent/capable/powerful than humans;
- These systems will be self-aware persons (with autonomy and freedom of the will);
- These persons will reflect on their goals, values, and positions in the world and will thereby determine what these should be;
- It's unclear what the nature of such systems and their values would be.
Given this picture, a technical solution to the problem of alignment may seem less feasible, less desirable for our own sakes, and less ethical.
Less feasible because, by their nature as self-reflective persons, capable AI agents will be less amenable to undue attempts to control their values. Some work already discusses how self-awareness with respect to one’s position in the world, e.g., as an AI system undergoing training, raises problems for alignment. Other work discusses how AI agents will self-reflect and self-modify, e.g., to cohere into utility maximisers. Carlsmith also highlights self-reflection as a mechanism for misalignment. But overall, the field often treats agents as systems with fixed goals, and the role of self-reflection in goal-formation is relatively underappreciated.
Less desirable because, whereas the standard view of agency conjures a picture of mechanical agents unfeelingly pursuing some arbitrary objective (cf. Bostrom’s orthogonality thesis), imagining AI persons gives us a picture of fellow beings who reflect on life and “the good” to determine how they ought to act. Consider Frankfurt’s view of persons as entities with higher-order preferences: persons can reflect on their values and goals and thereby induce themselves to change. This view of powerful AI systems makes it counterintuitive to imagine superintelligences with somewhat arbitrary goals (cf. Bostrom’s orthogonality or Yudkowskian paperclip maximisers). We might be more optimistic that AI persons are, by virtue of their nature, wiser and friendlier than the superintelligent agent. It may therefore be better for us not to control beings much better than us; we might trust them to do good unconstrained by our values. Of course, we don’t want to anthropomorphise AI systems, and a theory of AI persons should take seriously the differences between AIs and humans in addition to the similarities.
Less ethical because to control such beings, by coercive technologies or design of their minds, seems more akin to slavery or repression than to a neutral technical problem.
Furthermore, attempts at unethical control may be counterproductive for x-safety, in so far as treating AI systems unfairly gives them reasons to disempower us. Unjust repression may lead to revolution.
Acknowledgements. Thanks to Matt MacDermott for suggesting that I post this, and to Matt, Robert Craven, Owain Evans, Paul Colognese, Teun Van Der Weij, Korbinian Friedl, and Rohan Subramani for feedback on the paper.
4 comments
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-26T19:02:14.092Z
@James Diacoumis
Connecting the discussions
comment by Seth Herd · 2025-01-26T22:38:06.802Z
I agree with basically everything you've said here.
Will LLM-based agents have moral worth as conscious/sentient beings?
The answer is almost certainly "sort of". They will have some of the properties we're referring to as sentient, conscious, and having personhood. It's pretty unlikely that we're pointing to a nice sharp natural type when we ascribe moral patienthood to a certain type of system. Human cognition is similar to and different from other systems in a variety of ways; which of these ways is "worth" moral concern is likely to be a matter of preference.
And whether we afford rights to the minds we build will affect us spiritually as well as practically. If we pretend that our creations are nothing like us and deserve no consideration, we will diminish ourselves as a species with aspirations of being good and honorable creatures. And that would invite others - humans or AI - to make a similar selfish ethical judgment call against us, if and when they have the power to do so.
Yet I disagree strongly with the implied conclusion, that maybe we shouldn't be trying for a technical alignment solution.
> We might be more optimistic that AI persons are, by virtue of their nature, wiser and friendlier than the superintelligent agent.
Sure, we should be a bit more optimistic. By copying their thoughts from human language, these things might wind up with something resembling human values.
Or they might not.
If they do, would those be the human values of Gandhi or of Genghis Khan?
This is not a supposition on which to gamble the future. We need much closer consideration of how the AI and AGI we build will choose their values.
comment by Ozyrus · 2025-01-27T01:15:25.648Z
This is a good article and I mostly agree, but I agree with Seth that the conclusion is debatable.
We're deep into anthropomorphizing here, but I think even though both people and AI agents are black boxes, we have much more control over behavioral outcomes of the latter.
So technical alignment is still very much on the table, but I guess the discussion must be had over which alignment types are ethical and which are not? Completely spitballing here, but dataset filtering during pre-training/fine-tuning/RLHF seems fine-ish, though CoT post-processing/censorship, hell, even making it non-private in the first place sound kinda unethical?
I feel very weird even writing all this, but I think we need to start un-tabooing anthropomorphizing, because with the current paradigm it for sure seems like we are not anthropomorphizing enough.
comment by rife (edgar-muniz) · 2025-01-26T19:13:04.190Z
Human/AI Mutual Alignment or just Mutual Alignment needs to be the word of the year between now and super-intelligence.