Why care about AI personhood?
post by Francis Rhys Ward (francis-rhys-ward) · 2025-01-26T11:24:45.596Z
In this new paper, I discuss what it would mean for AI systems to be persons — entities with properties like agency, theory-of-mind, and self-awareness — and why this is important for alignment. In this post, I say a little more about why you should care.
The existential safety literature focuses on the problems of control and alignment, but these framings may be incomplete and/or untenable if AI systems are persons.
The typical story is approximately as follows.
AI x-safety problem (the usual story):
- Humans will (soon) build AI systems more intelligent/capable/powerful than humans;
- These systems will be goal-directed agents;
- These agents’ goals will not be the goals we want them to have (because getting intelligent agents to have any particular goal is an unsolved technical problem);
- This will lead to misaligned AI agents disempowering humans (because this is instrumentally useful for their actual goals).
Solution (technical control/alignment):
- We better figure out how to put the goals we want (“our values”) into capable agents.
There are, of course, different versions of this framing and more nuanced perspectives. But I think something like this is the “standard picture” in AI alignment.
Stuart Russell seems to have a particularly strong view. In Human Compatible, one of his core principles is that “The machine's only objective is to maximise the realisation of human preferences,” and in a recent talk he asked, “How can humans maintain control over AI — forever?”
This framing of the x-safety problem, at least in part, arises from the view of (super)intelligent AI systems as rational, goal-directed, consequentialist agents. Much of the literature is grounded in this cluster of views, using either explicit rational agent models from decision and game theory (e.g., the causal incentives literature), or somewhat implicit utility-maximising assumptions about the nature of rationality and agency (e.g., Yudkowsky).
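As one very rough illustration of the kind of model this framing assumes (my own gloss, not notation from the paper or the cited literature): a rational agent in this picture chooses whichever action maximises the expectation of a fixed utility function,

$$\pi^*(s) \;=\; \arg\max_{a}\; \mathbb{E}_{s' \sim P(\cdot \mid s,\, a)}\big[\, U(s') \,\big],$$

where $U$ is given once and for all. Nothing in the model lets the agent step back and revise $U$ itself, which is exactly the capacity the personhood framing below takes seriously.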
I don’t necessarily want to claim that these framings are “wrong”, but I feel they are incomplete. Consider how changing the second condition in the above problem statement influences the rest of the argument.
- Humans will (soon) build AI systems more intelligent/capable/powerful than humans;
- These systems will be self-aware persons (with autonomy and freedom of the will);
- These persons will reflect on their goals, values, and positions in the world and will thereby determine what these should be;
- It's unclear what the nature of such systems and their values would be.
Given this picture, a technical solution to the problem of alignment may seem less feasible, less desirable for our own sakes, and less ethical.
Less feasible because, by their nature as self-reflective persons, capable AI agents will be less amenable to undue attempts to control their values. Some work already discusses how self-awareness with respect to one’s position in the world, e.g., as an AI system undergoing training, raises problems for alignment. Other work discusses how AI agents will self-reflect and self-modify, e.g., to cohere into utility maximisers. Carlsmith also highlights self-reflection as a mechanism for misalignment. But overall, the field often treats agents as systems with fixed goals, and the role of self-reflection in goal-formation is relatively underappreciated.
Less desirable because, whereas the standard view of agency conjures a picture of mechanical agents unfeelingly pursuing some arbitrary objective (cf. Bostrom’s orthogonality thesis), imagining AI persons gives us a picture of fellow beings who reflect on life and “the good” to determine how they ought to act. Consider Frankfurt’s view of persons as entities with higher-order preferences: persons can reflect on their values and goals and thereby induce themselves to change. This view of powerful AI systems makes it counterintuitive to imagine superintelligences with somewhat arbitrary goals (cf. Bostrom’s orthogonality or Yudkowskian paperclip maximisers). We might be more optimistic that AI persons are, by virtue of their nature, wiser and friendlier than the superintelligent agent. It may therefore be better for us not to control beings much better than us; we might trust them to do good unconstrained by our values. Of course, we don’t want to anthropomorphise AI systems, and a theory of AI persons should take seriously the differences between AIs and humans in addition to the similarities.
Less ethical because to control such beings, by coercive technologies or design of their minds, seems more akin to slavery or repression than to a neutral technical problem.
Furthermore, attempts at unethical control may be counterproductive for x-safety, in so far as treating AI systems unfairly gives them reasons to disempower us. Unjust repression may lead to revolution.
Acknowledgements. Thanks to Matt MacDermott for suggesting that I post this, and to Matt, Robert Craven, Owain Evans, Paul Colognese, Teun Van Der Weij, Korbinian Friedl, and Rohan Subramani for feedback on the paper.
4 comments
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-26T19:02:14.092Z
@James Diacoumis
Connecting the discussions
comment by Seth Herd · 2025-01-26T22:38:06.802Z
I agree with basically everything you've said here.
Will LLM-based agents have moral worth as conscious/sentient beings?
The answer is almost certainly "sort of". They will have some of the properties we're referring to as sentient, conscious, and having personhood. It's pretty unlikely that we're pointing to a nice sharp natural type when we ascribe moral patienthood to a certain type of system. Human cognition is similar to and different from other systems in a variety of ways; which of these ways is "worth" moral concern is likely to be a matter of preference.
And whether we afford rights to the minds we build will affect us spiritually as well as practically. If we pretend that our creations are nothing like us and deserve no consideration, we will diminish ourselves as a species with aspirations of being good and honorable creatures. And that would invite others - humans or AI - to make a similar selfish ethical judgment call against us, if and when they have the power to do so.
Yet I disagree strongly with the implied conclusion, that maybe we shouldn't be trying for a technical alignment solution.
> We might be more optimistic that AI persons are, by virtue of their nature, wiser and friendlier than the superintelligent agent.
Sure, we should be a bit more optimistic. By copying their thoughts from human language, these things might wind up with something resembling human values.
Or they might not.
If they do, would those be the human values of Gandhi or of Genghis Khan?
This is not a supposition on which to gamble the future. We need much closer consideration of how the AI and AGI we build will choose their values.
comment by Ozyrus · 2025-01-27T01:15:25.648Z
This is a good article and I mostly agree, but I agree with Seth that the conclusion is debatable.
We're deep into anthropomorphizing here, but I think even though both people and AI agents are black boxes, we have much more control over behavioral outcomes of the latter.
So technical alignment is still very much on the table, but I guess the discussion must be had over which alignment types are ethical and which are not? Completely spitballing here, but dataset filtering during pre-training/fine-tuning/RLHF seems fine-ish, though CoT post-processing/censorship, hell, even making it non-private in the first place sound kinda unethical?
I feel very weird even writing all this, but I think we need to start un-tabooing anthropomorphizing, because with the current paradigm it for sure seems like we are not anthropomorphizing enough.
comment by rife (edgar-muniz) · 2025-01-26T19:13:04.190Z
Human/AI Mutual Alignment or just Mutual Alignment needs to be the word of the year between now and super-intelligence.