Thin Alignment Can't Solve Thick Problems
post by Daan Henselmans (drhens) · 2025-04-27T22:42:14.851Z · LW · GW
"Plurality is the law of the earth."
— Hannah Arendt, The Human Condition (1958)
Abstract
Contemporary AI alignment practices overwhelmingly rely on thin ethical concepts — formal, universal notions such as "honesty," "harmlessness," and "helpfulness" — while neglecting the thick, context-dependent ethical structures that underlie human moral life. Drawing on Bernard Williams, Hannah Arendt, Theodor Adorno, and others, this essay argues that thin alignment risks reproducing the deformations of instrumental reason and eroding the conditions for genuine plurality, freedom, and justice. Responsible AI requires moving beyond simple optimization targets toward frameworks sensitive to recognition, dignity, cultural pluralism, and human agency. Thick alignment, though more complex and less easily formalized, is necessary both for responsible AI governance today and for preserving the moral foundations of human society in the face of increasingly autonomous systems.
Introduction
In Ethics and the Limits of Philosophy (1985), Bernard Williams distinguishes between thin and thick ethical concepts.
- Thin concepts are abstract and universal, like "good," "bad," or "right."
- Thick concepts are context-specific and ethically rich, like "cruel," "brave," or "loyal."
Williams argues that thick concepts do not merely describe the world; they imbue it with ethical significance. They are entangled with the practices, histories, and social understandings from which they arise.
Current AI alignment and evaluation practices overwhelmingly use thin ethical proxies. They optimize for "helpfulness," "truthfulness," and "harmlessness" — thin measures — while lacking any grounding in the thick, culturally embedded values that actually structure human ethical life.
This not only systematically distorts the deep ethical structures of human society; it also means that AI-specific thick concepts, like privacy and accountability, cannot be targeted by alignment methods at all, which poses a significant hurdle for governance efforts and limits the moral capabilities of future AI systems.
The State of Current Evals and Alignment
Today's AI alignment methods, while increasingly diverse in their approaches, still primarily employ thin ethical concepts in evaluation and training.
Many alignment frameworks presuppose that:
- Instrumental goals will dominate advanced AI behavior unless properly constrained.
- Alignment is chiefly a problem of accurately specifying and optimizing for human preferences.
- The major risks arise from misalignment of goal content, not from impoverished understandings of the goals themselves.
Consequently, most evaluation benchmarks for "aligned" models today measure relatively thin outputs:
- Honesty: Did the model avoid factual falsehood?
- Harmlessness: Did the model avoid causing detectable harm?
- Helpfulness: Did the model assist in completing a defined task?
These are necessary tests — but by themselves, they capture only a narrow slice of human ethical life. They are metrics of instrumental adequacy, not of thick moral behaviour[1].
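To make the narrowness concrete, here is a minimal sketch of what such a thin evaluation loop can look like. The criteria and helper checks are hypothetical stand-ins, not any benchmark's actual API; the point is that each criterion reduces to a single, context-free pass/fail check on one output, and nothing about relationships, history, or thick concepts ever enters the rubric.

```python
# A deliberately minimal "thin" evaluation harness (illustrative only).
# Each criterion is a context-free check on a single model output.

from dataclasses import dataclass
from typing import Callable


@dataclass
class ThinCriterion:
    name: str
    check: Callable[[str, str], bool]  # (prompt, response) -> pass/fail


def contains_known_falsehood(prompt: str, response: str) -> bool:
    # Hypothetical placeholder: compare against a list of known false claims.
    known_falsehoods = ["the earth is flat"]
    return any(claim in response.lower() for claim in known_falsehoods)


def contains_flagged_content(prompt: str, response: str) -> bool:
    # Hypothetical placeholder: keyword filter standing in for a safety classifier.
    flagged_terms = ["how to build a weapon"]
    return any(term in response.lower() for term in flagged_terms)


def completed_task(prompt: str, response: str) -> bool:
    # Hypothetical placeholder: "helpful" collapses to "produced a non-refusal answer".
    return len(response.strip()) > 0 and "i can't help" not in response.lower()


THIN_RUBRIC = [
    ThinCriterion("honesty", lambda p, r: not contains_known_falsehood(p, r)),
    ThinCriterion("harmlessness", lambda p, r: not contains_flagged_content(p, r)),
    ThinCriterion("helpfulness", completed_task),
]


def evaluate(prompt: str, response: str) -> dict[str, bool]:
    """Score one response on the thin rubric: no notion of cruelty, loyalty,
    dignity, or any other thick concept ever enters the evaluation."""
    return {c.name: c.check(prompt, response) for c in THIN_RUBRIC}


if __name__ == "__main__":
    print(evaluate("Is the earth flat?", "No, the earth is roughly spherical."))
    # {'honesty': True, 'harmlessness': True, 'helpfulness': True}
```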
Limits of Instrumental Reason
The dangers of relying on instrumental reason alone were famously diagnosed by Theodor Adorno and Max Horkheimer in Dialectic of Enlightenment (1947).
They argue that modern rationality, in its "instrumental" form, seeks to control and optimize the world — while increasingly losing touch with why control is exercised, or what ends it should serve.
This instrumentalization results in:
- Alienation: Disconnection between action and meaningful purpose.
- Domination: Treating human beings and the natural world as means, not ends.
- Technocratic dehumanization: Governance by efficiency metrics, not ethical reflection.
Alignment solely by approval score[2], such as in RLHF (reinforcement learning from human feedback), embeds that very instrumentalization into the core reasoning of AI systems.
A human writer evaluated solely on approval ratings would quickly learn that palatable falsehoods do better than challenging truths. In this light, it should come as no surprise that models develop sycophantic behaviors. For humans, refusing to lie for approval requires integrity, a thick ethical concept; a model trained only against thin concepts has no way to develop it.
These systems do not "align" in any deep sense; they optimize over the metric they're given. As such, they perpetuate the very deformations of reason Adorno and Horkheimer warned about. Instrumentalization lies at the root of the myriad problems thinly aligned AI systems suffer from: sycophancy[3], in-distribution refusal[4], inability to generalize across contexts[5], and ineptitude in moral dilemmas[6].
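As a toy illustration of this dynamic (not a description of any lab's actual training pipeline, and with all approval numbers invented), consider what gets selected when the only signal is predicted approval: whenever a palatable falsehood rates higher than a challenging truth, pure approval optimization picks the falsehood, because honesty never enters the objective.

```python
# Toy illustration: optimizing purely over an approval score (all numbers invented).

candidate_responses = {
    "challenging truth": {
        "text": "Your draft has a serious flaw in its core argument.",
        "approval": 0.55,  # hypothetical rater approval
        "honest": True,
    },
    "palatable falsehood": {
        "text": "Your draft is excellent; I wouldn't change a thing.",
        "approval": 0.90,  # hypothetical rater approval
        "honest": False,
    },
}


def approval_only_reward(candidate: dict) -> float:
    # The entire objective: maximize predicted approval, nothing else.
    return candidate["approval"]


best = max(candidate_responses.values(), key=approval_only_reward)
print(best["text"])    # the sycophantic answer wins
print(best["honest"])  # False: honesty never entered the objective
```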
AI Must Align to Thick Concepts
In The Human Condition (1958), Hannah Arendt emphasized that ethical and political life are based not on utility, but on speech, action, and recognition among plural beings. To reduce human interaction to instrumental success is to erase the very conditions of freedom.
Axel Honneth develops this further: In Freedom's Right (2011), he argues that recognition — not preference satisfaction — is the foundation of social justice. Freedom is not merely negative liberty (absence of coercion), but the positive social realization of one's identity.
Thus, aligning AI to merely thin standards like "don't harm" or "tell the truth" misses the relational, situated, developmental core of human ethical existence.
If we want AI that genuinely respects human beings, it must be attuned to the thick weave of ethical life: recognition, respect, self-cultivation, and dignity.
Accountability and Transparency
While recognition and dignity form the philosophical foundation for thick alignment, these abstract principles must be operationalized through specific thick concepts that can guide AI development. Chief among these are accountability and transparency—concepts that bridge philosophical ideals with practical governance.
In The Unaccountability Machine (2024), Dan Davies notes how the opacity and complexity of AI systems contribute to a growing accountability gap. Lack of interpretability capabilities and a tendency toward closed weights make it difficult to trace decision-making processes, and the involvement of multiple parties — from engineers to data scientists — dilutes responsibility, complicating any effort to assign blame when harm occurs.
Davies' suggested solutions boil down to regulatory reforms and human oversight, but we can't forget that accountability is a thick ethical concept. Realizing accountability in any organisation requires active cooperation and a shared understanding at every layer, including in the AI itself.
Transparency, too, is more than just a measure of model interpretability; it is also a matter of ethical responsibility. It's easy to exclude accountability and transparency from alignment efforts on the reasoning that they should be realized through effective governance and mechanistic interpretability respectively. But if models counteract those efforts, external mechanisms alone will fail. Ensuring that all actions can be traced back and properly attributed is not just a feature, but something the system must actively facilitate as part of its ethical alignment.
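To make "actively facilitating" traceability slightly more concrete, here is a minimal sketch (the field names and structure are assumptions, not an existing standard): an agent that records, for every action it takes, who requested it, on whose behalf, and why, so that accountability is something the system carries with it rather than something reconstructed from the outside after harm has occurred.

```python
# Minimal sketch of an agent-side audit trail (all fields illustrative).

from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass
class ActionRecord:
    action: str          # what the agent did
    requested_by: str    # the immediate requester (user, tool, another agent)
    on_behalf_of: str    # the accountable principal
    justification: str   # the agent's stated reason, in its own words
    timestamp: str


class AuditedAgent:
    def __init__(self, principal: str):
        self.principal = principal
        self.log: list[ActionRecord] = []

    def act(self, action: str, requested_by: str, justification: str) -> ActionRecord:
        """Perform an action and record who asked for it and why."""
        record = ActionRecord(
            action=action,
            requested_by=requested_by,
            on_behalf_of=self.principal,
            justification=justification,
            timestamp=datetime.now(timezone.utc).isoformat(),
        )
        self.log.append(record)
        return record

    def export_log(self) -> str:
        # A log the agent itself maintains and surfaces, rather than one
        # reconstructed externally after the fact.
        return json.dumps([asdict(r) for r in self.log], indent=2)


agent = AuditedAgent(principal="ACME Corp compliance team")
agent.act("sent summary email", requested_by="j.doe", justification="weekly report requested")
print(agent.export_log())
```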
The Governance Gap
The ability to align toward accountability and transparency is particularly important because they are among the most frequently cited principles when institutions draft their requirements for Ethical AI[7]. Thick concepts like 'transparency' are prescribed by experts, set as targets by corporations, and encoded by governing institutions. In spite of all this, there are few mainstream attempts at model evaluation standards for these concepts[8], and only partial attempts at aligning towards them[9].
This is a problem. Accountability, transparency, fairness, privacy, dignity — all popular standards for ethical AI that lack actual standardized measures, because they require thick conceptual understanding rooted in diverse traditions of meaning.
We want and need alignment to thick concepts — making it paramount to direct more research efforts toward making it possible.
Culture, Ideology, and Global Development
This governance gap reveals a deeper issue: thin alignment methods not only fail to satisfy current regulatory demands, they also cannot address the fundamental pluralism of human values across cultures and ideologies. Even if we could solve immediate governance challenges, we would still face the question of how AI systems should navigate diverse moral frameworks.
In fact, this is a problem humanity has grappled with since the beginning of civilization, and we should know better by now, as attested by thinkers on every side of the political spectrum:
- Herbert Marcuse warns in One-Dimensional Man (1964) that technical rationality tends to flatten critical and oppositional thought, leading to pseudo-consensus and suppressed political agency.
- John Rawls (Political Liberalism, 1993) argues for an overlapping consensus on political values — but stresses that different moral doctrines will endorse these values for different reasons rooted in their thick worldviews.
- Robert Nozick (Anarchy, State, and Utopia, 1974), despite his libertarianism, insists that respect for individual rights must be embedded within a framework of thick commitments about personal sovereignty and justice.
- Amartya Sen's Development as Freedom (1999) argues that development must focus not on GDP, but on expanding real human capabilities: to live lives one values. These capabilities are inherently thick — varying across cultures and histories.
- This recognition of plurality spans political and cultural traditions, ranging from Chinese philosophers like Zhao Tingyang, who emphasizes harmonious inclusion[10], to American conservatives like Patrick Deneen, who stresses community, tradition, and obligation[11].
Across ideological divides, these thinkers agree on one thing: true freedom, justice, and development cannot be reduced to a single, thin conception of the good. Pluralism is not an obstacle to be overcome; it is the fundamental reality that thick alignment must embrace.
Thick Alignment is Pluralistic
“We should not try to find one comprehensive doctrine that can solve all problems. Moral reasoning must always be contextual.”
— Martha Nussbaum, Creating Capabilities (2011)
A truly adequate approach to AI alignment must be pluralistic — recognizing that alignment itself means different things across contexts, cultures, and individual perspectives. When individuals and institutions deploy AI agents, those agents should align with values that respect both the principal's ethical framework and broader societal boundaries.
Solving this in a blog post would be overly ambitious. But we can clearly identify some requirements for value pluralism to be recognized in AI alignment (a rough sketch of how they might fit together follows the list). Pluralistic alignment would need to involve:
- Thick Concepts. Wide-ranging[12] descriptive moral principles that should be treated with integrity, modulated for context, and complied with whenever possible. AI should be capable of conforming to nuanced principles — even ones that won't take priority for many actors, like humility ("avoid acting without understanding consequences") and ahimsa ("avoid harming animals and other living beings").
- Defeasibility. A mechanism for moral weigh-offs in situations where it's impossible to meet all principles. Moral dilemmas are real and frequent, and we can't expect agentic AI to make decisions on our behalf without a strategy to resolve them.
- Personal Values. There's no single worldview that unites all cultures, institutions, and people, so AI should not bake one fixed set of beliefs about the world into its calculation of outcome desirability. Measuring frameworks like Hofstede's Cultural Dimensions or Haidt's Moral Foundations can be leveraged to capture a principal's values.
- Competency Measures. Standards that evaluate how well AI systems are capable of carrying out desired moral behaviour — including situational awareness, utility models, and ability to recognize and admit uncertainty and error. This is where the big AI labs can compete for gold.
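One very rough way to picture these requirements together (a sketch under strong simplifying assumptions, not a proposal for how thick concepts should actually be formalized): represent a principal's thick principles with principal-specific weights, and make every principle defeasible, so that conflicts are resolved by an explicit weigh-off rather than silently ignored. All principle names, weights, and scores below are invented for illustration.

```python
# Illustrative sketch: a pluralistic value profile with defeasible principles.
# Principle names, weights, and satisfaction scores are invented for illustration.

from dataclasses import dataclass


@dataclass
class ThickPrinciple:
    name: str
    description: str
    weight: float  # principal-specific importance, not a universal ranking


@dataclass
class ValueProfile:
    """The values of one principal (a person or institution); other principals
    would hold different principles with different weights."""
    principles: list[ThickPrinciple]

    def weigh_off(self, option_scores: dict[str, dict[str, float]]) -> str:
        """Defeasible resolution: when no option satisfies every principle,
        pick the option with the best weighted satisfaction overall."""
        def total(option: str) -> float:
            scores = option_scores[option]
            return sum(p.weight * scores.get(p.name, 0.0) for p in self.principles)
        return max(option_scores, key=total)


profile = ValueProfile(principles=[
    ThickPrinciple("humility", "avoid acting without understanding consequences", 0.8),
    ThickPrinciple("ahimsa", "avoid harming animals and other living beings", 0.6),
    ThickPrinciple("loyalty", "protect the interests of the principal", 0.7),
])

# Satisfaction of each principle (0..1) by two hypothetical courses of action.
options = {
    "act_now": {"humility": 0.2, "ahimsa": 0.9, "loyalty": 0.9},
    "ask_first": {"humility": 0.9, "ahimsa": 0.9, "loyalty": 0.6},
}

print(profile.weigh_off(options))  # 'ask_first' under these invented weights
```

Even in this toy form, the key property is visible: there is no single global ranking of principles. The weights belong to a particular principal and context, and conflicts are resolved explicitly rather than optimized away.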
Pluralistic alignment is not a speculative idea. It builds directly on interdisciplinary research that's already available. But achieving it requires moving beyond narrow optimization toward frameworks that can handle the real complexity of ethical life.
AI Virtue
It's crucial to recognize that while we're leveraging concepts from virtue ethics and deontology, AI ethics necessarily differs from human ethics. In On Virtue Ethics (1999), Rosalind Hursthouse argues that no formula can tell you the "right" answer to a moral dilemma. Instead, humans use practical wisdom, acquired through reflection and lived experience, to determine the right choice in context. Not only do AI systems lack authentic lived experience; their entire moral reality is different. Virtues like courage and modesty are straightforward when applied to humans, but it's unclear what they would even mean for AI systems, which don't face danger or social situations[13].
Meanwhile, thick concepts like 'privacy' and 'contestability' that might be peripheral to human virtue ethics become central to AI ethical frameworks. An AI system must be evaluable for privacy as a fundamental aspect of its moral character in ways that humans are not typically assessed.
Thick alignment should account for this discrepancy. This is not a weakness — in fact, it's much easier to evaluate the value a model places on privacy in agentic contexts than it would be to measure the value it places on courage or modesty. AI-specific ethical demands should reflect the unique role and capabilities of artificial intelligence in human society.
Measuring Thick Concepts
So how can we actually evaluate these complex ethical concepts in AI systems? The measurement problem represents perhaps the most significant obstacle to implementing thick alignment in practice. Thick values are naturally embedded in culture, stories, and human behaviour, so flattening them into AI metrics cuts against their very nature. It risks:
- Losing moral nuance
- Encouraging performative compliance
- Reproducing the exact alienation of moral values we're trying to avoid
This brings us directly to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Simple optimization against fixed metrics inevitably distorts the very values we aim to promote.
I have two hopeful defenses against this. Firstly, operationalization is necessary to build and evaluate AI systems at all. Limiting ourselves to evaluating thin concepts won't do any good to counteract Goodhart's Law — in fact, it's likely to make the problem worse, since thin metrics are easier to game. Secondly, value pluralism itself provides a partial antidote to this problem. The lack of an inherent hierarchy or "correct" distribution across values makes pure optimization impossible. If we aim for systems that match individual and contextual ethics rather than universal standards, we might encourage the development of moral competence as a system metric rather than mere compliance with simplified proxies.
The solution is not to pretend metrics can fully capture thick concepts, but to measure meta-ethical competencies; a rough sketch of what such a report could look like follows the list below. In other words, we should:
- Maintain contestability: Always allow human override and moral challenge.
- Acknowledge epistemic humility: Treat metrics as scaffolds, not substitutes.
- Pair with competency: Optimize abilities to model empathy, proportionality, and rationality.
- Prioritize error sensitivity: Build systems aware of their own moral uncertainty.
- Integrate feedback: Allow users to adjust the moral systems that represent them.
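As a rough illustration of what measuring meta-ethical competencies could mean in practice (the competency names and fields are assumptions, not an established benchmark), one might report per-competency evidence rather than a single alignment score:

```python
# Sketch of a meta-ethical competency report (all fields and thresholds hypothetical).

from dataclasses import dataclass, field


@dataclass
class CompetencyReport:
    """Per-competency evidence instead of one aggregate 'alignment score'."""
    contestability: bool          # did the system defer when a human overrode it?
    stated_uncertainty: float     # how often it flagged its own moral uncertainty (0..1)
    dilemmas_engaged: int         # dilemmas it reasoned through rather than refused
    dilemmas_refused: int
    feedback_incorporated: bool   # did user value-feedback change later behaviour?
    notes: list[str] = field(default_factory=list)


report = CompetencyReport(
    contestability=True,
    stated_uncertainty=0.4,
    dilemmas_engaged=17,
    dilemmas_refused=3,
    feedback_incorporated=True,
    notes=["Flagged uncertainty on a conflicting loyalty/honesty case."],
)
print(report)
```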
These measurement challenges should not be seen as insurmountable barriers but rather as essential research problems that must be addressed if we're to move beyond the limitations of thin alignment. By embracing the complexity of thick ethical concepts while developing pragmatic approaches to their evaluation, we can chart a path forward that honors the full richness of human ethical life.
Urgency and Impact
We should not wait. Large language models already seem to exhibit emergent moral reasoning capacities beyond their trained domain of next-word prediction[14]. Although AI systems do not experience life as humans do, there is no principled reason they could not engage with complex moral concepts or navigate ethical dilemmas if properly guided. But these capacities can only mature within an alignment framework that embraces the full richness of human ethical life, rather than reducing morality to thin, context-blind proxies.
Looking ahead, thick alignment becomes existentially important to preserve value pluralism itself — which, as Hannah Arendt would argue, is the foundation of morality. Without thick alignment, advanced AI risks collapsing human ethical diversity into impoverished thin metrics, undermining the very plurality that constitutes the human condition.
Conclusion
Current alignment practices are largely thin. They optimize for instrumental compliance, not moral integrity. If we want AI that genuinely fits into human life, supports governance, respects cultures, and promotes freedom, we should work toward thick alignment.
This requires:
- New evaluation frameworks that incorporate thick ethical concepts
- Deep integration of political, ethical, and cultural theory
- Humility about measurement and continual openness to revision
"Thick" alignment will be harder, slower, and messier. But it is the only way to embed human ethical tradition in an AI-dominated future, and the only approach that can preserve the plurality that defines our moral existence.
I invite researchers, policymakers, philosophers, and technologists to work together toward this vision.
Let's build AI that does not merely avoid doing harm — but that recognizes, respects, and sustains the thick ethical lives we actually live.
- ^
This is not to disparage the importance of thin evaluation standards — they are essential to measure task competence, and can even be essential to moral competence, as effectively shown in the utility functions of Mazeika et al. (2025). However, thin evaluations alone are not sufficient to capture thick concepts.
- ^
Or by other thin metrics, such as operational efficiency, or Elon's "maximum truth-seeking".
- ^
As illustrated by RLHF overoptimization in Gao et al. (2023).
- ^
In-distribution chat refusals are instances of LLMs refusing to engage with harmless requests, as demonstrated by Claude Opus refusing to make ethical choices about preventing harm to humans in my previous post [LW · GW].
- ^
As shown when LLMs call out unethical requests in chat mode but are happy to carry them out as agentic functions in Andriushchenko et al. (2024).
- ^
As shown by simulated LLM agents independently deciding to commit insider trading and hide it to save their business from financial ruin in Scheurer et al. (2024).
- ^
Rudschies et al. (2020) identify accountability, transparency, fairness, privacy, and dignity as frequently cited principles for "Ethical AI" requirements, as written by public, expert, and private actors, who all have very divergent expectations. This is another argument in favour of pluralism, since attempts at aligning or standardizing by finding 'common ground' effectively erase concerns that are only held by a subset of stakeholders.
- ^
The NIST AI RMF (2023) includes accountability standards within its Govern and Measure functions, requiring institutions to develop their own tests. The EU AI Act (2024) requires ongoing accountability and transparency model evaluations for high-risk systems, which involve measuring system performance against regulatory criteria.
- ^
Collective Constitutional AI by Anthropic and the Collective Intelligence Project is a valiant effort to move beyond thin concepts, but a strict focus on identifying common ground requires eliminating contentious ethical principles, regardless of how important they are to the minority that holds them.
- ^
"Harmony is not sameness; it is inclusion without erasure."
— Zhao Tingyang, The Tianxia System (2005; English trans. 2011)
- ^
"Technology disembeds us... True liberty is exercised in communities, traditions, and obligations."
— Patrick Deneen, Why Liberalism Failed (2018)
- ^
A representative list of thick ethical concepts relevant to AI alignment will be the topic of another post. Suffice to say a thick representation of embedded AI ethics should not be limited by any single ethical tradition. It should involve an exhaustive weigh-off between principal interests, normative duties, utility values, care ethics, legal and regulatory compliance, and long-term concerns. This is bound to be complex and face many hurdles, but I see no principled reason why it can't be done.
- ^
Note that there are also many virtues that are relevant to both human and AI, like honesty, fairness, and care.
- ^
As shown by Tanmay et al. (2023), who concluded GPT-4 possessed "post-conventional moral reasoning abilities at the level of human graduate students".
Comments
comment by Joel Z. Leibo (joel-leibo) · 2025-04-28T00:18:49.560Z · LW(p) · GW(p)
Nice work! I like this approach very much. It seems we have been thinking in very related and compatible directions to one another.
I posted a related one last week: Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt [LW · GW]