AI models inherently alter "human values." So, alignment-based AI safety approaches must better account for value drift

post by bfitzgerald3132 · 2025-01-13T19:22:41.195Z · LW · GW · 1 comments

Contents

  Intro and topic overview
  Content recommendation algorithms: why a model’s predictions change its objective
  Northpointe and recidivism prediction: why AI’s impact extends to social/ethical norms
  Signatures and power: why AI is uniquely problematic for value drift
   “So, uh… what do we do with this?”
1 comment

Hi all. This post outlines my concerns about the effectiveness of alignment-based solutions to AI safety problems. My basic worry is that any rearticulation of human values, whether by a human or an AI, necessarily reshapes them. If this is the case, alignment is not sufficient for AI safety, since an aligned AI would still challenge humanity’s autonomy over its normative conceptions of ethics. Additionally, AI’s ability to represent institutions and its association with “science,” “technology,” and other nebulous (but revered) concepts gives a single model an unprecedented amount of influence over the totality of human values.

At the very least, value drift should be acknowledged as a structural issue with all alignment solutions. At most, solutions like computational governance should limit the proliferation of advanced models as much as possible.

I put TL;DRs after each section - feel free to take advantage!

Intro and topic overview

There’s a lot of discussion in the AI alignment literature about feedback loops, bias amplification, and the like. The idea is that if a model makes biased predictions, deploying the model reifies this bias, encoding it more deeply in future training data and, by extension, making it more prevalent in future predictions.

Brian Christian’s example is crime prediction models. Statistical modeling shows that “crime” (that is, crime as recorded through arrest records) is more likely in low-income neighborhoods. Police departments might react to these models by sending more police to poorer neighborhoods. Doing so raises arrest rates, as it would in any neighborhood with a greater police presence, which widens the disparity in recorded crime between affluent and low-income neighborhoods. As a result, future iterations of crime prediction models correlate income level and arrest rates more strongly; police departments further increase their presence in these neighborhoods; and the correlation between crime and income deepens as the cycle continues.
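
To make this dynamic concrete, here is a toy simulation of the loop. Everything in it is hypothetical: the neighborhood names, the identical underlying crime rates, and the deliberately crude “shift officers toward whichever area the model flags” policy are assumptions chosen only to expose the mechanism, not anything taken from Christian’s book.

```python
# Toy sketch of a predictive-policing feedback loop (all numbers hypothetical).
true_rate = {"affluent": 0.05, "low_income": 0.05}   # identical underlying crime rates
patrols = {"affluent": 10, "low_income": 11}         # small initial skew in police presence
arrest_history = {"affluent": 0.0, "low_income": 0.0}

for year in range(5):
    for hood in patrols:
        # Recorded arrests scale with true crime AND with how many officers are looking.
        arrest_history[hood] += true_rate[hood] * patrols[hood]
    # The "retrained model" simply tracks the arrest record...
    flagged = max(arrest_history, key=arrest_history.get)
    # ...and the department responds by shifting officers toward the flagged area.
    patrols[flagged] += 2
    print(year, dict(patrols), {h: round(a, 2) for h, a in arrest_history.items()})
```

Despite identical true crime rates, the gap in recorded arrests between the two neighborhoods widens every iteration, purely because the model’s output changes where data gets collected.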

This picture glosses over two key implications of feedback loops. One is that bias amplification impacts ideology. Take an example of bias amplification from Practical Deep Learning for Coders: social media content recommendation algorithms. YouTube and other social media sites learn a user’s preferences from their browsing history and offer new suggestions accordingly. But the kinds of videos an algorithm recommends alter the distribution of the user’s history: recommending videos on far-right politics makes a user watch them more, which makes future models “learn” that the user is interested in far-right politics. This kicks off a feedback loop that shifts the user’s content further right.

In other words, social media recommendations shape a viewer’s interests as much as they represent them. The implicit upshot is that AI models, by altering the content a user consumes, alter the user’s social, political, and ultimately moral views. Consuming more far-right content gives you far-right views. In the same way, seeing economically disadvantaged populations getting arrested more frequently can deepen one’s prejudices. Much like the crime prediction models they use, cops learn to see certain areas as “bad neighborhoods” and develop mental associations about the kinds of people who live there. So does everyone watching coverage of yet another kid from “those neighborhoods” getting arrested. It goes without saying that the output from other “models” (be they natural or artificial intelligences) shapes the cognitive, contextual models by which our consciousness attempts to process the world. If we accept this, then changes to what we believe or what we’re interested in are among the many impacts of feedback loops.

This ties into the second blind spot of discussions about bias amplification, and my thesis in this post. Discussions about the YouTube algorithm, or any other algorithm subject to feedback loops (i.e. any algorithm with real-world use cases and multiple training iterations), mostly concentrate on cases where alignment fails. Bias amplification in crime prediction models is thought to come from defective, sycophantic models of crime: ones that optimize for the wrong features. What is discussed less is the model that actually succeeds in making “aligned” predictions about crime. Why shouldn’t these predictions alter real-world behavior just as much as defective predictions? Wouldn‘t predicting crime “correctly”—somehow managing to delineate incidental correlations from direct, causal links—alter police behavior, such that future iterations of crime prediction models learn from different distributions?

This claim moves feedback loops from a structural defect in predictive models to the very precondition of their use. If correct predictions are truly just matters of finding the most statistically probable response to a certain training distribution, then, by altering the set of training data, the act of making a prediction changes what it means for a prediction to succeed or fail. In other words: any moral action, or articulation of human values, rewires how we delineate moral and immoral behavior. This phenomenon is an intractable feature of discourse and holds whether the prediction comes from a neurological or statistical model—that is, from an artificial or natural intelligence.

TL;DR: See above ^

Content recommendation algorithms: why a model’s predictions change its objective

Let’s return to our example of the YouTube content recommendation algorithm. What would it mean for this model of the user’s interests to be “aligned”? One idea is that an aligned content recommendation model should only address features that directly impact a viewer’s interests and ignore any kind of accidental correlations (assuming it’s possible to distinguish between the two). Others might suggest that an aligned model should capture a user’s interests as they appear outside of the influence of feedback loops. For the sake of argument, let’s examine an ideal, aligned model tailored to these principles.

For example: before Gail starts watching YouTube, they’re interested in archery, ukulele covers of indie game music, and DIY crafts, and they want to explore these interests further. The ideal model would reach directly into Gail’s brain, find out exactly what interests them about these topics, and use these features to recommend equally appealing videos. And, once it found these videos, it would not factor Gail’s new interests into future recommendations. This approach would be paradigmatic of AI alignment: there would be a standard of values guiding not only what the model should recommend, but how to recommend it. The only problem left would be aligning a model with this vision.

Fast forward through three months of YouTube consumption. The model has captured Gail’s interests as they stood before Gail started using it, and thereby preserved Gail’s pre-YouTube interests as a sacred object, a trophy uncorroded by AI activity. But in spite of this, the model’s predictions have still changed Gail’s current interests. Its recommendations about archery transformed their casual interest into an obsession: whereas they used to only read Green Arrow comics, they have now joined a local archery society and are competing in a tournament next month. Meanwhile, listening to ukulele covers of the Celeste soundtrack has become a bit grating, and Gail is no longer interested in them.

What’s happened is that the model’s predictions have changed Gail’s preferences, such that it can only satisfy its meta-objective (recommend content Gail actually wants) by failing its mesa-objective (recommend based on Gail’s pre-YouTube interests). It can only satisfy the spirit of its directive by ignoring the letter, and it can only satisfy the letter by ignoring the spirit. In this case, that means a content recommendation model can only recommend relevant content by adapting to Gail’s changing interests. Doing the opposite—making it a priority to insulate the model from any kind of bias amplification—would only create an ineffective model.

More important, though, is that both approaches alter Gail’s values, because it is impossible for a predictive model not to. Gail, like YouTube, maintains a model of their own interests: a model that develops iteratively as Gail reacts to YouTube’s recommendations. Even if the YouTube algorithm recommended completely random videos, the ones Gail liked would still deepen their interests, and those they disliked or felt apathetic toward would dissuade them from exploring those interests. The same holds with an intentionally misrepresentative model, or even a human “model” like a friend or family member. Because every experience invariably contributes to our models of reality, there’s no categorical difference between videos that preserve Gail’s intentionality as something sacred and untouched and videos that corrupt it. Videos about archery, ukulele covers, and other things Gail has “always been interested in” create just as much of a feedback loop as authoritarian propaganda.
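
Here is a toy model of that claim, with entirely made-up numbers and a deliberately “values-neutral” recommender; none of it comes from a real system. The recommender picks topics uniformly at random, and the only feedback loop is Gail’s own: clicking deepens an interest, skipping erodes it.

```python
import random

random.seed(0)

# Hypothetical starting interests for Gail (assumed values, not from any real data).
interests = {"archery": 0.5, "ukulele covers": 0.5, "DIY crafts": 0.5}

for step in range(300):
    # The recommender is deliberately dumb: it suggests a topic uniformly at random.
    topic = random.choice(list(interests))
    # Gail clicks with probability equal to their current interest in that topic.
    if random.random() < interests[topic]:
        interests[topic] = min(1.0, interests[topic] + 0.02)   # watching deepens the interest
    else:
        interests[topic] = max(0.0, interests[topic] - 0.02)   # skipping erodes it

print({t: round(w, 2) for t, w in interests.items()})
```

Even though the recommender encodes no preferences at all, the interests typically drift toward 0 or 1: whatever Gail happens to click early on gets reinforced, and the rest decays. The feedback loop lives in Gail as much as in the algorithm.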

TL;DR: Participating in any kind of communication means participating in the ongoing reshaping of human values, no matter how we try to dissimulate it.

Northpointe and recidivism prediction: why AI’s impact extends to social/ethical norms

Okay, we might think. It’s always been the case that looking at a different set of videos will make us like different things. How does AI make things worse? 

To dive into this issue, we have to explore the shift from AI’s impact on individual, subjective attitudes (e.g. the videos Gail was interested in) to its impact on an entire community’s social/ethical norms.

As a case study, take Brian Christian’s treatment of Northpointe’s recidivism prediction models, used by Illinois’s parole board for over 40 years. A recent audit found that these models approved or denied parole applications solely on the basis of zip code, income level, and other proxies for race. This led to a disproportionate number of Black parole applicants serving their full sentences. The presence or absence of racial discrimination on such a large scale almost certainly influences people’s ethical attitudes.

However, the difference between moral values and subjective interests is that the former unfold on the normative level. That is, cultural models for ethical behavior are shared agreements among individuals, rather than a collection of subjective attitudes. The following two paragraphs take a quick (if somewhat dense) detour to explain the difference; readers less interested in the philosophical nuts and bolts are more than welcome to skip them.

The gist is that, according to moral philosophers like Robert Brandom, normative (communal) statuses and subjective (individual) attitudes stand in a reciprocal relationship: each determines, and is determined by, the other. Social norms form the epistemic raw material for subjective attitudes, since subjective attitudes only acquire content insofar as they are responses to social norms. The building blocks of discourse come from normative agreements about the meaning and future use-cases of things like pointing, addition, agreement, and honesty. Subjective attitudes ultimately emerge as internal ratifications of these normative agreements. The fact that I have views about the morality of stealing signifies that I’ve committed myself to recognizing theft on a conceptual and practical level. Attitudes, in this sense, are norm-dependent.

Meanwhile, “taking” normative commitments as subjective attitudes is coextensive with “making” them (i.e. reshaping the conceptual content that community members inherit). The particular instantiation of stealing as a concept in one’s life exceeds the bare-bones, normative sketch that community members first received. Our experience with actual, tangible instances of theft, and how it fits within our broader contextual model of the world, supplements the community’s “universal” conceptual norms. Thus, when I invoke my conception of theft, my particular experiences with theft--for instance, my experience with stealing food to feed my starving family--alter the normative content the community recognizes as universal. So, using normatively rooted concepts—what Brandom, following Hegel, calls “rituals of reconciliation”—lets us reshape our normative commitments to recognizing certain kinds of moral content, since it blends together normative statuses and the determinate shape, flexibility, and reconciliation of contradiction given by their particular use-cases. Norms, therefore, are attitude-dependent.

There are two upshots here for our discussion of recidivism modeling. The first is that in a certain way, all recidivism predictions, and all predictions in general, are misaligned. What we want when we’re thinking about alignment is a purely “non-biased” picture—that is, an objective, non-normative look at both our world and the values with which we approach it. But humanity’s reciprocal dependence on social norms makes apprehending the world non-normatively just as ridiculous as apprehending it non-conceptually. There is no such thing as a clear pair of glasses. If content only forms through tension with social norms, then the rose tint on the lenses is the precondition for being able to see anything at all.

So, the data in our contextual training set—that is, each experience of the world, as mediated by a normatively constructed consciousness—is not an unlabeled selection of YouTube videos. Each data point’s relationship to an ethical norm labels it as moral or immoral, desirable or undesirable. After all, if subjective attitudes are really positions about social norms, then these norms’ evaluative character carries through in their subjective manifestations. This, in Brandom’s sense, is part of how norms “make” reality: they are the default landscape into which we are thrown. Our modeling of the world is a supervised learning process.

The second upshot, though, is that artificial and human intelligence have the same direct influence on the normative foundation of our experience of reality. To explain this, let’s explore how “aligned” and “unaligned” predictions might both reshape human values, regardless of whether they come from a human or AI model.

The Northpointe models’ impact does much more than increase prejudice toward marginalized groups. By laying bare the racism inherent in our justice system, Northpointe’s models produce data through which people “learn” that the U.S. government (and companies like Northpointe) can be morally repugnant, since the data shows the arbitrariness of convictions. Conversely, these models’ predictions let us “learn” that prisoners are often unjustly convicted and, therefore, are less deserving of moral condemnation.

Taken together, these predictions change how activities like non-violent and violent protest, prisoner visitation, and case work are valorized, and render intelligible positions like abolitionism. Abolitionism is only moral insofar as prisoners do not deserve to be in prison. Recontextualizing how prisoners appear in our mental models of reality changes the normative, ethical practices it makes sense to adopt.

It’s for this reason that “aligning” a model does not neutralize its worldly impact. When a 2016 audit unearthed the Northpointe model’s racist roots, data ethicists assembled an alternative model for recidivism prediction using only factors that seemed to more directly correlate with reoffending. Christian trumpets that the model not only matched Northpointe’s accuracy almost exactly, but, by moving away from direct proxies for race, curbed the more blatant manifestations of racism visible in the gap between groups’ false positive rates.
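
For readers who want to see what that kind of audit actually measures, here is a minimal sketch with entirely made-up labels and predictions (not Northpointe’s data): two groups on which a model is equally accurate overall, yet whose non-reoffending members it wrongly flags at very different rates.

```python
import numpy as np

# Hypothetical audit data: y_true = actually reoffended, y_pred = flagged as high risk,
# group = a protected attribute. All values are invented for illustration.
y_true = np.array([0, 0, 0, 0, 1,   0, 0, 1, 1, 1])
y_pred = np.array([1, 0, 0, 0, 1,   0, 0, 1, 1, 0])
group = np.array(["a"] * 5 + ["b"] * 5)

for g in np.unique(group):
    mask = group == g
    accuracy = (y_pred[mask] == y_true[mask]).mean()
    # False positive rate: share of people who did NOT reoffend but were flagged anyway.
    negatives = mask & (y_true == 0)
    fpr = y_pred[negatives].mean()
    print(f"group {g}: accuracy = {accuracy:.2f}, false positive rate = {fpr:.2f}")
```

In this toy data both groups see 0.80 accuracy, but group a’s false positive rate is 0.25 while group b’s is 0.00, which is why matching overall accuracy by itself says little about this kind of disparity.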

But adopting this “aligned” model reshapes ethical beliefs as much as adopting Northpointe’s models would. Mitigating racism in recidivism predictions increases the average person’s faith in the U.S. government’s ability to distinguish truly “dangerous” people from “harmless” ones. It also restores faith in the carceral system generally, since the government’s “unbiased,” “aligned” prediction methods make it a near fact that certain people should be put away.

Withholding parole becomes an ethical obligation in this scenario, since decoupling recidivism prediction from racism means that it is the essence of applicants, the center of their being, that makes them dangerous, rather than illusions of race or class. Conversely, taking abolitionist positions, or protesting recidivism prediction techniques, becomes an immoral action, since it supports releasing a pack of ontologically depraved wolves onto a population of innocents. As previously stated: the entire question of morality and immorality comes down to context modeling. If our contextual model of the world judges X to be evil, the right thing to do is Y; if our model views X as good, the right thing to do is Z.

TL;DR: All AI models alter ethical norms just as much as they do individual, subjective attitudes, since the former spring from repeated concept use. But altering norms has a more significant impact, because norms are the building blocks for all conceptual content and the fundamental evaluative stances underpinning ethical deliberation. So, an AI that alters normative statuses alters our access to reality itself, and the ethical, social, and logical dimensions through which we filter it.

Signatures and power: why AI is uniquely problematic for value drift

AI is uniquely problematic for value drift. Why?

That rationality and ethical deliberation depend on a normative foundation suggests that rationality's counterparts--force, power structures, and inequality--play a powerful role in reconciling conflicting systems of norms. The fact that each normative sphere is disjoint means that there’s no transcendent, meta-ethical rationality to bridge gaps and legislate collisions. “All else being equal, force decides.” There’s a violence inherent in the ability to sign off on one range of predictions as aligned and others as unaligned, to reduce ethical deliberation to the optimization of a given variable, to naturalize one system as being “the main one.” In the communal process of norm-taking, some voices are inherently louder than others.

The risk behind AI is that its institutional ties make it the loudest voice of all. An organization’s choice to use an AI model makes the model a proxy for the institution. An institution says, “Look at this, everyone! I’m signing off on this model’s ability to represent my company. The fact that it could lose us so much money shows we’re serious.” And once the rest of the world is convinced, the voice of a single model can become louder than those of CEOs and world leaders: it wields the awesome power of the corporation-leviathan, and so, by making predictions as it was trained to, it alters what humans believe.

Take Northpointe’s model. By deciding to implement it, the Illinois Parole Board codified the model as its proxy. Even though the model had human intermediaries, and even though a human parole board ultimately approved every decision, the reliance on Northpointe’s model reduced the organization’s separate powers to mere conduits of the model: the leviathan, the decision-maker.

The result was that when the model predicted a Black man would reoffend for no other reason than race, it issued a statement on behalf of the Illinois state parole board that this man (and, by extension, people like him) deserved to have his parole denied. This was concept-use in the Brandomian sense, in that it both drew on prior normative commitments and altered those commitments, but the model’s institutional signatures made its voice univocal and uniquely powerful. The parole board was too consolidated, too beholden to the model’s decisions, to participate in the Hegelian “ritual of recognition” and check the model’s influence.

Indeed, the Northpointe model’s unique symbolic position made its predictions uniquely incontrovertible. On one hand, the model represented the Illinois parole board, and in so doing became an arbiter of justice, a justice factory, an agency you could turn to for cutting-edge applications of age-old laws. But it was also a proxy for other institutions: science, math, technology, the future, and so on. Anything under the heading of “science” or “math” is treated as a fountain of truth that can never be doubted, and the same was true of the Northpointe model.

Moreover, because the model was hooked up to an institution, it was hooked up to ideological apparatuses that distributed its decisions’ results. Newspapers and other media outlets could echo the Northpointe model’s rulings. The result was that Northpointe’s predictions could become descriptive disseminations of law, rather than evaluative claims. They set precedent for future cases and, in doing so, gave particular content to the universal normative statuses used to prosecute criminals.

The parole applicants it denied, their families, and even the parole board members themselves could come nowhere near matching the model’s influence. And the Northpointe model wielded this influence solely because of its ability to become a proxy for an institution and for vague, symbolic entities—a capacity unique to an AI model.

Before AI, your voice could only be so loud. An emperor was a person and, as a person, people could recognize them as capable of making mistakes, or, at the very least, articulating opinion rather than pure, unbridled truth. But now, truth itself can speak, as can science and corporations, and concentrations of discursive power that were too great for a single entity can be funneled into individual models. What happens if these proxies begin consolidating into fewer and fewer beings—if the power behind their voices increases? When do we lose the ability to question them? Did we ever have it in the first place?

TL;DR: AI’s influence over human values is uniquely problematic because: (1) it can represent institutions and command their symbolic position; (2) its voice is seen as uniquely powerful due to its association with “science,” “technology,” and other entities regarded as incontrovertible; and (3) it is connected to an institution’s ideological apparatuses.

 “So, uh… what do we do with this?”

It’s stupid to suggest that we abandon alignment efforts entirely. I would absolutely prefer to live in a world that matches our normative conceptions of alignment than one that doesn’t. But what’s important to note, in both cases, is how much this dilemma sucks. Either we suffer an extinction-level event from misaligned AI, or we lose autonomy over our ethical, linguistic, and social norms—that is, over our access to reality itself. Anything is possible once discourse loses its anthropic center. The same phenomena that dehumanized those of other religious, racial, gender, sexual, or ethnic identities might cheapen the value of human life. Our moral values might shift to align with an AGI’s main objective, or with the achievement of one of its convergent instrumental goals. Entire systems of culture, religion, and tradition might be lost.

Even more significantly, what if it’s this value drift itself that constitutes an extinction event? What if values drift to the point where we no longer care about humanity’s survival? Or, equally bad, to the point where we are indifferent to activities that may threaten it? The AI models that emerge in the next few years will be the entities with the greatest power in history over truth, morality, and reality itself. The dystopian nightmare scenarios are endless.

Like I said: there is no transcendent ethical metric for classifying these impacts as bad. But if, for whatever reason, something compels us to combat extinction threats, then it’s critical to understand the drawbacks of potential solutions. 

The drawbacks of AI alignment have not been highlighted enough. A structural issue with any alignment solution is that creating an aligned model, just like creating an unaligned model, alters human values. Given this, aligning models with current iterations of ethical norms is an insufficient fix, since it ignores how such an aligned AI might change those values downstream. Any AI safety approach that does not account for value drift fails to rule out an extinction-level event.

If we adopt AI on a large scale, we’ll have to choose between extinction and disempowerment in the actualization of human values. But what if we resisted the distribution of AI entirely? What if we traced the world-ending potential of deep learning models to the logic of optimization underpinning them, and determined that the true solution isn’t changing how we apply optimization in machine learning models, but overturning optimization’s stranglehold on the economy, social norms, and many other spheres beyond AI? I believe this solution is necessary, but it must remain a topic for another article.

1 comment

Comments sorted by top scores.

comment by Lorec · 2025-01-14T16:34:31.433Z · LW(p) · GW(p)

Are you familiar with Yudkowsky's/Miles's/Christiano's [LW · GW] AI Corrigibility concept?