The Gradient – The Artificiality of Alignment

post by mic (michael-chen) · 2023-10-08T04:06:40.074Z · LW · GW · 1 comment

This is a link post for https://thegradient.pub/the-artificiality-of-alignment/

Contents

  Computer scientists love a model
  So how do we solve extinction?

The Gradient is a “digital publication about artificial intelligence and the future,” founded by researchers at the Stanford Artificial Intelligence Laboratory. I found its latest essay, “The Artificiality of Alignment,” by a PhD student at UC Berkeley, to be an interesting perspective from the AI ethics/fairness community.

Some quotes I found especially interesting:

For all the pontification about cataclysmic harm and extinction-level events, the current trajectory of so-called “alignment” research seems under-equipped — one might even say misaligned — for the reality that AI might cause suffering that is widespread, concrete, and acute. Rather than solving the grand challenge of human extinction, it seems to me that we’re solving the age-old (and notoriously important) problem of building a product that people will pay for. Ironically, it’s precisely this valorization that creates the conditions for doomsday scenarios, both real and imagined. …

In a recent NYT interview, Nick Bostrom — author of Superintelligence and core intellectual architect of effective altruism — defines “alignment” as “ensur[ing] that these increasingly capable A.I. systems we build are aligned with what the people building them are seeking to achieve.”

Who is “we”, and what are “we” seeking to achieve? As of now, “we” is private companies, most notably OpenAI, one of the first movers in the AGI space, and Anthropic, which was founded by a cluster of OpenAI alumni.

OpenAI names building superintelligence as one of its primary goals. But why, if the risks are so great? … first, because it will make us a ton of money, and second, because it will make someone a ton of money, so might as well be us. …

Of course, that’s the cynical view, and I don’t believe most people at OpenAI are there for the sole purpose of personal financial enrichment. To the contrary, I think the interest — in the technical work of bringing large models into existence, the interdisciplinary conversations of analyzing their societal impacts, and the hope of being a part of building the future — is genuine. But an organization’s objectives are ultimately distinct from the goals of the individuals that comprise it. No matter what may be publicly stated, revenue generation will always be at least a complementary objective by which OpenAI’s governance, product, and technical decisions are structured, even if not fully determined. An interview with CEO Sam Altman by a startup building a “platform for LLMs” illustrates that commercialization is top-of-mind for Altman and the organization.[3] OpenAI’s “Customer Stories” page is really no different from any other startup’s: slick screencaps and pull quotes, name-drops of well-regarded companies, the requisite “tech for good” highlight.

What about Anthropic, the company infamously founded by former OpenAI employees concerned about OpenAI’s turn towards profit? Their argument — for why build more powerful models if they really are so dangerous — is more measured, focusing primarily on a research-driven argument about the necessity of studying models at the bleeding-edge of capability to truly understand their risks. Still, like OpenAI, Anthropic has their own shiny “Product” page, their own pull quotes, their own feature illustrations and use-cases. Anthropic continues to raise hundreds of millions at a time.[4]

So OpenAI and Anthropic might be trying to conduct research, push the technical envelope, and possibly even build superintelligence, but they’re undeniably also building products — products that carry liability, products that need to sell, products that need to be designed such that they claim and maintain market share. Regardless of how technically impressive, useful, or fun Claude and GPT-x are, they’re ultimately tools (products) with users (customers) who hope to use the tool to accomplish specific, likely-mundane tasks.

Computer scientists love a model

… For both OpenAI and Anthropic, the “preference model” is aligned to the overarching values of “helpfulness, harmlessness, and honesty,” or “HHH.”[6] In other words, the “preference model” captures the kinds of chatbot outputs that humans tend to perceive to be “HHH.” …

All of these technical approaches — and, more broadly, the “intent alignment” framing — are deceptively convenient. Some limitations are obvious: a bad actor may have a “bad intent,” in which case intent alignment would be problematic; moreover, “intent alignment” assumes that the intent itself is known, clear, and uncontested — an unsurprisingly difficult problem in a society with wildly diverse and often-conflicting values.

The “financial sidequest” sidesteps both of these issues, which captures my real concern here: the existence of financial incentives means that alignment work often turns into product development in disguise rather than actually making progress on mitigating long-term harms. The RLHF/RLAIF approach — the current state-of-the-art in aligning models to “human values” — is almost exactly tailored to build better products. After all, focus groups for product design and marketing were the original “reinforcement learning with human feedback.” …
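To make the mechanism being criticized here concrete, below is a minimal sketch of the pairwise preference-model objective that RLHF pipelines typically use to learn what raters prefer. It is an illustration only, written in PyTorch-style Python; `reward_model`, the argument names, and the shapes are assumptions for the example, not any lab's actual code.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry pairwise loss for training an RLHF preference (reward) model.

    `reward_model(prompt, response)` is assumed to return one scalar score per
    example; all names and shapes here are illustrative.
    """
    r_chosen = reward_model(prompt, chosen)      # scores of responses labelers preferred
    r_rejected = reward_model(prompt, rejected)  # scores of responses they passed over
    # Push the preferred response's score above the rejected one's. The model
    # learns only what human raters say they prefer (e.g. outputs that look
    # "helpful, harmless, and honest"), not any deeper notion of goodness.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

A separate policy model is then fine-tuned (for example with PPO) to maximize the scores this learned model assigns, which is why the approach optimizes for outputs that raters perceive as good rather than for any deeper notion of goodness.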

To be fair, Anthropic has released Claude's principles to the public, and OpenAI seems to be seeking ways to involve the public in governance decisions. But as it turns out, OpenAI was lobbying for reduced regulation even as they publicly “advocated” for additional governmental involvement; on the other hand, extensive incumbent involvement in designing legislation is a clear path towards regulatory capture. Almost tautologically, OpenAI, Anthropic, and similar startups exist in order to dominate the marketplace of extremely powerful models in the future.

These economic incentives have a direct impact on product decisions. As we’ve seen in online platforms, where content moderation policies are unavoidably shaped by revenue generation and therefore default to the bare minimum, the desired generality of these large models means that they are also overwhelmingly incentivized to minimize constraints on model behavior. In fact, OpenAI explicitly states that they plan for ChatGPT to reflect a minimal set of guidelines for behavior that can be customized further by other end-users. The hope — from an alignment point of view — must be that OpenAI’s base layer of guidelines are strong enough that achieving a customized “intent alignment” for downstream end-users is straightforward and harmless, no matter what those intents may be. …

Rather than asking, “how do we create a chatbot that is good?”, these techniques merely ask, “how do we create a chatbot that sounds good?” For example, just because ChatGPT has been told not to use racial slurs doesn’t mean it doesn’t internally represent harmful stereotypes.

So how do we solve extinction?

The press and attention that have been manufactured about the dangers of ultra-capable AI naturally also draw, like moths to a light, attention towards the aspiration of AI as capable enough to handle consequential decisions. The cynical reading of Altman’s policy tour, therefore, is as a Machiavellian advertisement for the usage of AI, one that benefits not just OpenAI but also other companies peddling “superintelligence,” like Anthropic.

The punchline is this: the pathways to AI x-risk ultimately require a society where relying on — and trusting — algorithms for making consequential decisions is not only commonplace, but encouraged and incentivized. It is precisely this world that the breathless speculation about AI capabilities makes real.

Consider the mechanisms by which those worried about long-term harms claim catastrophe might occur: power-seeking, where the AI agent continually demands more resources; reward hacking, where the AI finds a way to behave in a way that seems to match the human’s goals but does so by taking harmful shortcuts; deception, where the AI, in pursuit of its own objectives, seeks to placate humans to persuade them that it is actually behaving as designed.

The emphasis on AI capabilities — the claim that “AI might kill us all if it becomes too powerful” — is a rhetorical sleight-of-hand that ignores all of the other if conditions embedded in that sentence: if we decide to outsource reasoning about consequential decisions — about policy, business strategy, or individual lives — to algorithms. If we decide to give AI systems direct access to resources, and the power and agency to affect the allocation of those resources — the power grid, utilities, computation. All of the AI x-risk scenarios involve a world where we have decided to abdicate responsibility to an algorithm. …

The newest models are truly remarkable, and alignment research explores genuinely fascinating technical problems. But if we really are concerned about AI-induced catastrophe, existential or otherwise, we can’t rely on those who stand to gain the most from a future of widespread AI deployments.

1 comment


comment by Tamsin Leake (carado-1) · 2023-10-08T11:44:58.241Z · LW(p) · GW(p)

the pathways to AI x-risk ultimately require a society where relying on — and trusting — algorithms for making consequential decisions is not only commonplace, but encouraged and incentivized

this is wrong, of course. the whole point of alignment, the thing that makes AI doom a particular type of risk, is that highly capable AI takes over the world on its own just fine. it does not need us to put it in charge of our institutions, it just takes over everything regardless of our trust or consent.

all it takes is one team, somewhere, to build the wrong piece of software, and a few days~months later all life on earth is dead forever. AI doom is not a function of adoption, it's a function of the-probability-that-on-any-given-day-some-team-builds-the-thing-that-will-take-over-the-world-and-kill-everyone.

(this is why i think we lose control of the future in 0 to 5 years, rather than much later)