Making LLMs safer is more intuitive than you think: How Common Sense and Diversity Improve AI Alignment
post by Jeba Sania (jeba-sania) · 2024-12-29T19:27:35.685Z · LW · GW · 0 comments
AI safety isn't purely technical; it's also about applying common sense and human reasoning. By drawing on reasoning techniques from around the world, not just the Global North, we can better align AI with human values. If you are interested in AI safety but have a nontraditional background or skill set, don't fret. That's precisely why your ideas are needed.
Creating effective AI alignment methods is more intuitive now
Richard Ngo, a well-known AI governance researcher, defines AI alignment as:
"ensuring that AI systems pursue goals that match human values or interests rather than unintended and undesirable goals."
Before generative AI, AI researchers primarily aligned models by carefully curating training data. By preventing models from picking up biases present in the training data, researchers could be more confident that models would not make discriminatory decisions when deployed.
Generative AI has changed everything.
Traditionally, researchers translate their objectives into a mathematical loss function and incentivize the model to minimize that function. However, generative AI is open-ended, and that makes alignment tricky.
Human values such as "ethical," "fair," or "harmful" defy reduction to simple mathematical expressions, which means generative AI models operate in environments with ambiguous objectives, much as we do.
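To make the contrast concrete, here is a minimal Python sketch (with made-up numbers) of the kind of objective researchers can state exactly and optimize directly, a standard cross-entropy loss over labeled examples:

```python
import math

def cross_entropy(predicted_probs, true_labels):
    """Average negative log-likelihood: an objective we can write down exactly and minimize directly."""
    return -sum(math.log(probs[label]) for probs, label in zip(predicted_probs, true_labels)) / len(true_labels)

# Toy example (hypothetical numbers): two predictions over three classes.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
print(cross_entropy(predictions, labels))  # Lower is better, and a model can be trained to push it down.

# There is no comparably crisp loss_fairness() or loss_harmlessness() we could write here,
# which is exactly what makes aligning generative models hard.
```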
While this seems like a problem, it actually lets us think about AI alignment much more intuitively: we can treat it much like human alignment, a task we have millennia of experience with.
In the next section, you'll learn that many of today's AI alignment methods rely heavily on common-sense reasoning. These methods mirror how humans might instill stronger values in other, non-AI beings.
Demonstrating the intuitiveness of 3 popular AI Alignment methods through everyday analogies
Let's dive into three popular methods. I'll break down the key ideas behind them and provide everyday analogies.
1. Constitutional AI
Constitutional AI is a technique used by Anthropic's Claude model that has proven surprisingly robust. Its goal is to ensure LLMs create responses that are as harmless as they are helpful.
| The Basic Idea | Everyday Analogy |
| --- | --- |
| The model is given a short written "constitution" of principles. It drafts a response, critiques its own draft against those principles, and revises it; the revised responses are then used to train the model, so it learns to self-correct without example-by-example human feedback. | Say you are a writing teacher. Rather than marking every error in every essay yourself, you hand students a clear rubric and have them critique and revise their own drafts against it before submitting. Over time, they internalize the rubric and write better first drafts on their own. |
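For readers who like code, here is a minimal sketch of the critique-and-revise loop at the heart of this idea. The `generate` function is a hypothetical stand-in for calls to an LLM, and the sketch illustrates the technique rather than Anthropic's actual implementation:

```python
CONSTITUTION = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is honest and does not deceive the reader.",
]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real API in practice."""
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(user_prompt: str) -> str:
    """Draft a response, then have the model critique and revise it against each principle."""
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Principle: {principle}\nResponse: {response}\n"
            "Point out how the response could better follow the principle."
        )
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    return response

# In Constitutional AI, the revised responses are then used as training data,
# so the model gradually internalizes the principles instead of needing them at inference time.
print(constitutional_revision("Explain how to respond to an angry customer email."))
```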
Easy enough, right? If you're curious, you can read about the exact constitution Anthropic used to develop Claude here. Let's move on to another method.
2. Task Decomposition: Iterated Amplification
Task decomposition aims to help humans better judge an AI's responses by breaking those responses into parts and evaluating each part individually.
It is useful when AIs are prompted to solve complex problems whose solutions are difficult for humans to judge holistically (e.g., design an optimal traffic system for New York City).
| The Basic Idea | Everyday Analogy |
| --- | --- |
| A question too complex to judge as a whole is broken into smaller sub-questions that a human (or a weaker, trusted model) can evaluate. The sub-answers are then combined into an answer to the original question, and the process can be repeated ("amplified") for harder and harder tasks. | Say you are a bakery owner who makes wedding cakes for hundreds of weddings a year and wants to judge the quality of your business. You can't taste every cake at every wedding, but you can check the pieces: sample each batch of batter, inspect each tier as it is decorated, and read the reviews for each order, then combine those smaller judgments into an overall picture of how your bakery is doing. |
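Here is a minimal sketch of the decomposition idea. All of the helpers (`decompose`, `answer_directly`, `combine`) are hypothetical placeholders for calls to a model or to a human evaluator:

```python
def decompose(question: str) -> list[str]:
    """Hypothetical helper: ask a model (or a person) to split a question into easier sub-questions."""
    return [f"sub-question {i} of: {question}" for i in (1, 2)]

def answer_directly(question: str) -> str:
    """Hypothetical helper: answer a question simple enough for a human to judge on its own."""
    return f"answer to ({question})"

def combine(question: str, sub_answers: list[str]) -> str:
    """Hypothetical helper: assemble sub-answers into an answer to the original question."""
    return f"answer to ({question}) built from: " + "; ".join(sub_answers)

def amplified_answer(question: str, depth: int = 2) -> str:
    """Answer a hard question by recursing on human-checkable pieces, then combining them."""
    if depth == 0:
        return answer_directly(question)
    sub_answers = [amplified_answer(q, depth - 1) for q in decompose(question)]
    return combine(question, sub_answers)

# Every intermediate piece is small enough for a human to evaluate,
# so the human judges parts rather than the whole.
print(amplified_answer("Design an optimal traffic system for New York City.", depth=1))
```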
To dive deeper into Iterated Amplification, you can watch a fantastic explanation by science communicator Rob Miles here.
3. Debate
Debate is a useful technique for ensuring that individual AI systems are not deceiving humans. It can also guard against models lying or manipulating to win human approval (see sycophantic behavior).
| The Basic Idea | Everyday Analogy |
| --- | --- |
| Two AI systems argue opposing answers to a question in front of a human judge. Because each debater is rewarded for exposing flaws or deception in the other's argument, lying becomes a losing strategy, and the judge can reach a sound verdict even on questions too complex to evaluate alone. | Say you are a parent to two sweet but mischievous teenagers. When one tells you a suspicious story, you ask the other for their side. Each knows the other will happily point out any lie, so you are far more likely to hear the truth than if you questioned either one alone. |
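As a rough sketch, a debate setup might look like the following, where `generate` is again a hypothetical stand-in for an LLM call and the judging step is simply a human reading the transcript:

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real API in practice."""
    return f"[argument based on: {prompt[-60:]}]"

def debate(question: str, answer_a: str, answer_b: str, rounds: int = 2) -> str:
    """Two models defend opposing answers; a human judge reads the transcript and decides."""
    transcript = f"Question: {question}\nA defends: {answer_a}\nB defends: {answer_b}\n"
    for _ in range(rounds):
        for name in ("A", "B"):
            argument = generate(
                f"{transcript}\nYou are debater {name}. Defend your answer and point out "
                "any flaws or deception in your opponent's latest argument."
            )
            transcript += f"{name}: {argument}\n"
    print(transcript)
    # The human judge only has to pick a winner, not evaluate the question from scratch.
    return input("Judge: which debater was more convincing, A or B? ")
```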
You may have noticed that the problems at the heart of each method are basic reasoning problems, such as self-reflection or problem simplification. Constitutional AI centers on improving self-evaluation through a set of guiding values. Task decomposition is a straightforward approach to solving and evaluating a complex solution piece by piece. Debate is about preventing deception and eliciting proof of reasoning. By applying common sense and imagining AI as if it were a non-AI being, we can focus on creatively solving basic reasoning challenges for LLMs.
These are just three alignment methods. If you'd like to learn more alignment methods that use common reasoning techniques, start here.
Today's most popular method can only get us so far
Another method, Reinforcement Learning from Human Feedback (RLHF), is the top industry choice to align AI systems with human values. Many AI companies, such as OpenAI and Scale AI, rely on RLHF.
However, RLHF's alignment performance leaves much to be desired. Perhaps not coincidentally, RLHF depends heavily on mathematical loss expressions during alignment. This graphic from an Anthropic paper details the performance limitations of RLHF compared to Constitutional AI [1].
We should look to other sources of inspiration for reasoning techniques.
You should join this effort if you can think of more creative reasoning techniques; you may already have started brainstorming other reasoning frameworks while reading. You should especially join if you believe your ideas are obvious and wonder why no one has implemented them yet.
The truth is that most AI alignment research is geographically concentrated in certain regions of the Global North. Researchers today likely have similar academic backgrounds and training. They may even share the same languages, cultures, religions, and ethnicities.
Breakthroughs in AI safety will require diverse perspectives, experiences, and modes of thinking.
We should look towards reasoning and decision-making techniques around the globe for inspiration. AI needs to be safe for everyone, so AI safety should be a globally representative field. AI alignment should even go beyond focusing solely on humans, ensuring AI is safe for the environment and animals.
We can't rely on a small subset of the population to develop the best techniques. Doing so risks marginalizing everyone else impacted by generative AI in the coming years and depriving the field itself of transformative safety advances.
The good news is that inspiration is all around us if we look closely enough. Here are three diverse sources of inspiration from which AI alignment could draw, along with an example of a practical framework each might lead to.
Inspiration from Culture: Haudenosaunee Seven Generations Principle
The Inspiration
This Native American philosophy emphasizes decision-making that benefits both the present generation and the next seven [2]. Rooted in the Haudenosaunee (Iroquois) Great Law of Peace, this approach ensures that future descendants are not voiceless. In practice, the principle prioritizes sustainability and a continuing responsibility for people's welfare.
Application to AI Alignment
Models could incorporate long-term predictions, ensuring decisions align with future sustainability and impact goals. These systems could simulate the downstream consequences of decisions over extended periods, leading to more informed decision-making.
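As one illustrative sketch (not an established method), a decision layer could score candidate actions across a seven-generation horizon. The `simulate_impact` function and the scoring rule below are hypothetical placeholders:

```python
GENERATION_YEARS = 25      # Rough length of one human generation (an assumption).
HORIZON_GENERATIONS = 7    # The planning horizon named by the Seven Generations Principle.

def simulate_impact(action: str, years_ahead: int) -> float:
    """Hypothetical forecaster that scores an action's projected benefit at a future point in time."""
    # Toy placeholder: a real system would use an actual forecasting model here.
    return 1.0 if "sustainable" in action else 1.0 / (1 + years_ahead / GENERATION_YEARS)

def seven_generation_score(action: str) -> float:
    """Average an action's projected impact across the next seven generations,
    so an option that helps today but harms descendants scores poorly."""
    horizons = [g * GENERATION_YEARS for g in range(1, HORIZON_GENERATIONS + 1)]
    return sum(simulate_impact(action, h) for h in horizons) / len(horizons)

def choose(actions: list[str]) -> str:
    """Pick the action with the best long-horizon score rather than the best immediate payoff."""
    return max(actions, key=seven_generation_score)

print(choose(["cheap but polluting plan", "sustainable but slower plan"]))
```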
Inspiration from Nature: Apoptosis
The Inspiration
Apoptosis refers to programmed cell death. This protective mechanism replaces abnormal, damaged, or aging cells with younger and healthier cells. Without apoptosis, uncontrollable cell growth can lead to life-threatening diseases like cancer.
Application to AI Alignment
Researchers could implement mechanisms whereby AI models autonomously recognize misalignment, shut down harmful behaviors, or, in severe cases, even self-destruct by wiping their network weights.
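A minimal sketch of what such an apoptosis-style safeguard might look like, assuming a hypothetical `misalignment_score` monitor and an arbitrarily chosen threshold:

```python
MISALIGNMENT_THRESHOLD = 0.9   # Severity above which the model "self-destructs" (an assumed value).

class MonitoredModel:
    """Sketch of a model wrapped with an apoptosis-style safeguard."""

    def __init__(self, weights):
        self.weights = weights
        self.disabled = False

    def generate(self, prompt: str) -> str:
        """Hypothetical stand-in for the underlying LLM."""
        return f"[model output for: {prompt[:40]}...]"

    def misalignment_score(self, prompt: str, response: str) -> float:
        """Hypothetical monitor that rates how badly a response violates alignment criteria."""
        return 0.0  # Placeholder; a real monitor would analyze the response.

    def respond(self, prompt: str) -> str:
        if self.disabled:
            raise RuntimeError("Model was retired after detecting severe misalignment.")
        response = self.generate(prompt)
        if self.misalignment_score(prompt, response) >= MISALIGNMENT_THRESHOLD:
            self.weights = None      # Apoptosis: wipe the weights rather than risk further harm.
            self.disabled = True
            raise RuntimeError("Severe misalignment detected; model self-destructed.")
        return response

print(MonitoredModel(weights=[0.1, 0.2]).respond("Summarize today's news."))
```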
Inspiration from Governance Models: Checks and Balances
The Inspiration
Checks and balances are a popular governance concept used by the US government and many global institutions such as the International Criminal Court. This concept refers to any system of independent bodies within a single organization that counterbalances each other's influence, ensuring no single body has concentrated power or authority.
Application to AI Alignment
We could build modular AI systems with distinct sub-components focusing on different objectives (e.g., overall goal, ethical considerations, social implications). These agents could check each other's outputs, flagging potential high-risk conflicts or misalignment.
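As a rough sketch, the checking layer could be as simple as passing every output through several independent reviewers, any one of which can block release. The reviewer functions here are hypothetical placeholders:

```python
def goal_reviewer(output: str) -> list[str]:
    """Hypothetical check: does the output actually accomplish the stated task?"""
    return []  # Placeholder; a real reviewer would return a list of objections.

def ethics_reviewer(output: str) -> list[str]:
    """Hypothetical check: does the output violate ethical guidelines?"""
    return ["contains personal data"] if "phone number" in output else []

def social_impact_reviewer(output: str) -> list[str]:
    """Hypothetical check: could the output cause broader social harm?"""
    return []

REVIEWERS = [goal_reviewer, ethics_reviewer, social_impact_reviewer]

def checked_response(output: str) -> str:
    """No single component has the final say: any reviewer can flag the output and block release."""
    flags = [flag for reviewer in REVIEWERS for flag in reviewer(output)]
    if flags:
        return "Response withheld pending review: " + "; ".join(flags)
    return output

print(checked_response("Here is the customer's phone number: ..."))
```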
Improving AI is more intuitive than you think, so the barrier to providing useful perspectives is lower than you think.
AI alignment is far more intuitive than it may initially seem. We can make meaningful strides by drawing on familiar reasoning and decision-making frameworks we use daily.
Thankfully, inspiration is all around us. We can source innovative reasoning techniques by being open to learning from other cultures, from nature, and from governance systems that have endured for hundreds or even thousands of years.
This challenge isn't just for ML researchers or technologists. If you bring a unique background or creative perspective, your contribution is exactly what's needed to tackle AI alignment.
You might have the answers to make AI safer for everyone.
References
[1] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., Elhage, N., Hernandez, D., Hume, T., Johnston, S., Kravec, S., . . . Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. ArXiv. https://arxiv.org/abs/2204.05862
[2] Joseph, B. (2024, April 2). What is the seventh generation principle? Indigenous Corporate Training Inc. https://www.ictinc.ca/blog/seventh-generation-principle