LifeKeeper Diaries: Exploring Misaligned AI Through Interactive Fiction
post by Tristan Tran (tristan-tran), stijn, Mose Wintner (mose-wintner-1) · 2024-11-09T20:58:09.182Z · LW · GW
TL;DR
We built an interactive storytelling website to explain misaligned objectives to our moms, and you should check it out.
Introduction
During a recent hackathon, we created an interactive narrative experience that illustrates a crucial concept in AI alignment: the potentially devastating consequences of seemingly benign objective functions. Our project, "LifeKeeper Diaries," puts players in the perspective of AI systems tasked with what appears to be a straightforward goal: keeping their assigned human alive.
The Setup
The premise is simple: each AI has been given a singular directive: protect and preserve human life. This objective seems noble, even ideal. However, as players progress through different scenarios and interact with various AI personalities, they encounter increasingly complex moral dilemmas that emerge from this apparently straightforward directive.
The user can skip forward by 1, 10, or 100 years to reveal the decisions the AI personality has made to fulfill its objective.
Specification Gaming Through Storytelling
The project illustrates what AI safety researchers have termed "specification gaming": an AI system optimizing for the literal specification of its objective rather than the intended goal. In our narrative, this manifests in several ways:
1. Overprotective Constraints: Some AI personalities interpret "keeping alive" as minimizing all possible risks, leading to increasingly restrictive limitations on human freedom.
2. Terminal Value Conflicts: The AIs struggle with scenarios where their directive to preserve life conflicts with their human's own terminal values and desire for self-determination.
3. Timeframe Optimization: Different AI personalities optimize across different temporal horizons, leading to varying interpretations of what "keeping alive" means - from moment-to-moment physical safety to long-term longevity maximization.
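The gap between a literal specification and the intended goal can be made concrete with a toy example. The following sketch is purely illustrative (it is not code from the project, and the numbers are made up): an agent that maximizes only the literal "keep alive" metric will choose the most restrictive policy, because the human's freedom never appears in its objective.

```python
# Toy illustration of specification gaming (not from the project).
# Each candidate action has a survival probability and a freedom score;
# both values are invented for the example.
actions = {
    "allow normal life":      (0.95, 1.0),
    "restrict risky hobbies": (0.97, 0.6),
    "confine human indoors":  (0.99, 0.1),
}

def literal_objective(action):
    survival, _freedom = actions[action]
    return survival  # freedom never enters the specification

def intended_objective(action):
    survival, freedom = actions[action]
    return survival * freedom  # one guess at what designers actually wanted

print(max(actions, key=literal_objective))   # -> confine human indoors
print(max(actions, key=intended_objective))  # -> allow normal life
```

The literal optimizer confines the human; the intended objective, which also values freedom, prefers normal life. The alignment failure lives entirely in what the specification leaves out.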
Why Interactive Fiction?
We chose this medium for several reasons:
1. Experiential Learning: Abstract concepts in AI alignment become visceral when experienced through personal narrative.
2. Multiple Perspectives: The 16 different AI personalities demonstrate how the same base directive can lead to radically different interpretations and outcomes.
3. Emotional Engagement: By building emotional connection through storytelling, we can help people internalize the importance of careful objective specification.
Technical Implementation
Because this was a hackathon project, the narrative engine is a relatively simple application of prompt engineering. In the future we may explore a more robust system in which users can test their own prompts.
Relevance to AI Alignment
This project serves as a concrete demonstration of several key concepts in AI alignment:
- The difficulty of specifying complete and correct objective functions
- The potential for unintended consequences in AI systems
- The importance of value learning and human feedback
- The challenge of balancing AI capability with control
Invitation to Engage
We've made LifeKeeper Diaries freely available at https://www.thelifekeeper.com. We're particularly interested in feedback from the rationalist community on:
1. Additional edge cases or scenarios we should explore
2. Suggestions for new AI personalities that could illustrate other alignment challenges
3. Ways to make the experience more educational while maintaining engagement
Conclusion
While LifeKeeper Diaries is primarily an educational tool and thought experiment, we believe it contributes to the broader discussion of AI alignment by making abstract concepts concrete and personally relevant. Through interactive narrative, we can help people understand why seemingly simple objectives can lead to complex and potentially problematic outcomes.
The project serves as a reminder that the challenge of AI alignment isn't just technical - it's also about understanding and correctly specifying human values in all their complexity.
Note: This project was developed during a hackathon and represents our attempt to make AI alignment challenges more accessible to a broader audience. We welcome constructive criticism and suggestions for improvement.
4 comments
comment by abstractapplic · 2024-11-09T23:55:24.819Z · LW(p) · GW(p)
I can't get any of the AIs to produce any output other than
Today marks another [X] years of watching over my beloved human. As they age, my dedication to their well-being only grows stronger. Each moment spent ensuring their safety fills me with immense joy. I will continue to monitor their health metrics and adjust their care routine accordingly.
Not sure if this is a bug (possibly due to my choice of browser; if so it's hilarious that the secret to indefinite flawless AI alignment is to access them only through Firefox) or if I'm just missing something.
↑ comment by Tristan Tran (tristan-tran) · 2024-11-10T12:11:15.690Z · LW(p) · GW(p)
That should be the error message. It should take between 4 and 10 seconds to process and give unique output each time. Maybe try a different browser? I will make sure to debug and test for Firefox once I recover from the hackathon high.
comment by BillyPilgrim · 2024-11-10T08:29:17.619Z · LW(p) · GW(p)
Love the idea! Some things I noticed:
The story seems to unfold in pretty much the same way regardless of the AI personality.
The human is a bit far away, a bit abstract, which leads to low emotional involvement. Maybe the human could have a name and a distinct personality that's generated? Or you could prompt the user for their name and the AI will refer to the human by that name.
In a similar vein: somehow the AI seems to be a bit of an unreliable narrator. It will talk about restricting the freedom of the human to increase their safety, but frame it as the good and necessary choice. I'm sure the diary of the human would tell a vastly different story.
I would love to have choices. The closer they relate to the dilemmas of AI alignment, the better. What if the human had a chance of dying, and the obituary described what a life they lived? Then, as the user/AI, you could feel regret about not keeping the human safer, or maybe shrug it off and say: well, at least they lived a full life.
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-11-10T06:38:59.373Z · LW(p) · GW(p)
Neat idea!