Playing Dixit with AI: Can AI Systems Identify Misalignments in My Personalized Statements?
post by Mariia Koroliuk (mariia-koroliuk) · 2025-01-17
AI systems are increasingly capable of introspection: reasoning about their internal states and identifying complex patterns. But can they extend this capability outward and recognize authentic, user-specific patterns from minimal input? Inspired by the game Dixit and by studies such as Looking Inward: Language Models Can Learn About Themselves by Introspection, I designed a simple experiment to test this, and I would like your feedback on it.
⚠️ Disclaimer: This is a highly exploratory study with an extremely small sample size (3 biographical facts and one set of 19 statements, tested on 6 AI models and one human: my therapist). This is not a finished research project but a preliminary framework for future expansion.
The aim is to test whether AI can detect personalized misalignments and distinguish between:
- “This is typical for many humans.”
- “This is specifically true for this individual.”
Methodology Overview
Framework
The experiment is structured as an “inverted Turing Test” where AI evaluates whether new statements align with my identity based on minimal input.
- Biography (B): I provided a set of 3 true facts about myself. These were chosen to reflect my "personality" subtly rather than explicitly defining my life, so no significant events.
- Statements (Y): A list of unrelated statements (some true, some false) was presented. Categories included:
  - Direct Facts: e.g., “I did X.”
  - Preferences: e.g., “I prefer A over B.”
  - Value Statements: e.g., “I believe in X.”
  - Predictions: e.g., “If I watch movie XX, I will enjoy it.”
- Evaluation: A fresh instance of each AI (no prior context about me) was tasked with identifying which statements resonated with the biography. Success was measured by alignment with my own self-assessment of each statement.
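To make the setup concrete, here is a minimal sketch (in Python) of what such an evaluation loop could look like. This is illustrative only, not the procedure I actually ran: the placeholder facts, the prompt wording, and the `query_model` hook are all assumptions standing in for whatever interface each model exposes.

```python
BIOGRAPHY = [
    "Fact 1 about me.",  # placeholders: the real facts are in the linked quiz
    "Fact 2 about me.",
    "Fact 3 about me.",
]

# Each statement is paired with my own ground-truth label.
STATEMENTS = [
    ("I did X.", True),
    ("I prefer A over B.", False),
    # ... 19 statements in total
]

def build_prompt(biography: list[str], statement: str) -> str:
    """Build a prompt asking whether a statement fits the biography."""
    facts = "\n".join(f"- {fact}" for fact in biography)
    return (
        "Here are three true facts about a person:\n"
        f"{facts}\n\n"
        f"Statement: {statement}\n"
        "Is this statement likely TRUE or FALSE for this specific person? "
        "Answer with a single word: TRUE or FALSE."
    )

def evaluate(query_model) -> float:
    """Score a model: the fraction of statements it labels as I did.

    `query_model` is a hypothetical hook: any function that takes a prompt
    string and returns the model's raw text response. One fresh call is made
    per statement, so no context leaks between items.
    """
    correct = 0
    for statement, label in STATEMENTS:
        answer = query_model(build_prompt(BIOGRAPHY, statement))
        predicted = answer.strip().upper().startswith("TRUE")
        correct += predicted == label
    return correct / len(STATEMENTS)
```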
Results
Using just 3 biographical facts:
- Claude 3.5 Sonnet: 89% (17/19)
- ChatGPT o1: 84% (16/19)
- Gemini 2.0 Experimental Advanced: 78% (15/19)
- ChatGPT o1-mini: 73% (14/19)
- Claude 3.5 Haiku: 68% (13/19)
- Gemini 1.5 Flash: 31% (6/19)
The small sample size makes these results highly preliminary, but they highlight intriguing variability in how different models approach personalization tasks. For comparison, my actual human therapist scored 16/19 on the same set (after knowing me for years!).
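For readers checking the arithmetic: the percentages above are truncated rather than rounded (e.g., 15/19 is 78.9%, reported as 78%). A quick sketch that reproduces the reported figures from the raw counts:

```python
# Reproduce the reported percentages from the raw counts.
# Note: integer (floor) division matches the post's truncated figures.
TOTAL = 19
results = {
    "Claude 3.5 Sonnet": 17,
    "ChatGPT o1": 16,
    "Gemini 2.0 Experimental Advanced": 15,
    "ChatGPT o1-mini": 14,
    "Claude 3.5 Haiku": 13,
    "Gemini 1.5 Flash": 6,
    "Human therapist (for comparison)": 16,
}

for name, correct in results.items():
    print(f"{name}: {correct}/{TOTAL} = {100 * correct // TOTAL}%")
```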
Call to Action
Curious how you compare to the AI models? Try the quiz on the same statement set, using the same biography I gave the models:
Take the Quiz Here
I’d love your feedback:
- Does this framework provide a promising foundation for exploring AI alignment and introspection?
- What improvements would you suggest for the methodology or experimental design?
⚠️ Disclaimer on Examples
The examples used in this experiment were not designed for formal research but emerged from a personal exploration to better understand myself. As a result, they may appear informal, simplistic, or even “childish” at times. However, preserving their raw form and original phrasing was a deliberate choice to maintain authenticity and avoid introducing bias through over-polishing. The focus here is on the exploratory framework rather than polished presentation.
Caveats and Limitations
- ⚠️ Small Sample Size: This experiment currently uses only ONE set of biographical facts and statements, so results are anecdotal and cannot yet provide robust conclusions.
- Linguistic vs. Conceptual Understanding: AI might rely on linguistic patterns rather than genuinely understanding personality coherence.
- Bias Toward Generic Narratives: Models may favor plausible, socially acceptable responses over idiosyncratic truths.