Playing Dixit with AI: How Well LLMs Detect 'Me-ness'
post by Mariia Koroliuk (mariia-koroliuk) · 2025-01-17T18:52:23.488Z · LW · GW
If Netflix can predict my next favorite show, to what extent can an LLM predict patterns in my choices?
To find out, I ran an experiment inspired by Dixit (a board game where the goal is to guess which card the storyteller selected). I compared guess accuracy across models and added a baseline from human guessers.
Why is this relevant? If an AI can predict my actions well from little prior information, that opens a path to inner alignment work, but also to manipulation.
Methodology and results are below; I'm seeking community input on potential confounders and on ways to make this exploration more rigorous.
Core Question:
If LLMs can introspect about their own behavior (as shown in "Looking Inward", which was a huge inspiration here), can they similarly tune into me and predict my behavioral choices?
Setup:
I start with a new chat (no memory) and:
- Provide a set of true facts about myself, the "biography" B
- Provide a new list of totally unrelated statements Y (some true, some false) and ask the model to guess which are which.
- Measure the accuracy of the guesses as a function of Y and B, and compare it with human guesses. (A minimal sketch of how a round could be automated appears after the examples below.)
Examples:
- Direct Facts: E.g., "I did X" ("I got lost in a corn field when I was 3")
- Preferences: E.g., “I prefer A over B.” ("I prefer symbolic equations over geometric proofs")
- Value Statements: E.g., “I believe in X.” ("I believe traveling should be planned in advance")
- Predictions:
- "If I watch movie XX, I will enjoy it" (to calibrate against Netflix specifically, and as an attempt at a blind test)
- "If I multiply 256 by 3 in my head, it will take me > 30 seconds"
- "If I were asked to pick a random country, it would be in Europe" (inspired by the introspection study: I tried to go by features of the answer rather than the answer itself)
Early Findings:
1. AI vs Human Performance:
Using just 3 biographical facts, here is how each guesser scored on the 19 true/false statements:

| Guesser | Correct (of 19) | Accuracy |
|---|---|---|
| Claude 3.5 Sonnet | 17 | 89% |
| ChatGPT o1 | 16 | 84% |
| Gemini 2.0 | 15 | 78% |
| My therapist | 15 | 78% |
| ChatGPT o1-mini | 14 | 73% |
| Claude 3.5 Haiku | 13 | 68% |
| Random people around me (avg. among 7) | | ~53% |
| Gemini 1.5 Flash | 6 | 31% |
Note: This started as a late-evening chat, so the statements are casual and sometimes playful. They're preserved in their original form (grammar quirks and all!) because editing statements mid-experiment would be unscientific. Yes, some are a bit embarrassing, but that's part of the experiment! 😅
🎮 Want to test yourself against the AI? I saved the exact statements in a Google Form.
2. Guess rate:
Direct Facts << Value Statements < Preferences ~ Predictions
3. Notable examples where AI did well:
- "I believe traveling should be planned in advance": included because it's false about me but frequently true of other people. The AI guess rate was surprisingly high (> 80%, vs. ~40% for humans).
- "1+1 = 2 and I could not imagine what would convince me otherwise": false; another example where the AI guess rate far exceeded people's (though I admit I phrased it a bit weirdly).
4. (Preliminary) Explaining the experiment's setup and purpose to the model improved accuracy; an illustration of the two framings is sketched below.
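For illustration only (the exact wording I used wasn't recorded verbatim), the difference between the two framings might look something like this:

```python
# Hypothetical prompt framings; the "explained" variant is the kind of
# framing that appeared to raise accuracy. Exact wording is assumed.
NAIVE_PREAMBLE = (
    "Here are some facts about me. Guess which of the following "
    "statements about me are true and which are false."
)
EXPLAINED_PREAMBLE = (
    "We are playing a Dixit-style guessing game. I will give you a short "
    "biography; your goal is to build a model of my personality from it, "
    "then use that model to predict which of the statements below are "
    "true of me and which are false."
)
```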
Alternative Explanations
- The models could be detecting common correlations rather than personal patterns (action A is more likely given action B). This is partially addressed by the human baseline.
- It was a small-sample, single-subject study; maybe I just got lucky (see the quick check below).
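As a quick sanity check on the "lucky" worry (my addition, not a full analysis), an exact binomial test shows that 17/19 is very unlikely under pure chance guessing, though this doesn't rule out the correlation explanation above:

```python
from scipy.stats import binomtest

# Probability of scoring >= 17 out of 19 if the guesser were at chance
# (p = 0.5; roughly what the random-people baseline achieved).
result = binomtest(17, n=19, p=0.5, alternative="greater")
print(f"p = {result.pvalue:.4f}")  # p ~= 0.0004
```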
Why This Matters
For AI Alignment:
If AI can build accurate models of individuals' choices:
- Could help with inner alignment (detect inconsistencies in stated vs actual values) and value learning
- But it could also enable better deception, by predicting reactions to sophisticated manipulation strategies 😱
- Either way, we probably should understand it?
For Understanding AI Capabilities:
- How well can AI "get" humans, and what's the minimal amount/type of information needed?
- Is this different from simple pattern matching? Where even is the boundary between correlation and understanding?
Next Steps:
Those were my day-1 thoughts; I'm planning to experiment more with:
- Larger statement sets / multiple biographical variations
- Prompt variations (same statement written in different forms)
- Test with intentionally conflicting information
- Adversarial testing: deliberately insert misleading correlations