Comments
random spurious accusations with zero factual backing are usually considered town/vanilla/arthurian moves in werewolf games; irl this breeds chaos and is a classic DARVO tactic.
In my experience this is only true for beginner play (where werewolves are often too shy to say anything), and in advanced play it is a bad guy tactic for the same reasons as IRL. Eg I think in advanced Among Us lobbies it's an important skill to subtly push an unproductive thread of conversation without making it obvious that you were the one who distracted everybody.
It's not clear/concrete to me in what ways points 3 and 5 are supposed to invalidate the analogy.
in games, bad guys have a heavy incentive to be uncorrelated (and especially to be seen as being uncorrelated); irl there are very few worlds where regularly agreeing with the now-known-to-be-bad-actors is a positive update on your innocence.
I don't understand this - it reads to me like you're saying a similar thing is true for the game and real life? But that goes against your position.
Perhaps related: Is democracy a fad?
Oh, I see what you're saying now. Thanks for clarifying.
But this would apply to the visual cortex as well right? So it doesn't explain the discrepancy.
I appreciate the charity!
I'm not claiming that people don't care about other people's internal states, I'm saying that it introspectively doesn't feel like that is implemented via empathy (the same part of my world model that predicts my own emotions), but via a different part of my model (dedicated to modeling other people), and that this would solve the "distinguishing-empathy-from-transient-feelings" mystery you talk about.
Additionally (but relatedly), I'm also skeptical that those beliefs are better described as being about other people's internal states rather than as about their social behavior. It seems easy to conflate these if we're not introspectively precise. E.g., if I imagine myself in your Alice example, I imagine Alice acting happy, smiling and uncaring, and only then is there any reaction - I don't even feel like I'm *able* to viscerally imagine the abstract concept (prod a part of my world model that represents it) of "Alice is happy".
But these are still two distinct claims, and the latter assumes the former.
One illustrative example that comes to mind is the huge number of people who experience irrational social anxiety, even though they themselves would never be judgmental if they were in the other person's position.
Thanks for the reply!
- In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
- In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
- When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!
How sure are you that these are instances of empathy (defining it as "prediction by our own latent world model of ourselves being happy/unhappy soon")? If I imagine myself in these examples, it doesn't introspectively feel like I am reacting to an impression of their internal state, but rather like I am directly reacting to their social behavior (e.g., abstractly speaking, a learned reflex of status-reasserting anger when someone else displays high status through happy and calm behavior).
This would also cleanly solve the mysteries of why they don't get updated and how they are distinguished from "other transient feelings" - there's no wrong prediction by the latent world model involved (nothing to be distinguished or updated), and the social maneuvering doesn't get negative feedback.
That's also where some of my instinctive disagreement with that post of yours comes from. But I haven't read through it carefully enough to be sure.
In the specific example of chocolate (unless it wasn't supposed to be realistic), are you sure it doesn't get trained away? I don't think that, upon seeing someone eating chocolate, I immediately imagine tasting chocolate. I feel like the chocolate needs to rise to my attention for other reasons, and only then do I viscerally imagine tasting chocolate.
Jacob Cannell discusses this here: https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know#Data
Katja Grace's p(doom) is 8% IIRC
Ah, I see what you mean! Interesting perspective. The one thing I disagree with is the "gradient" framing - that doesn't seem like the most natural way to see it. It intuitively feels more like a binary: "Is there (accurate) modelling of the counterfactual of your choice being different that actually impacted the choice? If yes, it's acausal. If not, it's not."
I don't think the "zero-computation" case should count. Are two ants in an anthill doing acausal coordination? No, they're just two similar physical systems. Counting this seems to stretch the original meaning; it's in no sense "acausal".
I disagree. There is no acausal coordination because eg the reasoning "If everyone thought like me, democracy would fall apart" does not actually influence many people's choice, ie they would vote due to various social-emotional factors no matter what that reasoning said. It's just a rationalization.
More precisely, when people say "If everyone thought like me, democracy would fall apart", it's not actually the reasoning that it could be interpreted as; it's a vague emotional appeal to loyalty / the identity of a modern liberal / etc. You can tell because it refers to "everyone" instead of a narrow slice of people, it involves no modelling of the specific counterfactual of MY choice, and there's no general understanding of decision theory that would allow this kind of reasoning to happen; imo, any reasonable model of the average person's mind doesn't allow it.
Your model is also straining to explain the extra taxes thing. "Voting is normal, paying extra taxes isn't" is much simpler.
In general, I'm wary of attempts to overly steelman the average person's behavior, especially when there's a "cool thing" like decision theory involved. It feels like a Mysterious Answers to Mysterious Questions kind of thing.
I've been thinking along similar lines, but instinctively, without a lot of reflection, I'm concerned about negative social effects of having an explicit community-wide list of "trusted people".
After thinking about it a little bit, the only hypothesis I could come up with for what's going on in the negation example is that the smaller models understand the Q&A format and understand negation, but the larger models have learned that negation inside a Q&A is unusual and so disregard it.
Very useful, thank you!
Thanks for this post, this looks very useful :) (it comes at a great time for me since I'm starting to work on my first self-directed research project right now).
I'm very interested, but since you've already found someone, please post the results! :)
Thanks! Am probably convinced by the third point, unsure about the others due to not having much time to think at the moment.
This has been my vague intuition as well, and I'm confused as to where exactly people think this argument goes wrong. So I would appreciate some rebuttals to this.
For 9, are you thinking of grokking?
Thanks for the post. A clarifying question: Are you claiming that / do you think that these framings are extensionally equivalent?
Sorry, I should be more specific. Since we are talking about AGI safety, it seems unlikely that running narrow AI faster gets you AGI. I'm not sure if you disagree with that. I also don't understand what you mean by "imitations of augmented humans" and "planning against a human-level imitation".
This "imitating an optimizer" / "optimizing to imitate" dichotomy seems unnecessarily confusing to me. Isn't it just inner alignment / inner misalignment (with the human behavior you're being trained on)? If you're imitating an optimizer, you're still an optimizer.
I must be missing something here. Isn't optimizing necessary for superhuman behavior? So isn't "superhuman behavior" a strictly stronger requirement than "being a mesaoptimizer"? So isn't it clear which one happens first?
Great post. Would add as an example: "While thinking about something and trying to figure out your viewpoint on it, track internal feelings of cognitive dissonance and confusion"
Have you gotten farther with this? It seems like a potentially very impactful thing to me. I also recently had the idea of paying skeptical AI researchers to spend a few hours discussing/debating their reasons for skepticism.