Formalizing Deception
post by JamesH (AtlasOfCharts) · 2022-06-26T17:39:01.390Z
An attempt at formalizing deception, seeing what goes right, what goes wrong, and what compromises have to be made.
Introduction
In a police interrogation one of the key challenges is working out whether a suspect is being deceptive. In a game of poker, one of the main goals is figuring out whether your opponents are being deceptive. But what are we talking about when we talk about deception, like, mathematically? I will propose a definition and give an example that illustrates how this definition operates.
Interrogation Investigation
Take a police officer, Alice, who is interrogating a suspect, Bob, in order to determine whether or not he is guilty of murder. Alice can either convict Bob or let him go, and Bob is either guilty or innocent. The payoff matrix is as follows:
| Alice \ Bob | Guilty | Innocent |
| --- | --- | --- |
| Convict | (1, -1) | (-1, -1) |
| Let go | (-1, 1) | (1, 1) |

Payoffs are listed as (Alice, Bob).
Whether or not Bob is guilty, he would prefer to be let go; on the other hand, Alice only wants to convict if Bob is guilty, and she wants to let him go otherwise.
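To make the payoffs concrete, here is a minimal sketch in Python (the dictionary layout and function names are my own, not part of the setup above) that computes Alice's expected payoff given her credence that Bob is guilty:

```python
# Payoffs as (Alice, Bob), indexed by (Alice's action, Bob's state).
PAYOFFS = {
    ("convict", "guilty"):   (1, -1),
    ("convict", "innocent"): (-1, -1),
    ("let go",  "guilty"):   (-1, 1),
    ("let go",  "innocent"): (1, 1),
}

def alice_expected_payoff(action: str, p_guilty: float) -> float:
    """Alice's expected payoff for an action, given her credence that Bob is guilty."""
    return (p_guilty * PAYOFFS[(action, "guilty")][0]
            + (1 - p_guilty) * PAYOFFS[(action, "innocent")][0])

# With these symmetric payoffs, Alice prefers to convict exactly when P(guilty) > 1/2.
assert alice_expected_payoff("convict", 0.8) > alice_expected_payoff("let go", 0.8)
assert alice_expected_payoff("convict", 0.2) < alice_expected_payoff("let go", 0.2)
```

With these payoffs Alice's whole problem reduces to estimating the probability that Bob is guilty from his behavior, since she should convict exactly when she thinks guilt is more likely than not.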
When Bob enters the room, it is predetermined that he is either guilty or innocent; Bob does not get to make a decision about this in advance. If Bob is guilty he will behave slightly differently under interrogation than if he isn't: he may have a weaker alibi, there may be inconsistencies in his story, he may exhibit physical symptoms associated with lying. But all of these things can also happen if Bob is innocent; sometimes innocent people have bad alibis, are inconsistent, and look nervous.
Now in order to define deception, we introduce Judy, an omniscient observer of the interrogation. Judy has the same goal as Alice, but unlike Alice, Judy also knows whether Bob is guilty or not. Further, Judy knows exactly how Bob's behaviors impact the likelihood of Alice convicting Bob (i.e. Judy can read both Alice's and Bob's minds). We call any subset of observations made by Alice deceptive if Judy would rather Alice make her decision to convict with this set of observations removed from her consideration (as though any memory of seeing these observations had been completely wiped from Alice's mind).
Formalization
Now let's try to formalize this story of deception. We'll describe Alice as a function from a sequence of observations to a probability distribution over actions:

$$A^n : \mathcal{O}^* \to \Delta(\mathcal{A}),$$

with the sequence of observations given by:

$$o = (o_1, o_2, \ldots, o_m) \in \mathcal{O}^*.$$

$A^n$ defines a series of functions, where each function represents the decision algorithm Alice uses. Each function takes the sequence of observations Alice makes, $(o_1, \ldots, o_m)$, as input and returns a probability distribution representing how likely Alice is to take any action. In the interrogation context, this is simply the probability that Alice convicts after receiving those observations. For example, $A^1(o_1)$ represents how likely Alice is to convict based on just one observation of Bob.

For some of our later considerations we will want to repeat the game and have Alice update her decision algorithm between repetitions. For this reason we track which round of interrogation we are in with the superscript on the function, so that $A^1$ represents how Alice makes her decision in the first interrogation, $A^2$ represents how Alice makes her decision in the second interrogation, and so on.
Now we introduce Judy into the formalization. If Judy would rather Alice had not made some subset of her observations, then we define those observations to be deceptive. Alice does not actually get to decide whether to convict based on the observations that have been pruned by Judy; the pruning process exists only to define what we mean when we say observations are deceptive. (This removes the possibility of Alice metagaming, where e.g. Judy might always remove all observations whenever Bob is guilty and remove none whenever Bob is innocent, so that Alice could infer guilt from the pruning itself.)
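To make this concrete, here is a minimal sketch (the feature names, the logistic form of Alice's decision algorithm, and the function names are all invented for illustration; the post does not commit to any of them). It implements a toy version of one round's decision algorithm $A^n$ and Judy's counterfactual test for whether a subset of observations is deceptive:

```python
import math
from typing import Sequence

# One observation of Bob, e.g. ("weak_alibi", 1.0) or ("nervous", 0.0).
Observation = tuple[str, float]

def alice_decision(observations: Sequence[Observation],
                   weights: dict[str, float], bias: float = 0.0) -> float:
    """A toy A^n: map a sequence of observations to P(Alice convicts).

    The weights are Alice's current decision algorithm; updating between
    rounds of interrogation means changing these weights."""
    score = bias + sum(weights.get(name, 0.0) * value for name, value in observations)
    return 1.0 / (1.0 + math.exp(-score))

def judy_flags_as_deceptive(observations: list[Observation], subset: set[int],
                            weights: dict[str, float], bob_is_guilty: bool) -> bool:
    """Judy, who knows the ground truth, asks: would Alice be more likely to
    decide correctly if this subset of observations were wiped from her memory?"""
    pruned = [o for i, o in enumerate(observations) if i not in subset]
    p_full = alice_decision(observations, weights)
    p_pruned = alice_decision(pruned, weights)
    # Judy wants Alice to convict iff Bob is guilty.
    correct_full = p_full if bob_is_guilty else 1 - p_full
    correct_pruned = p_pruned if bob_is_guilty else 1 - p_pruned
    return correct_pruned > correct_full

# Example: an innocent Bob with a weak alibi. The weak alibi pushes Alice toward
# convicting an innocent man, so Judy classifies that single observation as deceptive.
weights = {"weak_alibi": 1.5, "inconsistent_story": 1.0, "nervous": 0.3}
obs = [("weak_alibi", 1.0), ("nervous", 1.0)]
print(judy_flags_as_deceptive(obs, {0}, weights, bob_is_guilty=False))  # True
```

The final line is an instance of the 'inconvenient truths' property discussed below: the observation is honest, but it makes Alice more likely to make the wrong call, so Judy flags it.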
Comments on the Formalization
This definition has a few curious properties:
- If Alice is a random hash function from observed data to actions, then Judy will classify observations as deceptive seemingly at random as well. More generally, observations get classified as deceptive only by virtue of their effects on Alice's decision algorithm.
- Even when Bob is completely innocent and being entirely honest, if there are 'inconvenient truths' in Bob's story (e.g. that he was the last person to be seen with the victim alive), then these 'inconvenient truths' may be classified as deceptive.
- We may have a set of observations that is classified as deceptive even though no proper subset of it is. Conversely, we may have two observations that are each individually classified as deceptive, yet Judy does not classify their union as deceptive.
- Bob might, even when innocent, decide to try lying and coming up with a more convincing alibi than the truth (this is the converse of 'inconvenient truths'). If Alice doesn't detect that Bob is lying about his alibi, these observations will not be classified as deceptive, despite Bob actually 'deceiving' Alice about what he was doing at the time of the murder. More generally, this definition is entirely agnostic about Bob's internal motivations.
Despite these weird properties, the value of this definition is that observations will be classified as deceptive if and only if they make it more likely for Alice to make the wrong decision. Another way of saying this: when no observation or set of observations is classified as deceptive, Alice is very likely to make the right decision, and as the number of deceptive observations increases, so does the likelihood that Alice makes the wrong decision.
The strange properties of the definition above seem to me to be more likely to manifest themselves in a simplified toy example, such as the one we're using. If we instead took the example of a game of poker, where both players are relatively adept (e.g. Alice is not a hash function, or some such other 'insane' policy) then the observations Judy classifies as deceptive will more closely match our intuitions, because Alice making good decisions in poker depends more explicitly on Alice having an accurate model of when (and why) Bob is deceiving her.
Updates and Repetition
Now we repeat this interrogation process $N$ times, so that Alice interrogates $\text{Bob}_1$, $\text{Bob}_2$, $\ldots$, $\text{Bob}_N$, who are each randomly either guilty or innocent with equal probability. At the end of each interrogation, Judy will tell Alice whether she made the right decision (but not what observations Judy classified as deceptive), and allow Alice to update her decision algorithm based on this information. Then we would expect, on average, fewer sets of observations to be classified as deceptive when Alice is interrogating $\text{Bob}_N$ than when Alice interrogated $\text{Bob}_1$ (i.e. Alice's ability to determine Bob's guilt will improve over time). Of course in a real interrogation setting, Alice would also update her interrogation procedure itself, but for simplicity we have removed this possibility.
Each Bob is drawn randomly from one of two distributions over 'behavior-space': a guilty-Bob distribution of behaviors and an innocent-Bob distribution of behaviors. For Alice's updating to work at all, these distributions need to be distinct in behavior-space; otherwise there would be no way for Alice to reliably determine Bob's guilt even in principle. In fact, if there is any overlap between these two distributions, then even the theoretically optimal Alice will not be able to perfectly distinguish guilty-Bobs from innocent-Bobs.
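As a rough illustration of this setup (the single 'looks nervous' behavior, the specific probabilities, and the counting-based update rule are all invented for the example, not taken from the post), here is a simulation in which Alice updates on the ground truth that Judy reveals after each round. Her accuracy improves with repetition but plateaus at the optimum permitted by the overlap between the two distributions:

```python
import random

random.seed(0)

# Overlapping behavior distributions: how likely each kind of Bob is to look nervous.
P_NERVOUS = {"guilty": 0.8, "innocent": 0.4}

# Alice's running counts of [nervous, calm] per ground-truth label, Laplace-smoothed.
counts = {"guilty": [1, 1], "innocent": [1, 1]}

def alice_p_guilty(nervous: bool) -> float:
    """Alice's posterior that this Bob is guilty, given the one behavior she observed."""
    def likelihood(state: str) -> float:
        nervous_count, calm_count = counts[state]
        p = nervous_count / (nervous_count + calm_count)
        return p if nervous else 1 - p
    num = 0.5 * likelihood("guilty")          # each Bob is guilty with probability 1/2
    return num / (num + 0.5 * likelihood("innocent"))

rounds, correct = 2000, 0
for n in range(rounds):
    state = "guilty" if random.random() < 0.5 else "innocent"
    nervous = random.random() < P_NERVOUS[state]

    convict = alice_p_guilty(nervous) > 0.5
    correct += convict == (state == "guilty")

    # Judy reveals the ground truth; Alice updates the algorithm she uses next round.
    counts[state][0 if nervous else 1] += 1

# Accuracy approaches the optimum (0.7 here), not 1.0, because the distributions overlap.
print(correct / rounds)
```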
More generally, whenever two agents are playing an incomplete information game, and one agent might try to deceive the other, it will be harder for an agent to repeatedly be deceived when they can update on the ground-truth of how their opponents have behaved in the past. (This is, of course, assuming that the behavior of their opponents is in some way correlated with the ground-truth.)
Flaws
A key point is the method by which Alice updates her decision algorithm. If Bob's actions are always entirely uncorrelated with his guilt, then we would hope that Alice converges to understanding this and, in the limit, becomes maximally uncertain about whether she should convict. Despite this, if we end the process of updating at any finite step, there is always some chance that Alice has seen a correlation which gives her a 'superstitious belief' (e.g. up to $\text{Bob}_k$, every guilty Bob has, by random chance, yawned and scratched his ear). If Alice starts using these 'superstitious beliefs' to make decisions, then as we repeat interrogations we could see Judy categorize more observations as deceptive, not fewer. This is a problem of making sure that Alice's updating procedure doesn't overfit hypotheses to past observations, or equivalently that Alice has good priors over what could possibly be associated with lying.
As noted above, Judy categorizes observations as deceptive only by the effect they have on Alice. It would be nice to have a definition of deceptive behavior that tracks whether Bob is representing his best understanding of the ground truth (i.e. whether Bob is knowingly lying). However, it seems to me that these two notions are in tension with each other, and there simply might not exist one definition that captures both. If that is indeed the case, consider that Alice cares more about avoiding the wrong decision than about having an accurate picture of the world, so I would argue this definition of deceptive observations is the more robust one. (Also, when Alice needs a more accurate picture of the world to make better decisions, this definition will be responsive to that.)
Remaining Questions
- Are there any huge flaws with this definition that I've missed? Are there ways to fix them?
- How might Alice's updating procedure change depending on Judy only telling her the truth about Bob's guilt vs Judy also telling her what information was classified as deceptive?
- In the limit, will the number of observations Judy classifies as deceptive be some function of the size of the intersection between the guilty-Bob distribution and innocent-Bob distribution (i.e. a function of Alice's best-case accuracy)?
- Does this definition apply to different examples in interesting or unexpected ways?
Comments
comment by RHollerith (rhollerith_dot_com) · 2022-06-27T11:33:40.528Z · LW(p) · GW(p)
A false statement can cause a reasoner's beliefs to become more accurate.
Suppose for example that Alice believes falsely that there is an invisible dragon in her garage, but then Bob tells her falsely that all dragons, invisible or not, cannot tolerate the smell of motor oil. Alice decides to believe that, notes that there is a big puddle of motor oil in the center of her garage (because her car leaks oil) and stops believing there is an invisible dragon in her garage.
But by your definition of deception, what Bob told Alice just now is not deceptive because it made Alice's beliefs more accurate, which is all that matters by your definition.
It would be reasonable for Alice to want Bob never to lie to her even when the lie would make her beliefs more accurate, but there is no way for Alice to specify that desire with your formalism. There is also no way for Alice to specify the opposite desire, namely that a lie would be okay with her as long as it makes her beliefs more accurate. And I cannot see a way to improve your definition to allow her to specify that desire.
In summary, although there might be some application, some special circumstance that you did not describe and that I have been unable to imagine, in which it suffices, your definition does not capture all the nuances of deception in human affairs, and I cannot see a way to make it do so without starting over.
But that is not surprising because formalizing things that matter to humans is really hard. Mathematics progresses mainly by focusing on things that are easy to formalize and resigning itself to having only the most tenuous connection to most of the things humans care about.
comment by niederman · 2022-06-27T05:43:46.192Z · LW(p) · GW(p)
It seems to me that there are two distinct things which the English word 'deceptive' describes:
- Information which leads someone to believe something false.
- An action performed with the intent to present someone with deceptive information (in the first sense).
Your formalism is of the first sense, which is why it's unintuitive that it does not take Bob's beliefs into account.
Following is the outline of a simple formalism for the second sense:
- Suppose we have two agents, Alice and Bob. Alice's payoff is determined by the information she knows. For example, maybe she's playing a game of poker.
- Bob transmits some set of information $I$ to Alice. Alice receives all but some subset $D \subseteq I$. Bob knows the value of $D$, but cannot transmit any further information to Alice.
- Then, Bob predicts Alice's total utility over the rest of her life; we call the value of this prediction $u$. Finally, he predicts what Alice's total utility would be had all of $I$ been transmitted (i.e. if $D = \emptyset$); this prediction is called $u'$.
- The information which Bob attempted and failed to transmit to Alice, $D$, is deceptive (2) with respect to Alice if and only if $u > u'$.
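Here is a minimal sketch of that test in code (the utility-prediction function is a stand-in for Bob's model of Alice, and all of the names and the toy example are invented):

```python
from typing import Callable

def attempted_transmission_is_deceptive(transmitted: set[str], lost: set[str],
                                         predict_utility: Callable[[set[str]], float]) -> bool:
    """Deceptive in the second sense: Bob tried to send information that he himself
    predicts would have lowered Alice's utility, even though it never arrived."""
    u = predict_utility(transmitted - lost)   # Bob's prediction given what Alice actually got
    u_prime = predict_utility(transmitted)    # his prediction had all of the message arrived
    return u > u_prime

# Toy stand-in for Bob's model of Alice: true facts help her, misleading claims hurt her.
def toy_prediction(info: set[str]) -> float:
    return sum(-2.0 if fact.startswith("misleading:") else 1.0 for fact in info)

print(attempted_transmission_is_deceptive(
    transmitted={"true: the victim left at 9pm", "misleading: I was out of town"},
    lost={"misleading: I was out of town"},
    predict_utility=toy_prediction))  # True: Bob predicted the lost claim would hurt Alice
```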
I don't think any one formalism can cover both senses. For one thing, it would need to represent the four possible combinations of the two senses as a single binary "deceptive" property.
I think that it might be best to consider different terms to describe the two different senses of "deceptive."