A follow-up probability question: Data samples with different priors
post by PhilGoetz · 2012-10-25T20:07:01.498Z · LW · GW · Legacy · 9 comments
(Rewritten entirely after seeing pragmatist's answer.)
In an earlier post, helpful people including DanielLC gave me the multiply-odds-ratios method for combining probability estimates given by independent experts with a constant prior, with many comments about what to do when they aren't independent. (DanielLC's method turns out to be identical to summing up the bits of information for and against the hypothesis, which is what I'd expected to be correct.)
I ran into problems applying this, because sometimes the prior isn't constant across samples. Right now I'm combining different sources of information to choose the correct transcription start site for a gene. These bacterial genes typically have from 1 to 20 possible start sites. The prior is 1 / (number of possible sites).
Suppose I want to figure out the correct likelihood multiplier for the information that a start site overlaps the stop of the previous gene, which I will call property O. Assume this multiplier, lm, is constant, regardless of the prior. This is reasonable, since we always factor out the prior. Some function of the prior gives me the posterior probability that a site s is the correct start (Q(s) is true), given that it has the overlap (O(s) is true). That's P(Q(s) | prior=1/numStarts, O(s)).
Suppose I look just at those cases where numStarts = 4 and find that P(Q(s) | numStarts=4, O(s)) = .9. Dividing the posterior odds by the prior odds gives the multiplier:
(9:1) / (1:3) = 27:1
Or I can look at the cases where numStarts=2, and find that in these cases, P(Q(s) | numStarts=2, O(s)) = .95:
(19:1) / (1:1) = 19:1
I want to take one pass through the data and come up with a single likelihood multiplier, rather than binning all the data into different groups by numStarts. I think I can just compute it as
(sum of numerator : sum of denominator) over all cases s_i where O(s_i) is true, where
numerator = (numStarts_i-1) * Q(s_i)
denominator = (1-Q(s_i))
Is this correct?
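As a sanity check on the proposed formula, here is a minimal Python sketch of the one-pass computation. The function name, the (num_starts, q, o) tuple layout, and the sample data are assumptions made for illustration, not part of my actual pipeline. It only confirms that the formula reproduces the binned figure when every sample has the same numStarts; it does not by itself settle the mixed case.

```python
def one_pass_multiplier(samples):
    """Estimate the likelihood multiplier in a single pass over the data.

    samples: iterable of (num_starts, q, o) tuples, one per candidate site,
    where q is 1 if the site is the true start and o is 1 if it overlaps
    the stop of the previous gene. Only sites with o == 1 contribute.
    """
    numerator = 0
    denominator = 0
    for num_starts, q, o in samples:
        if not o:
            continue
        numerator += (num_starts - 1) * q   # (numStarts_i - 1) * Q(s_i)
        denominator += 1 - q                # (1 - Q(s_i))
    if denominator == 0:
        raise ValueError("no overlapping non-start sites; ratio is undefined")
    return numerator / denominator


# Hypothetical data: ten overlapping sites from genes with numStarts = 4,
# nine of which are true starts (posterior 0.9, as in the example above).
four_start_sites = [(4, 1, 1)] * 9 + [(4, 0, 1)]
print(one_pass_multiplier(four_start_sites))  # 27.0, matching 9:1 / 1:3
```

With a single value of numStarts the sum collapses to (numStarts - 1) times the posterior odds, which is the binned calculation.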
9 comments
comment by pragmatist · 2012-10-25T20:28:32.229Z · LW(p) · GW(p)
I'm having a little trouble parsing what you say here, so I might be interpreting your question wrong.
The basic thing to keep in mind is that the prior odds multiplied by the likelihood ratio (what you call the "odds ratio multiplier") give you the posterior odds. Your problem appears to stem from the fact that you are working directly with prior and posterior probabilities, without converting them to odds. To convert P(A) into the odds for A, divide it by 1 - P(A).
In the first case, numStarts is 4, and your prior is 1/numStarts, so your prior odds are 1:3. Since the posterior probability is 0.9, the posterior odds are 9:1. So your likelihood ratio is:
LR = (9:1) / (1:3) = 27
In the second case your prior is 1/2, so prior odds are 1:1. You are assuming the likelihood ratio is the same, so LR = 27. Your posterior odds, then, are 27 * (1:1) = 27:1. This means your posterior probability is 27/28, or 0.96.
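The arithmetic above can be sketched in a few lines of Python. The helper names odds and prob are made up for this illustration, and exact fractions are used so that rounding does not muddy the result.

```python
from fractions import Fraction


def odds(p):
    """Convert a probability into odds in favour: p / (1 - p)."""
    return p / (1 - p)


def prob(o):
    """Convert odds in favour back into a probability: o / (1 + o)."""
    return o / (1 + o)


# First case: numStarts = 4, posterior probability 0.9.
prior_4 = Fraction(1, 4)
posterior_4 = Fraction(9, 10)
lr = odds(posterior_4) / odds(prior_4)   # (9:1) / (1:3)

# Second case: numStarts = 2, same likelihood ratio assumed.
prior_2 = Fraction(1, 2)
posterior_2 = prob(lr * odds(prior_2))

print(lr)            # 27
print(posterior_2)   # 27/28
```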
I hope I understood your question correctly.
Replies from: PhilGoetz
comment by Vaniver · 2012-10-25T23:44:57.716Z · LW(p) · GW(p)
Even after the rewrite, it's not entirely clear to me what your hypotheses are, and what your evidence is. I find it a little worrying that everything has a domain of s, but that sometimes the number of possible starts is 2, and sometimes 4, and s's domain is apparently 1-20?
For example, P(Q(s) | numStarts=4, O(s)) = .9 seems like it should be defined for a particular s, like P(Q(s=1)| numStarts=4, O(s=1)) = .9. It's also worthwhile to figure out what the complementary probability is- is it P(!Q(s=1)| numStarts=4, O(s=1)) = .1 or is it P(Q(s!=1)| numStarts=4, O(s=1))=.1? (Those statements seem like they should be identical, if Q is true for exactly one start site, which may not be the case.) Note that Bayes Rule and likelihood ratios are primarily useful for distinguishing between competing hypotheses, and clarity in thinking about which hypotheses you're entertaining often goes a long way towards clarifying these problems.
Replies from: PhilGoetz
↑ comment by PhilGoetz · 2012-10-25T23:55:03.981Z · LW(p) · GW(p)
Here s means one start. It's an atom in these equations; it doesn't have a domain. Q(s) is 1 if that start is the true start, 0 if it is not. O(s) is 1 if that start overlaps the stop of the previous gene, and 0 if it does not.
Replies from: Vaniver
↑ comment by Vaniver · 2012-10-26T04:13:47.706Z · LW(p) · GW(p)
So, if there are two start sites, you'll have Q(1) and Q(2) and those will add up to 1? Or are Q(1) and Q(2) decoupled (they could both be start sites, or neither)? O(1) would then be the likelihood ratio that the start is at 1 rather than 2?
Part of my confusion is that you state the priors are 1/numStarts, but then you talk about aggregating experiments where numStarts has different values. This could cause major problems- if you thought that it was either 1 or 2 (that is, putting a prior of 0 on 3 and 4), but measured 1, 2, 3, and 4, then you won't be aggregating your data very well (since your 0 prior will eat any observations).
It may be easier to express your experiments in the form of P(observation|Q(s)) and then work out what multiple observations mean from the atomic probabilities, rather than starting from a higher level of abstraction. (At the very least, it'll be easier for us to figure out what's going on!)
Replies from: PhilGoetz
↑ comment by PhilGoetz · 2012-10-26T16:32:32.893Z · LW(p) · GW(p)
"Part of my confusion is that you state the priors are 1/numStarts, but then you talk about aggregating experiments where numStarts has different values."
That's the source of the difficulty. The priors are different for each sample.
It will confuse you to try to assign a number to s. Each 's' is one start site. It's an atom that you make propositions about. It doesn't have content. Biologically, it has a position on the genome, which I use to compute the values of propositions about it.
Replies from: Vaniver
↑ comment by Vaniver · 2012-10-26T21:05:49.594Z · LW(p) · GW(p)
This is my current model of your problem:
You have a set S of start sites, each of which we can make propositions about. Each one of them has some position on the genome.
You're interested in looking at each of the start sites and assessing some property: "does this start site overlap the previous gene's stop site?" If that's true for the particular start site s, we say Q(s)=1; otherwise, Q(s)=0 (using 0 and 1 as synonymous with true and false). This is unknown, so we refer to our uncertainty as P(Q(s)), which might start off as 1/|S| for all s, or might vary with the start site. Knowing P(Q(i)) doesn't tell us anything about P(Q(j)).
When we do an experiment, we get back an observation about s; suppose it signals either "heads" or "tails," which I'll shorten to H or T. We can calculate P(O(s)=H|Q(s)=1) and P(O(s)=H|Q(s)=0), and from that we can calculate the likelihood ratio used to update Q(s). Note that the likelihood ratio is dependent only on the probabilities of H, and thus is totally independent of the prior probability on Q(s).
We can do the experiments in batches: on, say, four sites at once. Each batch gives an H or T reading for each start site, and the probabilities may depend on the number of sites measured at once. Thus, the likelihood ratio will be different based on the batch size. Suppose we call a "heads" result when there are 2 sites tested H2.
Thus, we want to figure out, say, P(Q(s)|&, H2,T4,H4,H4). "&" stands for "all background knowledge," which will basically be our prior. Assuming multiple experiments are independent conditioned on Q(s), we can just multiply the odds contributed by the prior and each of the tests to get one final estimate for Q(s). Assuming we started off with 1:20, H4 contributes 2:1, T4 contributes 1:2, and H2 contributes 4:1, we end up with (1*4*1*2*2):(20*1*2*1*1) = 16:40 = 2:5, and so P(Q(s)) = 2/7.
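The multiplication in that last step can be written as a short Python sketch. The dictionary of per-observation odds simply restates the illustrative numbers above (they are not measured values), and the variable names are made up for this example.

```python
from fractions import Fraction

# Illustrative odds from the paragraph above (assumed, not measured).
prior_odds = Fraction(1, 20)
ratios = {
    "H2": Fraction(4, 1),
    "T4": Fraction(1, 2),
    "H4": Fraction(2, 1),
}
observations = ["H2", "T4", "H4", "H4"]

# Independence conditioned on Q(s) lets us multiply the odds together.
posterior_odds = prior_odds
for obs in observations:
    posterior_odds *= ratios[obs]

posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_odds)  # 2/5
print(posterior_prob)  # 2/7
```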
comment by SarahSrinivasan (GuySrinivasan) · 2012-10-26T17:34:08.309Z · LW(p) · GW(p)
If the sample data that you're using to estimate P(Q(s) | numStarts=4, O(s)) = 0.9 has the same P(numStarts) distribution as the real data you're going to run this over, then you don't need to do anything special; just estimate P(Q(s) | O(s)) directly, caring not about numStarts, and go from there. Since you're not doing that, I assume your sample data and real data have different numStarts distributions.
Here is the information I assume you have to work with. Call numStarts=S.
P(Q | S=k) = 1/k
P(S) in real data
Pd(Q | O,S) from an expert classifier over sample data d, with Pd(S) instead of P(S); in particular, your data d is basically a list of judgments (Q,O,S) that I can aggregate however I choose
P(Q | N) from another bunch of expert classifiers independent of the first
What you'd like is to be able to compute P(Q | N,O) on real data. And to make it nice, do that by
P(Q | N,O) = 1 - 1 / (1 + Odds(Q | N,O))
with
Odds(Q | N,O) = Odds(Q) L(N|Q) L(O|Q)
You already know how to find Odds(Q) and L(N|Q). The question is how to find L(O|Q) on real data given that you have Pd(Q | O,S) rather than P(Q | O,S), the expert's judgment on sample data d rather than real data. The answer as far as I can tell, unless I've missed part of your question or assumptions, is as follows:
L(O|Q) = sum(P(O|Q,S) P(S)) / sum(P(O|~Q,S) P(S))
[note that P(O|Q,S) remains the same across samples]
P(O|Q,S) = P(Q,O|S) / P(Q|S), so (with C=Count)
P(O|Q,S) = C(Q,O,S)/C(S) / (1/k) = k C(Q,O,S)/C(S) and P(O|~Q,S) = (k/(k-1)) C(~Q,O,S)/C(S)
thus
L(O|Q) = sum(k C(Q,O,S=k)/C(S=k) P(S=k)) / sum((k/(k-1)) C(~Q,O,S=k)/C(S=k) P(S=k))
so to calculate L(O|Q) on your real data, first note P(S=k) on your real data, then on your sample data d say
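foreach D in d,
. C[D->k]++   // C[k]: all samples with S=k
. if (D->O) C[D->Q,D->k]++   // C[Q,k] and C[~Q,k]: samples with O, split by Q
foreach k
. L[numerator] += Ps[k] * C[Q,k]/C[k] / (1/k)
. L[denominator] += Ps[k] * C[~Q,k]/C[k] / ((k-1)/k)   // C(~Q,O,S=k), matching the formula above
L(O|Q) = L[numerator]/L[denominator]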
You have to bin your training data, you don't have to bin your test data.
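For anyone who wants something runnable, here is a rough Python sketch of that procedure. The function name, the (q, o, k) tuple layout, and the toy data are all assumptions for illustration. The denominator uses the count of samples with O and not Q, following the formula for L(O|Q) above, and bins with k = 1 are skipped (my own choice) because (k-1)/k is zero there.

```python
from collections import Counter


def overlap_likelihood_ratio(samples, real_p_s):
    """Estimate L(O|Q) from binned sample data, reweighted by real P(S=k).

    samples:  iterable of (q, o, k) judgments from the sample data d
    real_p_s: dict mapping k to P(S=k) on the real data
    """
    total = Counter()    # C(S=k)
    with_o = Counter()   # C(Q=q, O, S=k), keyed by (q, k)
    for q, o, k in samples:
        total[k] += 1
        if o:
            with_o[(bool(q), k)] += 1

    numerator = 0.0
    denominator = 0.0
    for k, p_k in real_p_s.items():
        if k < 2 or total[k] == 0:
            continue  # assumption: skip bins with no ~Q cases or no data
        numerator += p_k * k * with_o[(True, k)] / total[k]
        denominator += p_k * (k / (k - 1)) * with_o[(False, k)] / total[k]
    return numerator / denominator


# Toy sample data built so that P(O|Q) = 0.8 and P(O|~Q) = 0.2 in both bins.
toy_samples = (
    [(1, 1, 2)] * 40 + [(1, 0, 2)] * 10 + [(0, 1, 2)] * 10 + [(0, 0, 2)] * 40
    + [(1, 1, 4)] * 20 + [(1, 0, 4)] * 5 + [(0, 1, 4)] * 15 + [(0, 0, 4)] * 60
)
toy_ratio = overlap_likelihood_ratio(toy_samples, {2: 0.5, 4: 0.5})
print(toy_ratio)  # close to 4, i.e. 0.8 / 0.2
```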
Edit: I found and fixed a couple of errors so there are probably more. Think, debug, and test for yourself as usual. :D