Distinguishing logistic curves

stuart_armstrong

post by Stuart_Armstrong · 2020-05-15T11:38:04.516Z · LW · GW · 0 comments

  Logistic curves
    Figuring out the right curve
    Noisy Sampling
    Bounding results
  How to sample
    Sampling very large positive or negative values
    Finding (any) turning point
    Other difficulties
  Proof

This post will attempt to formalise the intuition that "it's hard to figure out the turning point of a logistic curve, at least until after that turning point". A short visual "proof" can also be found here [LW · GW].

Logistic curves

The logistic curves look like this:

Logistic curves can be specified by three parameters, $c$ , $l > 0$ , and $k > 0$ . Their equation is then:

$F_{c, l, k} (x) = l \frac{e^{k x}}{e^{k (x - c)} + 1} .$

Note that this $l$ is different from that in this article. The turning point of this curve is at $x = c$ (where it takes the value of $l e^{k c} / 2$ ) while its supremum is $l e^{k c}$ ; it tends to this value as $x \to \infty$ . Take the limit as $c \to \infty$ as being the exponentials:

$F_{\infty, l, k} (x) = l e^{k x} .$
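The formula and its three landmark values can be checked numerically. The following is a minimal sketch, not part of the original post; the function name `logistic` and the parameter values are arbitrary choices for illustration. It evaluates the formula directly, so very large `k * x` would overflow.

```python
import math

def logistic(x, c, l, k):
    """F_{c,l,k}(x) = l * e^{kx} / (e^{k(x-c)} + 1); c = math.inf gives l * e^{kx}."""
    return l * math.exp(k * x) / (math.exp(k * (x - c)) + 1.0)

c, l, k = 2.0, 0.5, 1.5            # arbitrary illustrative parameters
supremum = l * math.exp(k * c)

# Value at the turning point x = c, next to half the supremum.
print(logistic(c, c, l, k), supremum / 2)
# Far to the right of the turning point, next to the supremum itself.
print(logistic(c + 20.0, c, l, k), supremum)
# The c -> infinity limit, next to the pure exponential l * e^{kx}.
print(logistic(1.0, math.inf, l, k), l * math.exp(k * 1.0))
```

Each printed pair should agree to within floating-point error.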

Figuring out the right curve

We'll imagine a simple Bayesian setup. An analyst of logistic curves is seeing data from one distribution, and has two hypotheses about it: $F_{C, L, K}$ , for values $C$ , $L$ , and $K$ , and $F_{c, l, k}$ with values $c$ , $l$ , and $k$ . We'll designate $F_{C, L, K}$ by $F$ and $F_{c, l, k}$ by $f$ .

Now, the true distribution is $F$ , but the analyst doesn't know that. The question we're asking is thus:

Starting from an equal prior on $F$ and $f$ , how much of what kind of observation will the analyst need to establish that $F$ is the true underlying distribution?

Noisy Sampling

If the analyst can sample noiselessly from the curve, then three samples should generally suffice to fully establish $F$ , and one sample should generally suffice to distinguish $F$ from $f$ . So we'll consider the (realistic) situation where there is noise in the samples.

So assume the analyst samples $n$ points, at $\to x = (x_{1}, x_{2}, \dots, x_{n})$ . In return, it gets $n$ values, $\to y = (y_{1}, y_{2}, \dots, y_{n})$ ; these are sampled independently from $N (F (x_{i}), σ_{i}^{2})$ . This is a normal distribution with mean $F (x_{i})$ and standard deviation $σ_{i}$ .

The analyst is assumed to know the vector $\to σ = (σ_{1}, \dots σ_{n})$ , and indeed everything about this setup, with one exception: whether the means of these normal distributions are $F (x_{i})$ or $f (x_{i})$ .

Let $P_{a}$ be the analyst's probability distribution. Their prior gives equal weight to both hypotheses: $P_{a} (F) = P_{a} (f) = 1 / 2$ . Let $O_{\to x}^{\to y}$ be the analyst observing $\to y$ after sampling from $\to x$ ; their posterior is then $P_{a} (f ∣ O_{\to x}^{\to y})$ .

Note that, from our perspective, the posterior $P_{a} (f ∣ O_{\to x}^{\to y})$ depends on the random $\to y$ , so it is a random variable whose distribution we know; write $P_{a} (f ∣ \to x)$ for it. Say that:

$\to x$ establishes the difference between $F$ and $f$ if the expectation of $P_{a} (f ∣ \to x)$ is less than $1 / 16$ .

We could choose other criteria, and this is a relatively loose one. It only requires three bits of information in favour of $F$ over $f$ . Note that since $P_{a} \geq 0$ , we can get probability bounds on $P_{a}$ as well, from this result; for instance:

If $E [P_{a} (f ∣ \to x)] \leq q / 4$ , then with probability at least $3 / 4$ , $P_{a} (f ∣ \to x) \leq q$ .

So, for instance, our criterion above ensures that with probability at least $3 / 4$ , $P_{a} (f ∣ \to x) \leq 1 / 4$ . Conversely, since $P_{a} \leq 1$ , probability bounds on $P_{a}$ translate into expectation bounds, making the two approaches loosely equivalent. We'll use expectation bounds, as they are more natural for this random variable.
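Both directions can be spelled out in one line each. This is a sketch of the standard argument (Markov's inequality), added here for completeness rather than taken from the original post. For any $q > 0$ with $E [P_{a} (f ∣ \to x)] \leq q / 4$ :

$P (P_{a} (f ∣ \to x) > q) \leq \frac{E [P_{a} (f ∣ \to x)]}{q} \leq \frac{q / 4}{q} = \frac{1}{4} .$

In the other direction, splitting the expectation according to whether $P_{a} (f ∣ \to x)$ exceeds $q$ , and using $P_{a} \leq 1$ on the part where it does:

$E [P_{a} (f ∣ \to x)] \leq q + P (P_{a} (f ∣ \to x) > q) .$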

Bounding results

Our first result, proven in later sections, is a lower bound on the expectation of $P_{a} (f ∣ \to x)$ :

$\begin{matrix} E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} \prod_{i = 1}^{n} [1 - erf (\frac{δ_{i}}{σ_{i} 2 \sqrt{2}})] . & (1) \end{matrix}$

Here $erf$ is the error function and $δ_{i}$ is the absolute difference between $F (x_{i})$ and $f (x_{i})$ . We can then get the slightly looser but more easily computable bound:

$\begin{matrix} E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} [1 - \sum_{i = 1}^{n} \frac{δ_{i}}{σ_{i} \sqrt{2 π}}] . & (2) \end{matrix}$
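Both bounds are cheap to evaluate, and the quantity they bound can be estimated by simulation. The sketch below is an illustration only: the helper names, the three sample values of $F$ and $f$ , and the noise level are hypothetical choices, not taken from the post. It draws the $y_{i}$ from the true curve, computes the analyst's posterior on $f$ , and averages.

```python
import math
import random

SQRT_2PI = math.sqrt(2.0 * math.pi)

def bound_1(deltas, sigmas):
    """Lower bound (1): 1/2 * prod_i [1 - erf(delta_i / (sigma_i * 2 * sqrt(2)))]."""
    prod = 1.0
    for d, s in zip(deltas, sigmas):
        prod *= 1.0 - math.erf(d / (s * 2.0 * math.sqrt(2.0)))
    return 0.5 * prod

def bound_2(deltas, sigmas):
    """Lower bound (2): 1/2 * [1 - sum_i delta_i / (sigma_i * sqrt(2 pi))]."""
    return 0.5 * (1.0 - sum(d / (s * SQRT_2PI) for d, s in zip(deltas, sigmas)))

def mc_expected_posterior(F_vals, f_vals, sigmas, trials=20000, seed=0):
    """Monte Carlo estimate of E[P_a(f | x)] when the data really come from F."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        log_ratio = 0.0  # log p_1(y) - log p_2(y), summed over the samples
        for Fx, fx, s in zip(F_vals, f_vals, sigmas):
            y = rng.gauss(Fx, s)
            log_ratio += ((y - fx) ** 2 - (y - Fx) ** 2) / (2.0 * s * s)
        total += 1.0 / (1.0 + math.exp(log_ratio))  # posterior on f, equal priors
    return total / trials

F_vals = [0.30, 0.40, 0.50]   # hypothetical values of F at three sample points
f_vals = [0.32, 0.43, 0.46]   # hypothetical values of f at the same points
sigmas = [0.05, 0.05, 0.05]
deltas = [abs(a - b) for a, b in zip(F_vals, f_vals)]

b1, b2 = bound_1(deltas, sigmas), bound_2(deltas, sigmas)
estimate = mc_expected_posterior(F_vals, f_vals, sigmas)
print(b2, b1, estimate)  # expected ordering: b2 <= b1 <= estimate, up to Monte Carlo error
```

The simulation is only a sanity check on the direction of the inequalities; it says nothing about how tight they are in general.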

How to sample

Sampling very large positive or negative values

Note that:

$0 \leq F_{c, l, k} (x) = l \frac{e^{k x}}{e^{k (x - c)} + 1} \leq l e^{k x} .$

Hence we can bound the $δ_{i}$ via:

$δ_{i} = | F (x_{i}) - f (x_{i}) | \leq max (L e^{K x_{i}}, l e^{k x_{i}}) .$

Let $m (x_{i}) = max (L e^{K x_{i}}, l e^{k x_{i}})$ ; note this is an increasing function, exponential for very negative $x_{i}$ .

Assume we sample $n^{'}$ different $x_{i}$ values below a very negative $X$ ; then if $σ^{-}$ is the minimum of all the $σ_{i}$ for $x_{i} \leq X$ , the contribution of these $n^{'}$ points to the expectation bound is at most $0$ and at least:

$- n^{'} \frac{1}{σ^{-} \sqrt{2 π}} m (X) .$

This gives our result for very negative values:

If noise is irreducible below $σ^{-}$ , then sampling below a very negative $X$ will have very little impact on the analyst's posterior. To get a better result, increasing the $X$ (exponential effect) is generally more powerful than decreasing $σ^{-}$ (inverse linear effect), and much more powerful than getting more samples (linear effect).
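The three effects can be compared directly on the formula above. A minimal sketch, assuming for illustration the parameter values $L = K = 1$ , $l = 0.51$ and $k = 0.69$ (borrowed from the worked example later in this post) and a hypothetical helper name `tail_contribution`:

```python
import math

def m(x, L, K, l, k):
    """Upper bound max(L e^{Kx}, l e^{kx}) on |F(x) - f(x)|."""
    return max(L * math.exp(K * x), l * math.exp(k * x))

def tail_contribution(n_prime, sigma_minus, X, L, K, l, k):
    """Most negative possible contribution of n' samples below X to the bracket in (2)."""
    return -n_prime * m(X, L, K, l, k) / (sigma_minus * math.sqrt(2.0 * math.pi))

params = (1.0, 1.0, 0.51, 0.69)   # L, K, l, k: illustrative values only
base = tail_contribution(100, 0.005, -20.0, *params)

print(base)                                                    # tiny and negative
print(tail_contribution(200, 0.005, -20.0, *params) / base)    # doubling the samples
print(tail_contribution(100, 0.0025, -20.0, *params) / base)   # halving the noise
print(tail_contribution(100, 0.005, -10.0, *params) / base)    # raising X by 10
```

Doubling the samples or halving the noise each roughly doubles the possible contribution, whereas raising $X$ by 10 multiplies it by about $e^{6.9}$ with these parameters.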

The behaviour for large positive $x_{i}$ is also clear: unless $l e^{k c} = L e^{K C}$ , $f$ and $F$ must have different asymptotes. So as long as there is an upper bound $σ^{+}$ on the noise, sampling the curve at large values will cause the expectation of $P_{a} (f ∣ \to x)$ to converge to $0$ . For large $x_{i}$ , this is essentially trying to distinguish $N (l e^{k c}, (σ^{+})^{2})$ from $N (L e^{K C}, (σ^{+})^{2})$ , so each extra sample applies a multiplicative factor to the expected value of $P_{a} (f ∣ \to x)$ . So, for samples at large $x_{i}$ , the probability of the wrong function converges geometrically to zero in the number of samples.

Finding (any) turning point

So, distinguishing $F$ and $f$ for very low samples is very hard, but distinguishing them for very high samples is generally not very useful. But enough about asymptotic behaviour. The question is: what happens in between, closer to the turning points $C$ and $c$ of $F$ and $f$ ?

We can make some scaling and translation choices to simplify $F$ , setting $C = 0$ and $L = K = 1$ . So the turning point is at $0$ ( $y$ value $1 / 2$ ) and the supremum is $1$ :

$F (x) = \frac{e^{x}}{e^{x} + 1} = \frac{1}{1 + e^{- x}} .$

Assume now that the noise $σ_{i}$ is a constant $σ$ . We want $f$ to have a different turning point, so that we can see how easy it is to identify this turning point. Let's choose the worst possible scenario: $f$ is an exponential function with no turning point:

$f (x) = l e^{k x} .$

So, how can the analyst sample so that they have the greatest possible chance of distinguishing between a true function with a turning point at $0$ , and a false function with no turning point at all?

We have two free variables: the $k$ and $l$ of $f$ , and we typically want to see how well we can do when sampling below a given $X$ . For constant $σ$ , the elements of the bound are given by:

$\frac{δ_{i}}{σ \sqrt{2 π}} = \frac{| F (x_{i}) - f (x_{i}) |}{σ \sqrt{2 π}} = \left| \frac{l e^{(k + 1) x_{i}} + l e^{k x_{i}} - e^{x_{i}}}{(e^{x_{i}} + 1) \sqrt{2 π}} \right| \frac{1}{σ} .$

Define $d (l, k) (x)$ as this function, without the $σ$ term. We'll now consider $X = 0$ , ie we are sampling at any point before the turning point. Then some experimentation allows us to minimize the peak of $d (l, k) (x)$ over negative values, by setting $l = 0.51$ and $k = 0.69$ ; given these values, $d (l, k) (x)$ is bounded above by $0.007$ for all $x \leq 0$ .
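The quoted ceiling of $0.007$ can be checked with a simple grid scan. This is a rough sketch, not the experimentation the post refers to: the helper names, the grid resolution and the lower cut-off of $-30$ are arbitrary choices (below that cut-off both curves are negligibly small).

```python
import math

def d(l, k, x):
    """|F(x) - f(x)| / sqrt(2 pi) for F(x) = 1 / (1 + e^{-x}) and f(x) = l e^{kx}."""
    F = 1.0 / (1.0 + math.exp(-x))
    f = l * math.exp(k * x)
    return abs(F - f) / math.sqrt(2.0 * math.pi)

def max_d_below(l, k, X, x_min=-30.0, steps=30000):
    """Grid approximation to the maximum of d(l, k)(x) over x_min <= x <= X."""
    return max(d(l, k, x_min + (X - x_min) * i / steps) for i in range(steps + 1))

peak = max_d_below(0.51, 0.69, 0.0)
print(peak)  # should come out slightly below 0.007
```

Calling `max_d_below` with other values of `l` and `k` gives a quick way to see how sensitive the peak is to the choice of exponential.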

Consequently we can use equation (2) to get a bound:

$\begin{matrix} E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} [1 - \frac{n}{σ} 0.007] \end{matrix}$

To establish the difference between $F$ and $f$ , we need this below $1 / 16$ . Consequently, we need $\frac{n}{σ} 0.007 \geq \frac{7}{8}$ , or

$\frac{n}{σ} \geq 125.$

So if the noise is $1 / 200$ , ie $1 \%$ of the value at the turning point, a single data point might suffice. But if the noise is $10 \%$ of the value at the turning point, then at least seven samples are needed.

Anyway, that's all the way to the turning point; what about if $X$ is chosen so that the value $F (X)$ is $1 / 3$ (two thirds of the value at the turning point) or $1 / 4$ (a half of the value at the turning point)? To get these, we need $X = - log (2)$ and $X = - log (3)$ , respectively. We'll also look at past the turning point, $X = log (2)$ and $log (3)$ .

Optimising $l$ and $k$ for all five situations gives:

For $X = - log (3)$ , $\frac{n}{σ} 0.0015 \geq \frac{7}{8}$ or $\frac{n}{σ} \geq 583$ .
For $X = - log (2)$ , $\frac{n}{σ} 0.0027 \geq \frac{7}{8}$ or $\frac{n}{σ} \geq 324$ .
For $X = 0$ , $\frac{n}{σ} 0.007 \geq \frac{7}{8}$ or $\frac{n}{σ} \geq 125$ .
For $X = log (2)$ , $\frac{n}{σ} 0.015 \geq \frac{7}{8}$ or $\frac{n}{σ} \geq 58$ .
For $X = log (3)$ , $\frac{n}{σ} 0.021 \geq \frac{7}{8}$ or $\frac{n}{σ} \geq 41$ .
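The thresholds in this list are just $\frac{7}{8}$ divided by the corresponding peak value. A small sketch that reproduces them from the rounded peaks quoted above, using exact fractions so that $0.875 / 0.007$ comes out as exactly $125$ ; the dictionary and helper name are illustrative only:

```python
from fractions import Fraction
import math

# Peak values of d(l, k) as quoted in the list above, keyed by the cut-off X.
peaks = {"-log(3)": "0.0015", "-log(2)": "0.0027", "0": "0.007",
         "log(2)": "0.015", "log(3)": "0.021"}

def required_ratio(peak):
    """Smallest n / sigma for which bound (2) can drop to 1/16."""
    return Fraction(7, 8) / Fraction(peak)

sigma = Fraction(1, 20)  # noise at 10% of the turning-point value
for X, peak in peaks.items():
    ratio = required_ratio(peak)
    # Truncated threshold (as in the list) and the implied minimum n for this sigma.
    print(X, math.floor(ratio), math.ceil(ratio * sigma))
```

The thresholds in the list are truncated to whole numbers; the exact ratios are slightly larger in every case except $X = 0$ .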

But equation (2) gives poor bounds for low $σ$ . Using equation (1) instead, for $σ = 1 / 200$ ( $1 \%$ of turning point $y$ -value) and $σ = 1 / 20$ ( $10 \%$ of turning point $y$ -value), gives the number $n$ of samples needed as:

$\begin{array}{c|cc}
X & σ = 1 / 200 & σ = 1 / 20 \\ \hline
- log (3) & 6 & 69 \\
- log (2) & 3 & 38 \\
0 & 1 & 14 \\
log (2) & 1 & 6 \\
log (3) & 1 & 5
\end{array}$
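These counts can be recovered from equation (1) by assuming every sample sits at the peak of $d (l, k)$ , so that each one contributes the same factor. The sketch below makes that assumption explicit and uses the rounded peak values quoted earlier; the function name is hypothetical, and `math.erfc` is used for $1 - erf$ to avoid cancellation.

```python
import math

def samples_needed(peak, sigma, target=1.0 / 16.0):
    """Smallest n with 1/2 * (1 - erf(z))**n <= target, taking every delta_i at the peak."""
    delta = peak * math.sqrt(2.0 * math.pi)               # undo the 1/sqrt(2 pi) in d(l, k)
    factor = math.erfc(delta / (sigma * 2.0 * math.sqrt(2.0)))  # 1 - erf(z)
    return max(1, math.ceil(math.log(2.0 * target) / math.log(factor)))

peaks = [0.0015, 0.0027, 0.007, 0.015, 0.021]   # X = -log 3, -log 2, 0, log 2, log 3
for peak in peaks:
    print(peak, samples_needed(peak, 1.0 / 200.0), samples_needed(peak, 1.0 / 20.0))
```

Because the inputs are rounded peaks, agreement with the table is a consistency check rather than an independent derivation.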

Other difficulties

The bounds above are only good if the values are sampled independently and close to the peak of the $d (l, k)$ function. If the values are not independent - as values sampled close to each other tend not to be - then more must be sampled, and the same goes if the values are sampled away from the peaks.

The other issue is that, here, we've first optimised $l$ and $k$ for minimal peak of $d (l, k)$ , then assumed the best $x_{i}$ were sampled. We need to consider the opposite situations, too: given the sampled $x_{i}$ , optimise $l$ and $k$ . So, even if $n$ samples are enough to distinguish $F$ from this specific $f$ , there are other exponential functions $F_{\infty, l, k}$ that would be harder to distinguish from $F$ .

Proof

This section will prove the bounds in equations (1) and (2).

By Bayes rule:

$\begin{matrix} P_{a} (f ∣ O_{\to x}^{\to y}) & = \frac{P_{a} (O_{\to x}^{\to y} ∣ f) P_{a} (f)}{P_{a} (O_{\to x}^{\to y})} = \frac{P_{a} (O_{\to x}^{\to y} ∣ f) P_{a} (f)}{P_{a} (O_{\to x}^{\to y} ∣ f) P_{a} (f) + P_{a} (O_{\to x}^{\to y} ∣ F) P_{a} (F)} = \frac{P_{a} (O_{\to x}^{\to y} ∣ f)}{P_{a} (O_{\to x}^{\to y} ∣ f) + P_{a} (O_{\to x}^{\to y} ∣ F)}, \end{matrix}$

since the prior probabilities are equal. Since the analyst knows the true variances, $P_{a} (O_{\to x}^{\to y} ∣ f) = P (O_{\to x}^{\to y} ∣ f)$ and similarly for $F$ : we can replace the analyst's probabilities with the true probabilities. So, contracting $P (O_{\to x}^{\to y} ∣ F)$ as $p_{1} (\to y)$ and $P (O_{\to x}^{\to y} ∣ f)$ as $p_{2} (\to y)$ , we get:

$\begin{matrix} P_{a} (f ∣ O_{\to x}^{\to y}) & = \frac{p_{2} (\to y)}{p_{2} (\to y) + p_{1} (\to y)} = \frac{1}{1 + p_{1} (\to y) p_{2} (\to y)^{- 1}} . \end{matrix}$

To get the true expectation of this $P_{a}$ , we need to integrate over the possible values of $\to y$ , weighted by the true probability $P (O_{\to x}^{\to y} ∣ F) = p_{1} (\to y)$ of this happening:

$\begin{matrix} E [P_{a} (f ∣ \to x)] & = \int \frac{1}{1 + p_{1} (\to y) p_{2} (\to y)^{- 1}} p_{1} (\to y) d \to y = \int \frac{1}{p_{1} (\to y)^{- 1} + p_{2} (\to y)^{- 1}} d \to y . \end{matrix}$

Note that $p_{1} (\to y)$ and $p_{2} (\to y)$ are both (strictly) positive, and that $1 / (p_{1} (\to y)^{- 1} + p_{2} (\to y)^{- 1})$ is half the harmonic mean of the two.

The harmonic mean of any number of positive elements is bounded below by the minimum value of its arguments. Hence: $\begin{matrix} E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} \int min (p_{1} (\to y), p_{2} (\to y)) d \to y . \end{matrix}$

Now, since the noise is independent, $p_{j} (\to y) = \prod_{i = 1}^{n} p_{j} (y_{i})$ where $p_{1} (y_{i}) = P (y_{i} ∣ x_{i}, F)$ and $p_{2} (y_{i}) = P (y_{i} ∣ x_{i}, f)$ . For positive elements, the minimum of two products is greater than or equal to the product of minimums, so

$\begin{matrix} E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} \int_{- \infty}^{\infty} \prod_{i = 1}^{n} min (p_{1} (y_{i}), p_{2} (y_{i})) d \to y \geq \frac{1}{2} \prod_{i = 1}^{n} \int_{- \infty}^{\infty} min (p_{1} (y_{i}), p_{2} (y_{i})) d y_{i} . \end{matrix}$

The expressions $min (p_{1} (y_{i}), p_{2} (y_{i}))$ can be expressed analytically. If $φ$ is the probability density function of $N (0, 1)$ , the normal distribution with mean $0$ and variance $1$ , then

$\begin{matrix} p_{1} (y_{i}) & = \frac{1}{σ_{i}} φ (\frac{F (x_{i}) - y_{i}}{σ_{i}}), \\ p_{2} (y_{i}) & = \frac{1}{σ_{i}} φ (\frac{f (x_{i}) - y_{i}}{σ_{i}}) . \end{matrix}$

So the two curves are normal curves with the same variance and means $F (x_{i})$ and $f (x_{i})$ . Assume, without loss of generality, that $F (x_{i}) \leq f (x_{i})$ . Then the two functions will be equal at the midpoint $μ_{i} = (F (x_{i}) + f (x_{i})) / 2$ , and for $y_{i} \leq μ_{i}$ , $p_{1} (y_{i})$ is higher, while for $y_{i} \geq μ_{i}$ , $p_{2} (y_{i})$ is higher.

Thus $min (p_{1} (y_{i}), p_{2} (y_{i})) = \begin{cases} \frac{1}{σ_{i}} φ (\frac{f (x_{i}) - y_{i}}{σ_{i}}), & y_{i} \leq μ_{i}, \\ \frac{1}{σ_{i}} φ (\frac{F (x_{i}) - y_{i}}{σ_{i}}), & y_{i} \geq μ_{i} . \end{cases}$

If $δ_{i} = | F (x_{i}) - f (x_{i}) |$ is the distance between the two peaks, this becomes: $min (p_{1} (y_{i}), p_{2} (y_{i})) = \begin{cases} \frac{1}{σ_{i}} φ (\frac{μ_{i} + δ_{i} / 2 - y_{i}}{σ_{i}}), & y_{i} \leq μ_{i}, \\ \frac{1}{σ_{i}} φ (\frac{μ_{i} - δ_{i} / 2 - y_{i}}{σ_{i}}), & y_{i} \geq μ_{i} . \end{cases}$

Since the integral of $φ$ from $- \infty$ to $y$ is $\frac{1}{2} [1 + erf (y / \sqrt{2})]$ , for $erf$ the error function, we can bound the expected probability by:

$\begin{aligned}
E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} \prod_{i = 1}^{n} \int_{- \infty}^{\infty} min (p_{1} (y_{i}), p_{2} (y_{i})) d y_{i} \\
& = \frac{1}{2} \prod_{i = 1}^{n} (\int_{- \infty}^{μ_{i}} \frac{1}{σ_{i}} φ (\frac{μ_{i} + δ_{i} / 2 - y_{i}}{σ_{i}}) d y_{i} + \int_{μ_{i}}^{\infty} \frac{1}{σ_{i}} φ (\frac{μ_{i} - δ_{i} / 2 - y_{i}}{σ_{i}}) d y_{i}) \\
& = \frac{1}{2} \prod_{i = 1}^{n} (\int_{- \infty}^{- δ_{i} / 2} \frac{1}{σ_{i}} φ (\frac{- y_{i}}{σ_{i}}) d y_{i} + \int_{δ_{i} / 2}^{\infty} \frac{1}{σ_{i}} φ (\frac{- y_{i}}{σ_{i}}) d y_{i}) \\
& = \frac{1}{2} \prod_{i = 1}^{n} (\frac{1}{2} [1 + erf (\frac{- δ_{i} / 2}{σ_{i} \sqrt{2}})] + \frac{1}{2} [1 + erf (\frac{- δ_{i} / 2}{σ_{i} \sqrt{2}})]) \\
& = \frac{1}{2} \prod_{i = 1}^{n} [1 - erf (\frac{δ_{i}}{σ_{i} 2 \sqrt{2}})] .
\end{aligned}$
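The single-sample step in this chain, that the integral of $min (p_{1}, p_{2})$ equals $1 - erf (\frac{δ_{i}}{σ_{i} 2 \sqrt{2}})$ , can be checked by direct numerical integration. A minimal sketch with hypothetical helper names and arbitrary test values, using a midpoint rule over a window of ten standard deviations on either side of the two means:

```python
import math

def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def overlap_numeric(F_x, f_x, sigma, steps=20000):
    """Midpoint-rule integral of min(p_1, p_2); tails beyond 10 sigma are ignored."""
    lo = min(F_x, f_x) - 10.0 * sigma
    hi = max(F_x, f_x) + 10.0 * sigma
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += min(normal_pdf(y, F_x, sigma), normal_pdf(y, f_x, sigma))
    return h * total

def overlap_closed_form(F_x, f_x, sigma):
    """1 - erf(delta / (sigma * 2 * sqrt(2))) with delta = |F_x - f_x|."""
    return 1.0 - math.erf(abs(F_x - f_x) / (sigma * 2.0 * math.sqrt(2.0)))

numeric = overlap_numeric(0.50, 0.51, 0.05)
closed = overlap_closed_form(0.50, 0.51, 0.05)
print(numeric, closed)  # the two values should agree to several decimal places
```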

For positive values, the error function is concave, and it has derivative $2 / \sqrt{π}$ at the origin, so

$erf (\frac{δ_{i}}{σ_{i} 2 \sqrt{2}}) \leq \frac{2}{\sqrt{π}} \frac{δ_{i}}{σ_{i} 2 \sqrt{2}} = \frac{δ_{i}}{σ_{i} \sqrt{2 π}} .$

Consequently

$\begin{matrix} E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} \prod_{i = 1}^{n} [1 - \frac{δ_{i}}{σ_{i} \sqrt{2 π}}] . \end{matrix}$

Using the fact that for $x, y$ positive, $(1 - x) (1 - y) \geq 1 - (x + y)$ , we get a final bound:

$\begin{matrix} E [P_{a} (f ∣ \to x)] & \geq \frac{1}{2} [1 - \sum_{i = 1}^{n} \frac{δ_{i}}{σ_{i} \sqrt{2 π}}] \end{matrix}$
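The two-term fact extends to $n$ terms by induction. As a sketch, write $a_{i} = \frac{δ_{i}}{σ_{i} \sqrt{2 π}}$ (a shorthand introduced here) and assume every $a_{i} \leq 1$ , so that each factor $1 - a_{i}$ is non-negative. If the claim holds for the first $m$ terms, then:

$\prod_{i = 1}^{m + 1} (1 - a_{i}) \geq (1 - a_{m + 1}) (1 - \sum_{i = 1}^{m} a_{i}) = 1 - \sum_{i = 1}^{m + 1} a_{i} + a_{m + 1} \sum_{i = 1}^{m} a_{i} \geq 1 - \sum_{i = 1}^{m + 1} a_{i} .$

If instead some $a_{i} > 1$ , the right-hand side of the final bound is negative, so the bound holds trivially, since the expectation of a probability is non-negative.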
