Understanding Machine Learning (I)

sil-ver

Understanding Machine Learning (I)

post by Rafael Harth (sil-ver) · 2019-12-20T18:22:53.505Z · LW · GW · 12 comments

  The Learning Model
  Generalization / Prior Knowledge / Overfitting
  PAC learning
None
12 comments

Understanding Machine Learning is another book on Miri's research guide. I've previously written about Topology [LW · GW] and Linear Algebra Done Right [LW · GW]. With this one, my write-up ended up looking a lot more like a self-contained introduction into the field and a lot less like a review. I've decided to go with that, and make this a sort of shortened version of the book.

Throughout this series, I will be using bold and italics to mean "this is the first official mention of a new Machine Learning term", whereas normal italics means regular emphasis.

The Learning Model

The book begins with a sort of mission statement. Given a domain set $X$ and a target set $Y$ , we wish to find a function $h : X \to Y$ , which we interchangeably call a hypothesis or a predictor or a classifier. In the example used in the first chapter, $X$ represents the set of all papayas, and $Y$ represents a two-element set where one element is the label "tasty" and the other the label "not-tasty". To compress this into something manageable, we represent papayas by a set of features, in this case color and softness, which we both map onto the interval $[0, 1]$ , and we represent the two labels by 1 and 0. I.e. $X = [0, 1]^{2}$ and $Y = {0, 1}$ . Now the idea is that, if we learn such a predictor, we can apply it to any new papaya we encounter, say on the market, and it will tell us whether or not it is tasty before we've spent any money on it.

This model seems quite general. It covers chess, for example: one could let $X$ be the set of all possible states of the chessboard and $Y$ be the set of all moves; then we want $h$ to map each $x \in X$ onto the best possible move. In fact, it is hard to see which solvable problem could not be modeled this way. Any time we think we can in principle solve some problem, it would consist of us looking at some stuff and then presenting the solution in some way. Now we just let the set of all possible instances of such stuff be $X$ and the set of all possible solutions be $Y$ , and we have an instance of this model.

One would have to put in a lot more work to fully formalize this problem – and this is exactly what the book does. Recall that $X$ is called the domain set. If $Y$ is finite as it is for the papayas example, we also call it the label set.

We model the environment, which creates our labeled domain points $(x, y)$ by a probability distribution $D$ over $X \times Y$ as well as the i.i.d. assumption (not bolded since it's not ML-specific), which states that all elements are independently and identically distributed according to $D$ . In other words, whenever a papaya is plucked in the real world, the environment throws some coins and chooses its color and softness, as well as tastiness, according to $D$ . Of course, we generally assume that the color and softness have a major impact on the tastiness of the papaya, but it is not reasonable to assume that they fully determine its tastiness – surely it is possible that two papayas that are equally colored and equally soft taste differently. Thus, given some particular $x$ , we might reasonably hope that either $(x, 0)$ or $(x, 1)$ is going to be a lot more probable than the other, but both are still going to be possible. On the other hand, in the chess example, one could argue that the domain set fully determines the value of the target set, i.e. the best possible move (although then the idea of having a probability distribution over $X$ becomes problematic). In such a case, one can alternatively model the environment as having a "true" function $f : X \to Y$ . This approach is strictly less general than the former (it corresponds to a probability distribution where for every $x$ , either $(x, 0)$ or $(x, 1)$ has probability 0), but in both cases, the joint distribution / true function is not known, and it is exactly what we are trying to figure out.

We say that the algorithm $A$ which outputs a predictor $h : X \to Y$ is the learner. As its input (assuming we are doing what is called supervised learning here), the learner has access to a training sequence $S$ . This is a sequence of labeled domain points, i.e. $S = ((x_{1}, y_{1}), . . ., (x_{n}, y_{n}))$ which have been generated by the joint distribution $D$ . For this reason, one can consider the training sequence to be a glimpse into how the environment works, and the learner's job to be to generalize from that sequence to all of $D$ , i.e. to output a predictor $h$ which does a good job on not just the training data but also on new elements that are generated by $D$ .

We also define a loss function $ℓ$ to formalize what we mean by "good job". The loss function takes a hypothesis and outputs a number in $[0, 1]$ that is meant to be a rating of how good the hypothesis is, where small values are good and large values are bad. For the case of binary classification (where $| Y | = 2$ ), we will set $ℓ (h) := D {(x, y) \in X \times Y | h (x) \neq y}$ , i.e. the probability that our predictor gets the label on a freshly sampled point wrong (although in practice, one might also be interested in loss functions that penalize false positives more strongly than false negatives, or vice versa). A perfect predictor will have loss 0, but in the setting of a joint distribution $D$ , this will generally be impossible. How good the best possible predictor will perform depends on how well the feature vector describes the domain points. With a poor choice of features, the learner is doomed to fail from the beginning (like if you want it to recognize spam emails, but you only show it aspects of the emails such that there is at best a statistically weak correlation between them and whether the email is spam or not-spam). Of course, we cannot evaluate our loss function since we don't know $D$ , but it is nonetheless possible to make statements about the expected loss of the predictor which a learner outputs.

Finally, we define an empirical loss function $ℓ_{S}$ which describes the error of a predictor on the training sequence, i.e. $ℓ_{S} (h) := \frac{| {(x, y) \in S : h (x) \neq y)} |}{| S |}$ . Unlike the regular loss function, the empirical loss function $ℓ_{S}$ is something we can evaluate. We use the term true error to refer to the output of $ℓ$ , and empirical error to refer to the output of or $ℓ_{S}$ .

This describes (almost) all objects of the learning model. Now the question is how to implement the learner $A$ . Formally, $A$ is described as simply an algorithm, so we have complete freedom to do whatever we want.

Generalization / Prior Knowledge / Overfitting

The learner $A$ takes a training sequence, does whatever it wants with it, and outputs a predictor. How would we go about defining $A$ ? A reasonable-sounding starting point is to choose a predictor which performs well on the training sequence $S$ . Given that $S$ is representative of the environment (recall that it was generated by $D$ ), the hope is that such a predictor would also perform well on the environment.

This alone does not answer the question, though. Suppose we are given the following training sequence (or -set as we will ignore order for now):

(Black points $\leftrightarrow$ tasty instances, red points $\leftrightarrow$ not-tasty instances.) There are many ways to construct a predictor with zero error on this training set. Here is one (namely the predictor that labels points as tasty iff they are in the blue area):

This does not seem like a good predictor. For example, the green point I added (which is meant to be a new instance not part of the training sequence) would almost certainly correspond to a tasty papaya, yet our hypothesis predicts that it is not tasty. Nonetheless, this predictor has zero error on the training sequence (by the definition of $ℓ_{S}$ ). An even more extreme example of this is $A_{memorize}$ , which simply remembers all domain points and outputs a hypothesis that only labels those instances as positive that have occurred in the training sequence with a positive label. The process of outputting such a hypothesis is often called overfitting. It's said that the hypothesis fits the data "too well," as the book puts it. But I think this is not quite the right way to look at it. Consider this alternative hypothesis:

This hypothesis seems better, but not because it is a worse fit for the data – both hypotheses have zero empirical error. There is nothing wrong with fitting the data, there is only something wrong with fitting the data at the cost of outputting an "unreasonable looking" predictor. I think the way to arrive at a more accurate description is by looking at it from a Bayesian perspective. We have a prior on how a proper predictor will look like, and this prior says that the first predictor is worse (as illustrated by how it classifies the green point). However, suppose we had a new example right next to the green point, and it was not-tasty. We would probably still dislike that predictor – the one not-tasty papaya could be an outlier. On the other hand, if we had a million sample points right next to the green one, all of which not-tasty, we would eventually admit that something super weird is really going on here and a very particular softness-color combination really does lead to papayas being not-tasty.

I think the correct framing of the overfit/underfit dilemma is that it is about how far to update away from priors based on evidence, or alternatively, how to weigh priors vs evidence. It's also worth noting that if we had a training sequence that included all possible instances (assuming there is a true labeling function for the sake of this example), we could disregard priors altogether – in that case, $A_{memorize}$ would work work perfectly. Analogously, there is always some amount of evidence that should overturn any prior [LW · GW]. Of course, in our formal model, there may be infinitely many possible feature vectors, but in the real world, every instance description is bounded so the space of all instances must be finite.

One crude way of deciding how much to weigh priors vs evidence is by committing to a hypothesis class ahead of time – i.e. to a subset $H \subset {f : X \to Y}$ – and then finding a predictor that performs maximally well on the training sequence, only taking into considerations predictors in the class we committed to (i.e., a predictor in the set ${argmin}_{h \in H} [ℓ_{S} (h)]$ ). It is crude because it only allows the prior to give a binary "yes" or "no" to a hypothesis, rather than a "probably not, but maybe with enough evidence". Suppose (for the papaya example), we pre-commit to the class of axis-aligned rectangles, as the book suggests (i.e. rectangles with two horizontal and two vertical sides). Consider the following training sequence:

This does not seem like that unreasonable of a glimpse into reality – perhaps it is really the case that if green papayas are too soft, they become gross, but if they're more orange or whatever, then it's fine if they're soft. You might not expect that to be the case, but you'd probably assign it some reasonable probability. This is not allowed by the class of axis-aligned rectangles: cording to this class, the only possible predictor is one where both a certain softness area and a certain color area have to be hit, and one factor does not influence the other. Therefore, if we applied a learning algorithm to this training sequence and restricted it to that class, it would have to fit an axis-aligned rectangle in there as best as possible – which would perform much better than chance, but still seems like a suboptimal choice.

A more sophisticated way of answering the "how strongly to weigh evidence vs priors" question is to choose a family of several hypothesis classes, as well as a weighting function that privileges how much we like each class. Then, perhaps the class of axis-aligned rectangles would have the highest initial weighting, but the class of all rectangles would still be rated decently, so that if it performs substantially better – as is the case here (although it would still have nonzero empirical error) – the learning algorithm would end up preferring a non-axis-aligned rectangle. We can also look at this as having uncertainty about the right model in addition to uncertainty about which predictor within the model to choose. More about this later.

PAC learning

Even though I've criticized the pre-commitment approach, it is a good place to begin a more formal analysis. In particular, we can now talk about a hypothesis class being "learnable". Here, being learnable means that, with sufficiently many training instances, one can prove that, a predictor that performs optimally on the training sequence will probably perform almost as well on the real world as the hypothesis which has the lowest real error in the class. Crucially, this holds for every joint probability distribution $D$ .

So this definition is not about how well the hypothesis we end up with performs on the real world in absolute terms – that's not something which is possible to bound, because the hypothesis class may just not contain any good predictor at all (as is the case in the example above if we use the class of axis-aligned rectangles, and of course other classes could be much worse). Rather, it is about how good of a choice the lowest-empirical-error hypothesis is given that we are stuck with this class – i.e. how large is the difference between the real error of [the hypothesis with minimal empirical error] and the real error of the best hypothesis, i.e. [the hypothesis with minimal real error]. Such a bound is important because, since we can't measure the real error, we cannot output the best hypothesis directly.

There are also two caveats built into the definition. One is the "almost" part; one can never guarantee that the hypothesis with minimal empirical error is exactly as good as the hypothesis with minimal real error because there is randomness in the creation of the training sequence. The second is the "probably" part, one can never upper bound the difference with full certainty because there is always some chance that the training sequence just happens to be unrepresentative. For example, there is always some chance that the training sequence doesn't contain any positively labeled points, in which case $A$ has no way of reliably outputting anything reasonable. However, as the training sequence gets larger, this becomes increasingly unlikely.

The complete formal definition can be stated thus (where we write $A (S)$ to denote the predictor which $A$ outputs upon receiving the training sequence $S$ ):

A hypothesis class $H$ is called PAC learnable iff there exists a learner $A$ and a function $m^{*} : (0, 1)^{2} \to N$ such that, for any probability distribution $D$ over $X \times Y$ , any "gap" $ϵ \in (0, 1)$ , and any failure probability $δ \in (0, 1)$ , if the training sequence $S$ consists of at least $m^{*} (ϵ, δ)$ i.i.d. examples sampled via $D$ , then with probability at least $1 - δ$ over the choice of $S$ , it holds that $ℓ (A (S)) \leq ℓ (h^{*}) + ϵ$ for all $h^{*} \in {argmin}_{h \in H} [ℓ (h)]$ .

While this defintion is complicated, it is not unnecessarily complicated – the complexity is just an inherent feature of the thing we are trying to define. This kind of learnability is called PAC learnability because the definition states that $A (S)$ is probably (hence the $δ$ ) approximately (hence the $ϵ)$ correct (hence the " $\leq$ ").

One can prove that, whenever a class is PAC-learnable at all, then the algorithm $A_{ERM}$ which chooses its predictor such that $A_{ERM} (S) \in {argmin}_{h \in H} [ℓ_{S} (h)]$ is a successful learner. In other words, we can always just choose the predictor that performs best on the training sequence (or one of them if several are tied). The abbreviation ERM stands for empirical risk minimization.

One can also prove that every finite hypothesis class is PAC learnable. In particular, the sample complexity (this is a name for the output of the function $m^{*}$ which upper-bounds the number of training points needed to be probability approximately correct) is upper bounded by the term $\frac{ln (| H | / δ)}{ϵ}$ .

Let's reflect on this result a bit. If we use $A_{ERM}$ as our learner, then learnability implies that the difference between the true error of $A_{ERM} (S)$ and the true error of the best hypothesis in $H$ is probably very small. In particular, if $H$ consists of $10000$ hypotheses, and we want a $99 %$ assurance that the difference between $A_{ERM} (S)$ and the best predictor in $H$ is at most 0.1, then $ln (10000 / 0.01) / 0.1 \approx 138.15$ so we if we have a training sequence with 139 labeled points, we're good. The only remaining error is then the error of the best hypothesis in $H$ . This is also called the approximation error, while the error we just bounded is called the estimation error.

Now consider declaring the class $H$ to be the set of all predictors such that there exists a C program, whose compiled code is at most 10000 bits long, that implements the predictor. It's a safe bet that this class has a very low approximation error for almost all practically relevant cases (or if not then we can allow longer programs). Given that the dependence on the size of $H$ – in this case, $2^{10000}$ – is only logarithmic, it increases the sample complexity by a factor smaller than $10000$ . If we then demand an estimation error of at most 0.01 and some very high confidence (note that the dependence on the confidence is also logarithmic, so even demanding a 0.9999 certainty doesn't change much), we would still end up with a sample complexity of only around a million. Gathering a million examples is probably doable in many cases (although as George points out [LW(p) · GW(p)], assuming that the i.i.d. assumption holds may not be realistic in practice). Thus, we have just derived a general algorithm which, provided there is a way to obtain sufficiently many i.i.d. data points, solves every binary classification task, namely

1. Perform an exhaustive search over all C programs of compiled length in bits at most 10000

2. Throw out all programs that don't implement a predictor

3. Compute the empirical risk for each one that does

4. Output the one with the lowest empircal risk.

For the output of this algorithm, both the approximation error (i.e. the error of the best C program) and the estimation error (i.e. the difference between the error of the hypothesis $A_{ERM}$ outputs and that of the best C program) are going to be very low!

Back in the real world, step 1 does of course mean that this algorithm is utterly useless in practice, but only due to its runtime. With unbounded computing power, this approach would work wonderfully. In principle, everything (well, at least every binary classification task) is learnable from a relatively small set of examples, certainly fewer examples than what is often available in practice. It doesn't seem completely unreasonable to consider this some (small) amount of evidence on the importance of talent vs data, and perhaps as decent evidence on the inference power of a really smart AI?

The next post will discuss some theoretical limits of learning and prove the No-Free-Lunch theorem.

12 comments

Comments sorted by top scores.

comment by TurnTrout · 2019-12-20T19:29:26.080Z · LW(p) · GW(p)

Great review. I really enjoyed this book, and I’m glad to see it written up. Will you be covering VC dimension?

Replies from: sil-ver

↑ comment by Rafael Harth (sil-ver) · 2019-12-20T19:40:38.061Z · LW(p) · GW(p)

Yes! It's in the second post, which is already (almost) finished. VC dimension was one of my favorite concepts.

comment by philh · 2019-12-24T18:24:55.993Z · LW(p) · GW(p)

In the case of binary classification (i.e. where |Y|=2), we always set ℓ(h):=D{(x,y)∈X×Y|h(x)≠y}, i.e. the probability that our predictor gets the label on a freshly sampled point wrong.

Does this assume that we want a false negative to be penalized the same as a false positive? Or would that be adjusted for in some other place?

Replies from: sil-ver

↑ comment by Rafael Harth (sil-ver) · 2019-12-24T19:50:35.949Z · LW(p) · GW(p)

Yeah, it does mean that. I hadn't thought about it until now, but you're right that it's not at all obvious. The book never mentions weighing them differently, but the model certainly allows it. It may be that an asymmetric loss function complicates the results; I'd have to check the proofs.

I'll edit the part you quoted to make it a weaker claim.

comment by George3d6 · 2019-12-21T21:06:47.964Z · LW(p) · GW(p)

The idea of PAC learning always rubbed me the wrong way, since it contains the silent assumption that there are no spurious correlations in the dataset (or to word it in a non-causal way, that if no latent variables are required to fit a model from the training data, no latent variables will be required to predict test data with the same accuracy) and if people run with the idea of make claims like:

With unbounded computing power, this approach would work wonderfully. In principle, everything (well, at least every binary classification task) is learnable from a relatively small set of examples, certainly fewer examples than what is often available in practice. It doesn't seem completely unreasonable to consider this some (small) amount of evidence on the importance of talent vs data, and perhaps as decent evidence on the inference power of a really smart AI?

Which is inherently untrue in any field other than maybe particle physics if experiments were ran with god-like instruments that had no errors.

Replies from: sil-ver

↑ comment by Rafael Harth (sil-ver) · 2019-12-21T22:16:15.149Z · LW(p) · GW(p)

the silent assumption that there are no spurious correlations in the dataset

Isn't that the i.d.d assumption?

We model the environment, which creates our labeled domain points (x,y) by a probability distribution D over X×Y as well as the i.d.d. assumption (not bolded since it's not ML-specific), which states that all elements are independently and identically distributed according to D.

If so it's not silent – it's a formal part of the model. The statements about PAC learnability are mathematical proofs, so there's no room to argue with those, there's only room to argue that the model is not realistic.

Although I admit I didn't mention the i.d.d assumption in the paragraph that you quoted.

Replies from: George3d6

↑ comment by George3d6 · 2019-12-21T23:29:32.512Z · LW(p) · GW(p)

Well, the i.d.d assumption or the CLT or whatever variation you want to go to is, in my opinion, rather pointless. It aims at modeling exactly the kind of simplistic idealized system that don't exist in the real world.

If you look at most important real world systems, from biological organisms, to stock markets, to traffic. The variables are basically the opposite of i.d.d., they are all correlated, even worse, the correlations aren't linear.

You can model traffic all you want and try to drive in an "optimal" way but then you will reach the "driving like an utter asshole" edge case which has the unexpected result of "tailgating".

You can model the immune system based on over 80 years of observations in an organism and try to tweak it just a tiny it to help fight an infection and the infinitesimal tweak will cause the "cytokine storm" edge case (which will never have been observed before, since it's usually fatal).

Further more the criticisms above don't even mention the idea that, again, you could have a process where all the variables are i.d.d (or can be modeled as such), but you just happen to have missed some variables which are critical for some states, but not for other states. So you get a model that's good for a large amount of state and then over-fits on the states where you are missing the critical variables (e.g. this is the problem with things like predicting earthquakes or the movement of the Earth's magnetic pole).

All problems where i.d.d reasonably fits the issue are:

a) Solved (e.g. predicting tides reasonably accurately)

b) Solvable from a statistical perspective but suffer from data gathering issues (e.g. a lot of problem in physics, where running the experiments is the hard part)

c) Boring and/or imaginary

Replies from: sil-ver, interstice

↑ comment by Rafael Harth (sil-ver) · 2019-12-22T08:11:46.340Z · LW(p) · GW(p)

Right. I just took issue with the "unsaid" part because it makes it sound like the book makes statements that are untrue, when in fact it can at worst make statements that aren't meaningful ("if this unrealistic assumption holds, then stuff follows"). You can call it pointless, but not silent, because well it's not.

I'm of course completely unqualified to judge how realistic the i.d.d. assumption is, having never used ML in practice. I edited the paragraph you quoted to add a disclaimer that it is only true if the i.d.d assumption holds.

But I'd point out that this is a text book, so even if correlations are as problematic as you say, it is still a reasonable choice to present the idealized model first and then later discuss ways to model correlations in the data. No idea if this actually happens at some point.

↑ comment by interstice · 2019-12-22T00:16:39.115Z · LW(p) · GW(p)

This seems much too strong, lots of interesting unsolved problems can be cast as i.i.d. Video classification, for example, can be cast as i.i.d., where the distribution is over different videos, not individual frames.

comment by philip_b (crabman) · 2019-12-23T10:22:47.045Z · LW(p) · GW(p)

Two nitpicks:

like if you want it to recognize spam emails, but you only show it aspects of the emails such that there is at best a statistically weak correlation between them and whether the email is spam or not-spam

Here "statistically weak correlation" should be "not a lot of mutual information", since correlation is only about linear dependence between random variables.

i.d.d.

Should be i.i.d.

Replies from: sil-ver

↑ comment by Rafael Harth (sil-ver) · 2019-12-23T11:42:01.843Z · LW(p) · GW(p)

Thanks; fixed the second one.

On the first – I don't understand why correlation doesn't apply. "Spam" is a random variable, and whatever feature you show, like "length", is also a random variable – right?

comment by Alex K. Chen (parrot) (alex-k-chen) · 2023-09-24T16:29:42.995Z · LW(p) · GW(p)

Has anone used a Probably Approximately Correct (PAC) framework to develop a rough formalization of taste of a learner A(S)? [and whether a learner "has it in them" to correctly classify H, a complex hypothesis space that is more complicated than the S they've seen?]

Especially learners who come from "shitty environments" (have poor S), but still have the A(S) in them to suddenly "jump" once they have exposure to the right people

Some “A” functions might have unusually nonlinear behavior once a person is exposed to the right set of tutors or environment (and some “A” functions never have it in them)

Understanding Machine Learning (I)

Contents

The Learning Model

Generalization / Prior Knowledge / Overfitting

PAC learning

12 comments