
Comments sorted by top scores.

comment by crabman · 2020-05-14T08:49:24.793Z · LW(p) · GW(p)

“big data” refers to situations with so much training data you can get away with weak priors. The most powerful recent advances in machine learning, such as neural networks, all use big data.

This is only partially true. Consider some image classification dataset, say MNIST or CIFAR10 or ImageNet. Consider some convolutional relu network architecture, say, conv2d -> relu -> conv2d -> relu -> conv2d -> relu -> conv2d -> relu -> fullyconnected with some chosen kernel sizes and numbers of channels. Consider some configuration w of its weights. Now consider the multilayer perceptron architecture fullyconnected -> relu -> fullyconnected -> relu -> fullyconnected -> relu -> fullyconnected -> relu -> fullyconnected. Clearly, there exist hyperparameters of the multilayer perceptron (numbers of neurons in hidden layers) such that there exists a configuration w' of weights of the multilayer perceptron, such that the function implemented by the multilayer perceptron with w' is the same function as the function implemented by the convolutional architecture with w. Therefore, the space of functions which can be implemented by the convolutional neural network (with fixed kernel sizes and channel counts) is a subset of the space of functions which can be implemented by the multilayer perceptron (with correctly chosen numbers of neurons). Therefore, training the convolutional relu network is updating on evidence and having a relatively strong prior, while training the multilayer perceptron is updating on evidence and having a relatively weak prior.
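A minimal sketch of the two architectures, assuming PyTorch and CIFAR10-sized inputs; the kernel sizes, channel counts, and layer widths below are illustrative placeholders, not tuned choices:

```python
# Sketch: a small conv-relu stack and an MLP of the same depth, assuming PyTorch
# and 3x32x32 inputs (CIFAR10-sized). All sizes are placeholders.
import torch
import torch.nn as nn

conv_net = nn.Sequential(  # conv2d -> relu, four times, then a linear head
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 10),
)

mlp = nn.Sequential(  # fullyconnected -> relu, four times, then a final fullyconnected
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
)

# Each convolution is a linear map on the flattened input, so an MLP whose hidden
# layers are at least as wide as each conv layer's flattened output (32*32*32 = 32768,
# etc.) has a weight configuration w' that computes exactly the same function as the
# conv net with weights w. The 4096-wide layers here are just placeholders.
x = torch.randn(8, 3, 32, 32)
print(conv_net(x).shape, mlp(x).shape)  # both: torch.Size([8, 10])
```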

Experimentally, if you train the networks described above, the convolutional relu network will learn to classify images well, or at least okay-ish. The multilayer perceptron will not learn to classify images well; its accuracy will be much worse. Therefore, the data is not enough to wash away the multilayer perceptron's prior, hence by your definition it can't be called big data. Here I must note that ImageNet is the biggest publicly available dataset for training image classification, so if anything is big data, it should be.

--

Big data uses weak priors. Correcting for bias is a prior. Big data approaches to machine learning therefore have no built-in method of correcting for bias.

This looks like a formal argument, a demonstration or dialectics as Bacon would call it [LW · GW], which uses shabby definitions. I disagree with the conclusion, i.e. with the statement "modern machine learning approaches have no built-in method of correcting for bias". I think in modern machine learning people are experimenting with various inductive biases and various ad-hoc fixes or techniques which help correct for all kinds of biases.
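As a minimal sketch of the kind of techniques meant here (assuming PyTorch/torchvision; the numbers are placeholders):

```python
# Sketch: two common built-in corrections, assuming PyTorch/torchvision.
import torch
import torch.nn as nn
from torchvision import transforms

# 1. An inductive bias baked into the data pipeline: random crops and flips encode
#    the prior that class labels are invariant to small translations and mirroring.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# 2. A fix for class-imbalance bias: reweight the loss so rare classes are not
#    ignored. The class counts here are illustrative placeholders.
class_counts = torch.tensor([5000., 5000., 500., 50.])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)
```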

--

In your example with a non-converging sequence, I think you have a typo - there should be rather than .

comment by lsusr · 2020-05-14T19:10:24.194Z · LW(p) · GW(p)

I think in modern machine learning people are experimenting with various inductive biases and various ad-hoc fixes or techniques which help correct for all kinds of biases.

The conclusion of my post is that these fixes and techniques are ad-hoc because they are written by the programmer, not by the ML system itself. In other words, the creation of ad-hoc fixes and techniques is not automated.

comment by lsusr · 2020-05-14T18:48:10.422Z · LW(p) · GW(p)

In your example with a non-converging sequence, I think you have a typo - there should be rather than

Fixed. Thank you for the correction.

comment by Alexei · 2020-05-14T21:12:51.977Z · LW(p) · GW(p)

I’m getting more and more interested in this sequence. Very applicable to what I’m doing: analyzing market data.

comment by MoritzG · 2020-05-18T11:01:17.415Z · LW(p) · GW(p)

As we know and as you mentioned, humans do learn from small data. We start with priors that are hopefully not too strong and go through the known processes of scientific discovery. NNs do not have that meta-process or any introspection (yet).

"You cannot solve this problem by minimizing your error over historical data. Insofar as big data minimizes an algorithm's error over historical results ... Big data compensates for weak priors by minimizing an algorithm's error over historical results. Insofar as this is true, big data cannot reason about small data."

NNs also do not reduce/idealize/simplify, explicitly generalize, and then run the results as hypothesis forks, or use priors to run checks (BS rejection / specificity). We do.

Maybe there will be an evolutionary process where huge NNs are reduced to do inference at "the edge", which turns into human-like learning once feedback from "the edge" is used to select and refine the best nets.

comment by lsusr · 2020-05-25T18:06:33.713Z · LW(p) · GW(p)

NNs also do not reduce/idealize/simplify, explicitly generalize, and then run the results as hypothesis forks, or use priors to run checks (BS rejection / specificity). We do.

This is very important. I plan to follow up with another post about the necessary role of hypothesis amplification in AGI.

Edit: Done. [LW · GW]

comment by MoritzG · 2020-05-25T06:45:36.113Z · LW(p) · GW(p)

I came across this:

The New Dawn of AI: Federated Learning

" This [edge] update is then averaged with other user updates to improve the shared model."

I do not know how that is meant, but when I hear the word "average" my alarms always sound.

Instead of a shared NN, each device should get multiple slightly different NNs/weights and report back which set was worst/unfit and which best/fittest.
Each set/model is a hypothesis, and the test in the world is an evolutionary/democratic falsification.
Those mutants that fail to satisfy the most customers are dropped.
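A rough sketch of that proposal, assuming NumPy; the population size, perturbation scale, and fitness function are hypothetical stand-ins, not an existing system:

```python
# Sketch: the server maintains several weight sets ("mutants"), devices score them
# in the field, and the worst performers are dropped and replaced by perturbed
# copies of the best. All names and the fitness function are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_mutants, dim = 8, 1000
population = [rng.normal(size=dim) for _ in range(n_mutants)]  # candidate weight sets

def device_fitness(weights: np.ndarray) -> float:
    """Stand-in for feedback from the edge, e.g. a user-satisfaction proxy."""
    return -float(np.linalg.norm(weights))  # placeholder objective

for generation in range(10):
    # Each device evaluates the candidates it was given and reports scores back.
    scores = np.array([device_fitness(w) for w in population])
    ranked = np.argsort(scores)                       # worst first, best last
    survivors = [population[i] for i in ranked[n_mutants // 2:]]
    # Replace the dropped half with slightly perturbed copies of the survivors.
    children = [w + 0.1 * rng.normal(size=dim) for w in survivors]
    population = survivors + children
```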

comment by lsusr · 2020-05-25T18:03:58.280Z · LW(p) · GW(p)

NNs are a big data approach, tuned by gradient descent. Because of that, every update is necessarily small (in the mathematical sense of first-order approximations). When updates are this small, averaging is fine, especially considering how most neural networks use sigmoid activation functions.
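A minimal sketch of why averaging is benign when each update is a small first-order step, assuming every device starts the round from the shared weights (the notation here is mine, not from the post):

```latex
% Each of K devices starts from the shared weights \theta_t and takes one small
% step of size \eta on its local loss L_k. Averaging the per-device updates is
% then exactly one step on the averaged gradient:
\[
  \frac{1}{K}\sum_{k=1}^{K}\bigl(\theta_t - \eta\,\nabla L_k(\theta_t)\bigr)
  \;=\;
  \theta_t - \eta\,\frac{1}{K}\sum_{k=1}^{K}\nabla L_k(\theta_t).
\]
% With several local steps per round, the two sides differ only by higher-order
% O(\eta^2) terms for a smooth loss, which is why averaging small updates is fine.
```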

While this averaging approach can't solve small data problems, it is perfectly suited to today's NN applications, where things tend to be well-contained, without fat tails. This approach works fine within the traditional problem domain of neural networks.

comment by eapache (evan-huus) · 2020-05-15T02:28:06.640Z · LW(p) · GW(p)

Would it be a reasonable interpretation here to read "beauty" as in some sense the inverse of Shannon entropy?

comment by lsusr · 2020-05-15T04:04:45.655Z · LW(p) · GW(p)

Yes.

comment by romeostevensit · 2020-05-15T05:10:39.218Z · LW(p) · GW(p)

Oh, this made me realize that sample complexity reduction is fundamentally forward-looking, as the lossy compression is a tacit set of predictions about what is worthwhile, i.e. guesses about the sensitivity of your model. Interesting.