Revisiting the Manifold Hypothesis
post by Aidan Rocke (aidanrocke) · 2023-10-01T23:55:56.704Z · LW · GW · 19 comments
In the following analysis, a machine learning challenge is proposed by Sasha Kolpakov, Managing Editor of the Journal of Experimental Mathematics, and myself to verify a strong form of the Manifold Hypothesis due to Yoshua Bengio.
Motivation:
In the Deep Learning book, Ian Goodfellow, Yoshua Bengio and Aaron Courville credit the unreasonable effectiveness of Deep Learning to the Manifold Hypothesis, as it implies that the curse of dimensionality may be avoided for most natural datasets [5]. If the intrinsic dimension of our data is much smaller than its ambient dimension, then sample-efficient PAC learning is possible. In practice, this happens via the latent space of a deep neural network that auto-encodes the input.
Yoshua Bengio in particular argued that the hierarchical/compositional structure of the human brain exploits the Manifold Hypothesis as an inductive bias [7], so any efficiently computable function that is of interest to humans is most likely PAC-learnable. A strong form of this hypothesis, which is generally assumed by OpenAI and DeepMind, suggests that deep learning may be the ultimate path to AGI.
However, a plausible exception to Bengio's heuristic is the mysterious distribution of prime numbers that has captured the imagination of the best mathematicians of all ages:
Mathematicians have tried in vain to this day to discover some order in the sequence of prime numbers, and we have reason to believe that it is a mystery into which the human mind will never penetrate. (Euler)
Though perhaps, if Euler had had access to GPUs, he would have generalised this statement to the minds of machines. In fact, after careful deliberation with Steve Brunton, Marcus Hutter and Hector Zenil regarding the empirical observation that deep learning models fail to approximate the Prime Counting Function, it appears that not all efficiently computable functions of human interest are PAC-learnable.
In order to clarify the nature of these observations, Sasha Kolpakov and I rigorously formulated the Monte Carlo Hypothesis, which implies that machine learning alone is not a complete path to human-level intelligence. If Bernhard Riemann knew the behaviour of the Prime Counting Function, it must have been by means other than data compression, for reasons clarified below.
The Monte Carlo Hypothesis:
The Prime Coding Theorem:
From the definition of the prime encoding $X_N = (x_n)_{n=1}^N$, where $x_n = 1$ if $n$ is prime and $x_n = 0$ otherwise, Kolpakov and Rocke (2023) used Kolmogorov's theory of Algorithmic Probability to derive the Prime Coding Theorem [1]:

$$\mathbb{E}[K(X_N)] \sim \pi(N) \cdot \ln N \sim N$$

which implies that the locations of all primes in $[1, N]$ are statistically independent of each other for large $N$.
Monte Carlo Hypothesis:
If the best machine learning model predicts the next $N$ primes to be at $a_1 < a_2 < \dots < a_N$, then for large $N$ this model's statistical performance will converge to a true positive rate that is no better than the density of primes near $a_N$:

$$\frac{\pi(a_N)}{a_N} \sim \frac{1}{\ln a_N}$$
Hence, the true positive rate for any machine learning model converges to zero.
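The decay of this chance-level rate can be checked empirically: by the Prime Number Theorem, the density of primes near $N$ is roughly $1/\ln N$. A minimal sketch in Python (the window size and sample points are illustrative choices, not part of the challenge):

```python
# Empirical check that the density of primes near N, which is the
# chance-level true positive rate for a guessing predictor, tracks
# 1 / ln N and decays toward zero. Window size is an illustrative choice.
import math

def prime_density(N: int, window: int = 10_000) -> float:
    """Fraction of integers in [N, N + window) that are prime."""
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        for d in range(2, int(n ** 0.5) + 1):
            if n % d == 0:
                return False
        return True
    return sum(is_prime(n) for n in range(N, N + window)) / window

for N in (10**4, 10**5, 10**6):
    print(f"N = {N}: density {prime_density(N):.4f}, 1/ln N = {1 / math.log(N):.4f}")
```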
Machine Learning Challenge:
Challenge:
Given the locations of the first N prime numbers, predict the location of the next prime.
Training data:
The prime encoding $(x_n)_{n=1}^N$, where $x_n = 1$ if $n$ is prime and $x_n = 0$ otherwise.
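For concreteness, the encoding can be generated with the sieve of Eratosthenes. A minimal sketch (the value of N below is an illustrative placeholder, not the challenge's actual training size):

```python
# Construct the prime encoding [x_1, ..., x_N], with x_n = 1 iff n is prime,
# using the sieve of Eratosthenes.

def prime_encoding(N: int) -> list[int]:
    """Return [x_1, ..., x_N] with x_n = 1 iff n is prime."""
    is_prime = [True] * (N + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(N ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, N + 1, p):
                is_prime[multiple] = False
    return [1 if is_prime[n] else 0 for n in range(1, N + 1)]

# N = 30 is an illustrative placeholder.
print(prime_encoding(30))
```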
Test data:
The locations of the next thousand prime numbers.
Evaluation:
For the reasons given below, we will evaluate reproducible models on ten distinct rearrangements of the prime encoding and report the arithmetic mean of their true positive rates.
As the expected Kolmogorov complexity of a random variable is asymptotically equivalent to its Shannon entropy, and Shannon entropy is invariant under permutations of the underlying sequence, it follows that the Prime Coding Theorem, and hence the Monte Carlo Hypothesis, is invariant under rearrangements of prime encodings.
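A minimal sketch of this evaluation protocol, assuming a model is a callable that takes an encoding and returns the set of indices it predicts to be prime; the permutation count, model interface and scoring details are illustrative, not the official harness:

```python
# Score a model on several random rearrangements of a prime encoding and
# average its true positive rate. The model interface (encoding -> set of
# predicted 1-positions) is an assumption for illustration.
import random

def true_positive_rate(predicted: set[int], actual: set[int]) -> float:
    """Fraction of predicted positions that are actually primes."""
    if not predicted:
        return 0.0
    return len(predicted & actual) / len(predicted)

def evaluate(model, encoding: list[int], n_permutations: int = 10,
             seed: int = 0) -> float:
    """Mean TPR of `model` over random rearrangements of `encoding`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        perm = list(range(len(encoding)))
        rng.shuffle(perm)
        shuffled = [encoding[i] for i in perm]
        predicted = model(shuffled)
        actual = {i for i, x in enumerate(shuffled) if x == 1}
        scores.append(true_positive_rate(predicted, actual))
    return sum(scores) / len(scores)
```

A perfect oracle scores 1.0 under this metric, an empty prediction 0.0; the Monte Carlo Hypothesis asserts that realistic models drift toward the latter as the encoding grows.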
Submission guidelines:
Models may be submitted to the tournament-specific email of Alexander Kolpakov, Managing Editor of the Journal of Experimental Mathematics, before March 14, 2024: wignerweyl@proton.me
The first 30 models we receive shall be evaluated, and the top 3 models shall be rewarded.
Reward:
A hundred dollars for each percentage point of the true positive rate (p):
R(p) = $100.00 x p
Consequences for Artificial General Intelligence:
The challenge aims to determine whether machine learning alone offers a complete path to human-level intelligence.
Yang-Hui He's experiments on the Prime Recognition problem:
Discussions with Yang-Hui He, a string theorist and number theorist at Oxford, motivated much of the theoretical analysis behind this scientific enterprise. In Deep Learning the Landscape [8], he finds that deep learning models capable of solving sophisticated computer vision problems converge to a true positive rate of no more than 0.1% on the Prime Recognition problem.
In fact, he summarizes these observations in the following manner:
Lest the reader's optimism be elevated to unreasonable heights by the string of successes with the neural networks, it is imperative that we be aware of deep learning's limitations. We therefore finish with a sanity check that a neural network is not some omnipotent magical device, nor an oracle capable of predicting any pattern. An example which must be doomed to failure is the primes (or, for that matter, the zeros of the Riemann zeta function). Indeed, if a neural network could learn some unexpected pattern in the primes, this would be a rather frightening prospect for mathematics.
Physical Intuitions for the Challenge:
In mathematical physics, the Riemann gas is a model in Quantum Statistical Mechanics illustrating deep correspondences between number theory, quantum statistical mechanics and dynamical systems. It helps us understand how the prime numbers describe the eigenspace of a deterministic dynamical system with unbounded phase-space dimension.
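The correspondence can be made concrete. In the bosonic Riemann gas ("primon gas", Julia 1990), each prime labels a single-particle state, and unique factorisation turns the partition function into the Riemann zeta function:

```latex
% One single-particle state per prime p, with energy
\varepsilon_p = \ln p .
% A multi-particle state is labelled by an integer n = \prod_p p^{k_p}
% (k_p primons occupying state p), so by unique factorisation its energy is
E_n = \sum_p k_p \ln p = \ln n ,
% and the canonical partition function at inverse temperature \beta is
Z(\beta) = \sum_{n \ge 1} e^{-\beta E_n}
         = \sum_{n \ge 1} n^{-\beta}
         = \zeta(\beta),
% the Riemann zeta function, convergent for \beta > 1.
```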
Hence, the Prime Recognition problem corresponds to reliably finding a low-dimensional representation of high-dimensional data.
Update: Since the deadline of March 14th there have been zero submissions, and no late submissions either as of April 8th, 2024. This may be because it is practically impossible to do better than the expected true positive rate, as Alexander Kolpakov and I explain in a recent publication: https://arxiv.org/abs/2403.12588
References:
- Kolpakov, Alexander & Rocke, Aidan. On the impossibility of discovering a formula for primes using AI. arXiv: 2308.10817. 2023.
- A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Surveys, 1983.
- Marcus Hutter. A theory of universal intelligence based on algorithmic complexity. arXiv: cs/0004001. 2000.
- P. Grünwald and P. Vitányi. Algorithmic information theory. arXiv: 0809.2754. 2008.
- Y. Bengio, A. Courville and Ian Goodfellow. Deep Learning. MIT Press. 2016.
- LeCun, Yann et al. “Deep Learning.” Nature 521 (2015): 436-444.
- Bengio, Yoshua et al. “Representation Learning: A Review and New Perspectives.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2012): 1798-1828.
- He, Yang-Hui. Deep Learning the Landscape. arXiv: 1706.02714. 2018.
- D. Spector, Supersymmetry and the Möbius Inversion Function, Communications in Mathematical Physics 127 (1990) pp. 239–252.
- Bernard L. Julia, Statistical theory of numbers, in Number Theory and Physics, eds. J. M. Luck, P. Moussa, and M. Waldschmidt, Springer Proceedings in Physics, Vol. 47, Springer-Verlag, Berlin, 1990, pp. 276–293.
Comments sorted by top scores.
comment by Nathaniel Monson (nathaniel-monson) · 2023-10-02T10:47:58.604Z · LW(p) · GW(p)
"implies that machine learning alone is not a complete path to human-level intelligence."
I don't think this is even a little true, unless you are using definitions of human level intelligence and machine learning which are very different than the ideas I have of them.
If you have a human who has never heard of the definition of prime numbers, how do you think they would do on this test? Am I allowed to supply my model with something equivalent to the definition?
↑ comment by Aidan Rocke (aidanrocke) · 2023-10-02T12:09:31.123Z · LW(p) · GW(p)
The best physicists on Earth, including Edward Witten and Alain Connes, believe that the distribution of primes and Arithmetic Geometry encode mathematical secrets that are of fundamental importance to mathematical physics. This is why the Langlands program and the Riemann Hypothesis are of great interest to mathematical physicists.
If number theory, besides being of fundamental importance to modern cryptography, allows us to develop a deep understanding of the source code of the Universe then I believe that such advances are a critical part of human intelligence, and would be highly unlikely if the human brain had a different architecture.
↑ comment by Nathaniel Monson (nathaniel-monson) · 2023-10-03T22:12:50.566Z · LW(p) · GW(p)
I agree with your entire first paragraph. It doesn't seem to me that you have addressed my question though. You are claiming that this hypothesis "implies that machine learning alone is not a complete path to human-level intelligence." I disagree. If I try to design an ML model which can identify primes, is it fair for me to give it some information equivalent to the definition (no more information than a human who has never heard of prime numbers has)?
If you allow that it is fair for me to do so, I think I can probably design an ML model which will do this. If you do not allow this, then I don't think this hypothesis has any bearing on whether ML alone is "a complete path to human-level intelligence." (Unless you have a way of showing that humans who have never received any sensory data other than a sequence of "number:(prime/composite)label" pairs would do well on this.)
↑ comment by Alexander Kolpakov (alexander-kolpakov) · 2023-10-03T22:40:09.611Z · LW(p) · GW(p)
Does any ML model that tells cats from dogs get definitions thereof? I think the only input it gets is "picture:(dog/cat)label". It does learn to tell them apart, to some degree, at least. One would expect the same approach here. Otherwise you can ask right away for the sieve of Eratosthenes as a functional and inductive definition, in which case things get easy ...
↑ comment by Nathaniel Monson (nathaniel-monson) · 2023-10-04T12:39:18.785Z · LW(p) · GW(p)
In that case, I believe your conjecture is trivially true, but has nothing to do with human intelligence or Bengio's statements. In context, he is explicitly discussing low dimensional representations of extremely high dimensional data, and the things human brains learn to do automatically (I would say analogously to a single forward pass).
If you want to make it a fair fight, you either need to demonstrate a human who learns to recognize primes without any experience of the physical world (please don't do this) or allow an ML model something more analogous to the data humans actually receive, which includes math instruction, interacting with the world, many brain cycles, etc.
↑ comment by Alexander Kolpakov (alexander-kolpakov) · 2023-12-01T18:00:51.341Z · LW(p) · GW(p)
I also believe my conjecture is true, however non-trivially. At least, mathematically non-trivially. Otherwise, all is trivial when the job is done.
↑ comment by Aidan Rocke (aidanrocke) · 2023-10-05T11:47:10.061Z · LW(p) · GW(p)
Regarding your remark on finding low-dimensional representations, I have added a section on physical intuitions for the challenge. Here I explain how the prime recognition problem corresponds to reliably finding a low-dimensional representation of high-dimensional data.
↑ comment by Ilio · 2023-10-02T17:57:59.278Z · LW(p) · GW(p)
Let P, P', P'' = "machine learning alone", "machine learning + noise", "is not a complete path to human-level intelligence".
A few follow-up questions: do you also think that P + P'' == P' + P''? Is your answer proven, or more or less uncontroversial? (refs welcome!)
comment by Mitchell_Porter · 2023-10-02T04:09:42.101Z · LW(p) · GW(p)
the empirical observation that deep learning models fail to approximate the Prime Counting Function
I can't find any empirical work on this...
If Bernhard Riemann knew of the Prime Counting Function, it would have had to be by other means than data compression
He obtained his "explicit formulas" by reasoning about an ideal object (his zeta function) which by construction, contains information about all prime numbers.
↑ comment by Aidan Rocke (aidanrocke) · 2023-10-02T08:03:56.424Z · LW(p) · GW(p)
Thank you for bringing up these points:
- Riemann's analysis back then was far from trivial and there were important gaps in his derivation of the explicit formulas for Prime Counting. What appears obvious now was far from obvious then.
- I just appended a summary of Yang-Hui He's experiments on the Prime Recognition problem.
Either way, I believe that additional experiments may be enlightening as the applied mathematics that mathematicians do is only true to the extent that it has verifiable consequences.
↑ comment by Mitchell_Porter · 2023-10-28T05:38:56.641Z · LW(p) · GW(p)
This might interest you: a language model is used to develop a model of inflation (expansion in the early universe), using a Kolmogorov-like principle (minimum description length).
↑ comment by Aidan Rocke (aidanrocke) · 2023-11-08T14:50:51.598Z · LW(p) · GW(p)
Thank you for sharing this. 👌
comment by Dhruv Ghulati (dhruv-ghulati) · 2023-11-25T22:41:55.170Z · LW(p) · GW(p)
ambient
Explain this with an example?
↑ comment by Alexander Kolpakov (alexander-kolpakov) · 2023-12-01T16:06:32.134Z · LW(p) · GW(p)
The data we are looking at may be points (approximately) situated on the surface of a sphere (dimension 2) in 3-space. However, the dimension drop could be much steeper. https://openreview.net/forum?id=XJk19XzGq2J
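A minimal sketch of such a dimension drop, using synthetic data: points on a circle (intrinsic dimension 1) embedded in 3-space (ambient dimension 3), with a crude Grassberger-Procaccia correlation-dimension estimate. The sample size and radii are illustrative choices.

```python
# Data with ambient dimension 3 but intrinsic dimension 1: points on a
# circle embedded in R^3, plus a crude correlation-dimension estimate.
import math
import random

random.seed(0)

# Sample points on a unit circle, then embed in R^3 (ambient dim = 3).
points = []
for _ in range(400):
    t = random.uniform(0, 2 * math.pi)
    points.append((math.cos(t), math.sin(t), 0.0))

def correlation_integral(pts, r):
    """Fraction of point pairs closer than r."""
    n, close = len(pts), 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(pts[i], pts[j]) < r:
                close += 1
    return close / (n * (n - 1) / 2)

# C(r) ~ r^d for small r, so the log-log slope estimates the intrinsic dim d.
r1, r2 = 0.05, 0.2
d_est = (math.log(correlation_integral(points, r2))
         - math.log(correlation_integral(points, r1))) / math.log(r2 / r1)
print(f"estimated intrinsic dimension: {d_est:.2f}")  # should be near 1, not 3
```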
comment by Dhruv Ghulati (dhruv-ghulati) · 2023-11-25T22:41:35.293Z · LW(p) · GW(p)
PAC learning
Explain this.
↑ comment by Alexander Kolpakov (alexander-kolpakov) · 2023-12-01T17:55:27.180Z · LW(p) · GW(p)
Here (https://stats.stackexchange.com/questions/142906/what-does-pac-learning-theory-mean) is an accessible explanation. In simple terms, it means you have a reasonable estimate of the amount of data needed to guarantee that you can learn a concept correctly with high probability.
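For a finite hypothesis class, that estimate can be computed directly from the standard realizable-case PAC bound, m >= (1/epsilon) * (ln|H| + ln(1/delta)). A minimal sketch (the class size, accuracy and confidence values below are illustrative):

```python
# Realizable-case PAC sample complexity for a finite hypothesis class H:
# with m >= (1/eps) * (ln|H| + ln(1/delta)) samples, a consistent learner
# outputs a hypothesis with error <= eps with probability >= 1 - delta.
import math

def pac_sample_bound(hypothesis_count: int, epsilon: float,
                     delta: float) -> int:
    """Samples sufficient for a consistent learner (realizable case)."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta))
                     / epsilon)

# e.g. |H| = 2**20 hypotheses, 5% error, 99% confidence:
print(pac_sample_bound(2 ** 20, epsilon=0.05, delta=0.01))  # prints 370
```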
comment by philip_b (crabman) · 2023-10-02T14:35:26.336Z · LW(p) · GW(p)
This sounds similar to whether a contemporary machine learning model can break a cryptographic cipher, a hash function, or something like that.
Replies from: alexander-kolpakov↑ comment by Alexander Kolpakov (alexander-kolpakov) · 2023-10-03T17:26:46.782Z · LW(p) · GW(p)
Yes and no. Yes, because prime inference with high accuracy would make codebreaking much easier. No, because, for example, in RSA you need to deal with semiprimes, and that setup seems different as per Sam Blake's research here: https://arxiv.org/abs/2308.12290