Posts
Comments
I have originally developed a machine learning notion which I call an LSRDR (
-spectral radius dimensionality reduction), and LSRDRs (and similar machine learning models) behave mathematically and they have a high level of interpretability which should be good for AI safety. Here, I am giving an example of how LSRDRs behave mathematically and how one can get the most out of interpreting an LSRDR.
Suppose that is a natural number. Let denote the quantum channel that takes an qubit quantum state and selects one of those qubits at random and send that qubit through the completely depolarizing channel (the completely depolarizing channel takes a state as input and returns the completely mixed state as an output).
If are complex matrices, then define superoperators and by setting
and for all .
Given tuples of matrices , define the L_2-spectral radius similarity between these tuples of matrices by setting
.
Suppose now that are matrices where . Let . We say that a tuple of complex by matrices is an LSRDR of if the quantity is locally maximized.
Suppose now that are complex -matrices and is an LSRDR of . Then my computer experiments indicate that there will be some constant where is similar to a positive semidefinite operator with eigenvalues and where the eigenvalue has multiplicity where denotes the binomial coefficient. I have not had a chance to try to mathematically prove this. Hooray. We have interpreted the LSRDR of , and I have plenty of other examples of interpreted LSRDRs.
We also have a similar pattern for the spectrum of . My computer experiments indicate that there is some constant where has spectrum where the eigenvalue has multiplicity .
In this note, I will continue to demonstrate not only the ways in which LSRDRs (-spectral radius dimensionality reduction) are mathematical but also how one can get the most out of LSRDRs. LSRDRs are one of the types of machine learning that I have been working on, and LSRDRs have characteristics that tell us that LSRDRs are often inherently interpretable which should be good for AI safety.
Suppose that is the quantum channel that maps a qubit state to a qubit state where we select one of the 6 qubits at random and send it through the completely depolarizing channel (the completely depolarizing channel takes a state as an input and returns the completely mixed state as an output). Suppose that are by matrices where has the Kraus representation .
The objective is to locally maximize the fitness level where the norm in question is the Euclidean norm and where denotes the spectral radius. This is a 1 dimensional case of an LSRDR of the channel .
Let when is selected to locally maximize the fitness level. Then my empirical calculations show that there is some where is positive semidefinite with eigenvalues and where the eigenvalue has multiplicity which is the binomial coefficient. But these are empirical calculations for select values ; I have not been able to mathematically prove that this is always the case for all local maxima for the fitness level (I have not tried to come up with a proof).
Here, we have obtained a complete characterization of up-to-unitary equivalence due to the spectral theorem, so we are quite close to completely interpreting the local maximum for our fitness function.
I made a few YouTube videos showcasing the process of maximizing the fitness level here.
Spectra of 1 dimensional LSRDRs of 6 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 7 qubit noise channel during training
Spectra of 1 dimensional LSRDRs of 8 qubit noise channel during training
I will make another post soon about more LSRDRs of a higher dimension of the same channel .
I personally like my machine learning algorithms to behave mathematically especially when I give them mathematical data. For example, a fitness function with apparently one local maximum value is a mathematical fitness function. It is even more mathematical if one can prove mathematical theorems about such a fitness function or if one can completely describe the local maxima of such a fitness function. It seems like fitness functions that satisfy these mathematical properties are more interpretable than the fitness functions which do not, so people should investigate such functions for AI safety purposes.
My notion of an LSRDR is a notion that satisfies these mathematical properties. To demonstrate the mathematical behavior of LSRDRs, let's see what happens when we take an LSRDR of the octonions.
Let denote either the field of real numbers or the field of complex numbers (
could also be the division ring of quaternions, but for simplicity, let's not go there). If are -matrices over , then an LSRDR (-spectral radius dimensionality reduction) of is a collection of -matrices that locally maximizes the fitness level
. denotes the spectral radius function while denotes the tensor product and denotes the matrix obtained from by replacing each entry with its complex conjugate. We shall call the maximum fitness level the -spectral radius of over the field , and we shall wrote for this spectral radius.
Define the linear superoperator by setting
and set . Then the fitness level of is .
Suppose that is an -dimensional real inner product space. Then the octonionic multiplication operation is the unique up-to-isomorphism bilinear binary operation on together with a unit such that and for all x. If we drop the condition that the octonions have a unit, then we do not quite have this uniqueness result.
We say that an octonion-like algbera is a -dimensional real inner product space together with a unique up-to-isomorphism bilinear operation such that for all .
Let be a specific octonion-like algebra.
Suppose now that is an orthonormal basis for (this does not need to be the standard basis). Then for each , let be the linear operator from to defined by setting for all vectors . All non-zero linear combinations of are conformal mappings (this means that they preserve angles). Now that we have turned the octonion-like algebra into matrices, we can take an LSRDR of the octonion-like algebras, but when taking the LSRDR of octonion-like algebras, we should not worry about the choice of orthonormal basis since I could formulate everything in a coordinate-free manner.
Empirical Observation from computer calculations: Suppose that and is the field of real numbers. Then the following are equivalent.
- The matrices are a LSRDR of over where has a unique real dominant eigenvalue.
- There exists matrices where for all and where is an orthonormal projection matrix.
In this case, and this fitness level is reached by the matrices in the above equivalent statements. Observe that the superoperator is similar to a direct sum of and a zero matrix. But the projection matrix is a dominant eigenvector of and of as well.
I have no mathematical proof of the above fact though.
Now suppose that . Then my computer calculations yield the following complex -spectral radii:
Each time that I have trained a complex LSRDR of , I was able to find a fitness level that is not just a local optimum but also a global optimum.
In the case of the real LSRDRs, I have a complete description of the LSRDRs of . This demonstrates that the octonion-like algebras are elegant mathematical structures and that LSRDRs behave mathematically in a manner that is compatible with the structure of the octonion-like algebras.
I have made a few YouTube videos that animate the process of gradient ascent to maximize the fitness level.
Edit: I have made some corrections to this post on 9/22/2024.
Fitness levels of complex LSRDRs of the octonions (youtube.com)
Here is an example of what might happen. Suppose that for each , we select a orthonormal basis of unit vectors for . Let . Then
Then for each quantum channel , by the concavity of the logarithm function (which is the arithmetic-geometric mean inequality), we have
. Here, equality is reached if and only if
for each , but this equality can be achieved by the channel
defined by which is known as the completely depolarizing channel. This is the channel that always takes a quantum state and returns the completely mixed state. On the other hand, the channel has maximum Choi rank since the Choi representation of is just the identity function divided by the rank. This example is not unexpected since for each input of the possible outputs span the entire space evenly, so one does not have any information about the output from any particular input except that we know that the output could be anything. This example shows that the channels that locally minimize the loss function are the channels that give us a sort of linear regression of but where this linear regression takes into consideration uncertainty in the output so the regression of a output of a state is a mixed state rather than a pure state.
The notion of the linear regression is an interesting machine learning algorithm in the sense that it can be studied mathematically, but the notion of a linear regression is a quite limited machine learning algorithm as most relations are non-linear. In particular, the linear regression does not give us any notion of any uncertainty in the output.
One way to extend the notion of the linear regression to encapsulate uncertainty in the outputs is to regress a function not to a linear transformation mapping vectors to vectors, but to regress the function to a transformation that maps vectors to mixed states. And the notion of a quantum channel is an appropriate transformation that maps vectors to mixed states. One can optimize this quantum channel using gradient ascent.
For this post, I will only go through some basic facts about quantum information theory. The reader is referred to the book The Theory of Quantum Information by John Watrous for all the missing details.
Let be a complex Euclidean space. Let denote the vector space of linear operators from to . Given complex Euclidean spaces , we say that a linear operator from to is a trace preserving if
for all , and we say that is completely positive if there are linear transformations where for all ; the value is known as the Choi rank of . A completely positive trace preserving operator is known as a quantum channel.
The collection of quantum channels from to is compact and convex.
If is a complex Euclidean space, then let
denote the collection of pure states in .
can be defined either as the set of unit vector in modulo linear dependence, or
can be also defined as the collection of positive semidefinite rank- operators on with trace .
Given complex Euclidean spaces and a (multi) set of distinct ordered pairs of unit vectors , and given a quantum channel
, we define the fitness level and the loss level .
We may locally optimize to minimize its loss level using gradient descent, but there is a slight problem. The set of quantum channels spans the which has dimension . Due to the large dimension, any locally optimal will contain many parameters, and this is a large quantity of parameters for what is supposed to be just a glorified version of a linear regression. Fortunately, instead of taking all quantum channels into consideration, we can limit the scope the quantum channels of limited Choi rank.
Empirical Observation: Suppose that are complex Euclidean spaces, is finite and is a positive integer. Then computer experiments indicate that there is typically only one quantum channel of Choi rank at most where is locally minimized. More formally, if we run the experiment twice and produce two quantum channels where is locally minimized for , then we would typically have . We therefore say that when is minimized, is the best Choi rank quantum channel approximation to .
Suppose now that is a multiset. Then we would ideally like to approximate the function better by alternating between the best Choi rank r quantum channel approximation and a non-linear mapping. An ideal choice of a non-linear but partial mapping is the function that maps a positive semidefinite matrix to its (equivalence class of) unit dominant eigenvector.
Empirical observation: If and is the best Choi rank quantum channel approximation to , then let for all , and define . Let be a small open neighborhood of . Let . Then we typically have . More generally, the best Choi rank quantum channel approximation to is typically the identity function.
From the above observation, we see that the vector is an approximation of that cannot be improved upon. The mapping is therefore a trainable approximation to the mapping and since are not even linear spaces (these are complex projective spaces with non-trivial homology groups), the mapping is a non-linear model for the function to .
I have been investigating machine learning models similar to for cryptocurrency research and development as these sorts of machine learning models seem to be useful for evaluating the cryptographic security of some proof-of-work problems and other cryptographic functions like block ciphers and hash functions. I have seen other machine learning models that behave about as mathematically as .
I admit that machine learning models like are currently far from being as powerful as deep neural networks, but since behaves mathematically, the model should be considered as a safer and interpretable AI model. The goal is to therefore develop models that are mathematical like but which can perform more and more machine learning tasks.
(Edited 8/14/2024)
There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.
Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.
Theorem: Suppose that is a real,complex, or quaternionic -matrix that minimizes the quantity . Then is unitary.
Proof: The real case is a special case of a complex case, and by representing each -quaternionic matrix as a complex -matrix, we may assume that is a complex matrix.
By the Schur decomposition, we know that where is a unitary matrix and is upper triangular. But we know that . Furthermore, , so . Let denote the diagonal matrix whose diagonal entries are the same as . Then and . Furthermore, iff T is diagonal and iff is diagonal. Therefore, since and is minimized, we can conclude that , so is a diagonal matrix. Suppose that has diagonal entries . By the arithmetic-geometric mean equality and the Cauchy-Schwarz inequality, we know that
Here, the equalities hold if and only if for all , but this implies that is unitary. Q.E.D.
I do not care to share much more of my reasoning because I have shared enough and also because there is a reason that I have vowed to no longer discuss except possibly with lots of obfuscation. This discussion that we are having is just convincing me more that the entities here are not the entities I want to have around me at all. It does not do much good to say that the community here is acting well or to question my judgment about this community. It will do good for the people here to act better so that I will naturally have a positive judgment about this community.
You are judging my reasoning without knowing all that went into my reasoning. That is not good.
I will work with whatever data I have, and I will make a value judgment based on the information that I have. The fact that Karma relies on very small amounts of information is a testament to a fault of Karma, and that is further evidence of how the people on this site do not want to deal with mathematics. And the information that I have indicates that there are many people here who are likely to fall for more scams like FTX. Not all of the people here are so bad, but I am making a judgment based on the general atmosphere here. If you do not like my judgment, then the best thing would be to try to do better. If this site has made a mediocre impression on me, then I am not at fault for the mediocrity here.
Let's see whether the notions that I have talked about are sensible mathematical notions for machine learning.
Tensor product-Sometimes data in a neural network has tensor structure. In this case, the weight matrices should be tensor products or tensor sums. Regarding the structure of the data works well with convolutional neural networks, and it should also work well for data with tensor structure to it.
Trace-The trace of a matrix measures how much the matrix maps vectors onto themselves since
where follows the multivariate normal distribution.
Spectral radius-Suppose that we are iterating a smooth function . Suppose furthermore that and is near and . We would like to determine whether or not. If the Jacobian of at has spectral radius less than , then ,. If the Jacobian of at has spectral radius greater than , then this limit does not converge.
The notions that I have been talking about are sensible and arise in machine learning. And understanding these notions is far easier than trying to interpret very large networks like GPT-4 without using these notions. Many people on this site just act like clowns. Karma is only a good metric when the people on the network value substance over fluff. And the only way to convince me otherwise will be for the people here to value posts that involve basic notions like the trace, eigenvalues, and spectral radius of matrices.
P.S. I can make the trace, determinant, and spectral radius even simpler. These operations are what you get when you take the sum, product, and the maximum absolute value of the eigenvalues. Yes. Those are just the basic eigenvalue operations.
Talking about whining and my loss of status is a good way to get me to dislike the LW community and consider them to be anti-intellectuals who fall for garbage like FTX. Do you honestly think the people here should try to interpret large sections of LLMs while simultaneously being afraid of quaternions?
It is better to comment on threads where we are interacting in a more positive manner.
I thought apologizing and recognizing inadequacies was a core rationalist skill. And I thought rationalists were supposed to like mathematics. The lack of mathematical appreciation is one of these inadequacies of the LW community. But instead of acknowledging this deficiency, the community here blasts me as talking about something off topic. How ironic!
I usually think of the field of complex numbers algebraically, but one can also think of the real numbers, complex numbers, and quaternions geometrically. The real numbers are good with dealing with 1 dimensional space, and the complex numbers are good for dealing with 2 dimensional space geometrically. While the division ring of quaternions is a 4 dimensional algebra over the field of real numbers, the quaternions are best used for dealing with 3 dimensional space geometrically.
For example, if are open subsets of some Euclidean space, then a function is said to be a conformal mapping when it preserves angles and the orientation. We can associate the 2-dimensional Euclidean space with the field of complex numbers, and the conformal mappings between open subsets of 2-dimensional spaces are just the complex differentiable mappings. For the Mandelbrot set, we need this conformality because we want the Mandelbrot set to look pretty. If the complex differentiable maps were not conformal, then the functions that we iterate in complex dynamics would stretch subsets of the complex plane in one dimension and expand them in the other dimension and this would result in a fractal that looks quite stretched in one real dimension and squashed in another dimension (the fractals would look like spaghetti; oh wait, I just looked at a 3D fractal and it looks like some vegetable like broccoli). This stretching and squashing is illustrated by 3D fractals that try to mimic the Mandelbrot set but without any conformality. The conformality is why the Julia sets are sensible (mathematicians have proven theorems about these sets) for any complex polynomial of degree 2 or greater.
For the quaternions, it is well-known that the dot product and the cross product operations on 3 dimensional space can be described in terms of the quaternionic multiplication operation between purely imaginary quaternions.
Um. If you want to convince a mathematician like Terry Tao to be interested in AI alignment, you will need to present yourself as a reasonably competent mathematician or related expert and actually formulate an AI problem in such a way so that someone like Terry Tao would be interested in it. If you yourself are not interested in the problem, then Terry Tao will not be interested in it either.
Terry Tao is interested in random matrix theory (he wrote the book on it), and random matrix theory is somewhat related to my approach to AI interpretability and alignment. If you are going to send these problems to a mathematician, please inform me about this before you do so.
My approach to alignment: Given matrices , define a superoperator by setting
, and define . Define the -spectral radius of as . Here, is the usual spectral radius.
Define . Here, is either the field of reals, field of complex numbers, or division ring of quaternions.
Given matrices , define
. The value is always a real number in the interval that is a measure of how jointly similar the tuples are. The motivation behind is that is always a real number in (well except when the denominator is zero) that measures how well can be approximated by -matrices. Informally, measures how random are where a lower value of indicates a lower degree of randomness.
A better theoretical understanding of would be great. If and is locally maximized, then we say that is an LSRDR of . Said differently, is an LSRDR of if the similarity is maximized.
Here, the notion of an LSRDR is a machine learning notion that seems to be much more interpretable and much less subject to noise than many other machine learning notions. But a solid mathematical theory behind LSRDRs would help us understand not just what LSRDRs do, but the mathematical theory would help us understand why they do it.
Problems in random matrix theory concerning LSRDRs:
- Suppose that are random matrices (according to some distribution). Then what are some bounds for .
- Suppose that are random matrices and are non-random matrices. What can we say about the spectrum of ? My computer experiments indicate that this spectrum satisfies the circular law, and the radius of the disc for this circular law is proportional to , but a proof of this circular law would be nice.
- Tensors can be naturally associated with collections of matrices. Suppose now that are the matrices associated with a random tensor. Then what are some bounds for .
P.S. By massively downvoting my posts where I talk about mathematics that is clearly applicable to AI interpretability and alignment, the people on this site are simply demonstrating that they need to do a lot of soul searching before they annoy people like Terry Tao with their lack of mathematical expertise.
P.P.S. Instead of trying to get a high profile mathematician like Terry Tao to be interested in problems, it may be better to search for a specific mathematician who is an expert in a specific area related to AI alignment since it may be easier to contact a lower profile mathematician, and a lower profile mathematician may have more specific things to say and contribute. You are lucky that Terry Tao is interested in random matrix theory, but this does not mean that Terry Tao is interested in anything in the intersection between alignment and random matrix theory. It may be better to search harder for mathematicians who are interested in your specific problems.
P.P.P.S. To get more mathematicians interested in alignment, it may be a good idea to develop AI systems that behave much more mathematically. Neural networks currently do not behave very mathematically since they look like the things that engineers would come up with instead of mathematicians.
P.P.P.P.S. I have developed the notion of an LSRDR for cryptocurrency research because I am using this to evaluate the cryptographic security of cryptographic functions.
We can use the spectral radius similarity to measure more complicated similarities between data sets.
Suppose that are -real matrices and are -real matrices. Let denote the spectral radius of and let denote the tensor product of with . Define the -spectral radius by setting , Define the -spectral radius similarity between and as
.
We observe that if is invertible and is a constant, then
Therefore, the -spectral radius is able to detect and measure symmetry that is normally hidden.
Example: Suppose that are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation with for all (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then is a measure of low close we are to obtaining such an affine transformation . We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that are both the first few layers in a (typically different) neural network. Suppose that is a set of data points. Then if and , then is a measure of the similarity between and .
I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of of ordered pairs and I trained the neural networks to minimize the expected losses . In my experiment, each was a random vector of length 32 whose entries were 0's and 1's. In my experiment, the similarity was worse than if were just random vectors.
This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are way too messy in order for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the -spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).
I am curious about your statement that all large neural networks are isomorphic or nearly isomorphic and therefore have identical loss values. This should not be too hard to test.
Let be training data sets. Let be neural networks. First train on and on . Then slowly switch the training sets, so that we eventually train both and on just . After fully training and , one should be able to train an isomorphism between the networks and (here I assume that and are designed properly so that they can produce such an isomorphism) so that the value for each node in can be perfectly computed from each node in . Furthermore, for every possible input, the neural networks should give the exact same output. If this experiment does not work, then one should be able to set up another experiment that does actually work.
I have personally trained many ML systems for my cryptocurrency research where after training two systems on the exact same data but with independent random initializations, the fitness levels are only off by a floating point error of about , and I am able to find an exact isomorphism between these systems (and sometimes they are exactly the same and I do not need to find any isomorphism). But I have designed these ML systems to satisfy these properties along with other properties, and I have not seen this with neural networks. In fact, the property of attaining the exact same fitness level is a bit fragile.
I found a Bachelor's thesis (people should read these occasionally; I apologize for selecting a thesis from Harvard) where someone tried to find an isomorphism between 1000 small trained machine learning models, and no such isomorphism was found.
https://dash.harvard.edu/bitstream/handle/1/37364688/SORENSEN-SENIORTHESIS-2020.pdf?sequence=1
Or maybe one can find a more complicated isomorphism between neural networks since a node permutation is quite oversimplistic.
I have made a few minor and mostly cosmetic edits to the post about the dimensionality reduction of tensors that produces so many trace free matrices and also to the post about using LSRDRs to solve a combinatorial graph theory problem.
"What's the problem?"-Neural networks are horribly uninterpretable, so it would be nice if we could use more interpretable AI models or at least better interpretability tools. Neural networks seem to include a lot of random information, so it would be good to use AI models that do not include so much random information. Do you think that we would have more interpretable models by forsaking all mathematical theory?
"what does this get us?"-This gets us systems trained by gradient ascent that behave much more mathematically. Mathematical AI is bound to be highly interpretable.
The downvotes display a very bad attitude, and they indicate that the LW community is a community that I really do not want much to do with at worst, and at best, the LW community is a community that lacks discipline and such mathematics texts will be needed to instill such discipline. In those posts that you have looked at, I did not include any mathematical proofs (these are empirical observations, so I could not include proof), and the lack of mathematical proofs makes the text much easier to go through. I also made the texts quite short; I only included enough text to pretty much define the fitness function and then state what I have observed.
For toy examples, I just worked with random complex matrices, and I wanted these matrices to be sufficiently small so that I can make and run the code to compute with these matrices quite quickly, but these matrices need to be large enough so that I can properly observe what is going on. I do not want to make an observation about tiny matrices that do not have any relevance to what is going on in the real world.
If we want to be able to develop safer AI systems, we will need to make them much more mathematical, and people are doing a great disservice by hating the mathematics needed for developing these safer AI systems.
I would go further than this. Future architectures will not only be designed for improved performance, but they will be (hopefully) increasingly designed to optimize safety and interpretability as well, so they will likely be much different than the architectures we see today. It seems to me (this is my personal opinion based on my own research for cryptocurrency technologies, so my opinion does not match anyone else's opinion) that non-neural network machine learning models (but which are probably still trained by moving in the direction of a vector field) or at least safer kinds of neural network architectures are needed. The best thing to do will probably to work on alignment, interpretability, and safety for all known kinds of AI models and develop safer AI architectures. Since future systems will be designed not just for performance but for alignability, safety, and interpretability as well, we may expect for these future systems to be easier to align than systems that are simply designed for performance.
The -spectral radius similarity is not transitive. Suppose that are -matrices and are real -matrices. Then define . Then the generalized Cauchy-Schwarz inequality is satisfied:
.
We therefore define the -spectral radius similarity between and as . One should think of the -spectral radius similarity as a generalization of the cosine similarity between vectors . I have been using the -spectral radius similarity to develop AI systems that seem to be very interpretable. The -spectral radius similarity is not transitive.
and
, but can take any value in the interval .
We should therefore think of the -spectral radius similarity as a sort of least upper bound of -valued equivalence relations than a -valued equivalence relation. We need to consider this as a least upper bound because matrices have multiple dimensions.
Notation: is the spectral radius. The spectral radius is the largest magnitude of an eigenvalue of the matrix . Here the norm does not matter because we are taking the limit. is the direct sum of matrices while denotes the Kronecker product of matrices.
I appreciate your input. I plan on making more posts like this one with a similar level of technical depth. Since I included a proof with this post, this post contained a bit more mathematics than usual. With that being said, others have stated that I should be aware of the mathematical prerequisites for posts like this, so I will keep the mathematical prerequisites in mind.
Here are some more technical thoughts about this.
- We would all agree that the problem of machine learning interpretability is a quite difficult problem; I believe that the solution to the interpretability problem requires us not only to use better interpretability tools, but the machine learning models themselves need to be more inherently interpretable. MPO word embeddings and similar constructions have a little bit (but not too much) of difficulty since one needs to get used to different notions. For example, if we use neural networks using ReLU activation (or something like that), then one has less difficulty upfront, but when it comes time to interpret such a network, the difficulty in interpretability will increase since neural networks with ReLU activation do not seem to have the right interpretability properties, so I hesitate to interpret neural networks. And even if we do decide to interpret neural networks, the interpretability tools that we use may have a more complicated design than the networks themselves.
- There are some good reasons why complex numbers and quaternions have relatively little importance in machine learning. And these reasons do not apply to constructions like MPO word embeddings.
- Since equal norm tight frames are local minimizers of the frame potential, it would help to have a good understanding of the frame potential. For simplicity, it is a good idea to only look at the real case. The frame potential is a potential for a force between a collection of particles on the sphere where particles are repelled from each other (and from each other's antipodal point) and where the force tries to make all the particles orthogonal to each other. If , then it is possible to make all of the particles orthogonal to each other, and in this case, when we minimize this potential, the equal norm tight frames will simply be orthonormal bases. In the case when , we cannot make all of the particles orthogonal to each other, but we can try to get as close as possible. Observe that unlike the Newtonian and logarithmic potential, the frame potential does not have a singularity for when the two particles over lap. I will leave it to you to take the gradient (at least in the real case) of the frame potential to see exactly what this force does to the particles.
- Training an MPO word embedding with the complex numbers of quaternions is actually easier in the sense that for real MPO word embeddings, one needs to use a proper initialization, but with complex and quaternionic MPO word embeddings, an improper initialization will only result in minor deficiencies in the MPO word embedding. This means that the quaternions and complex numbers are easier to work with for MPO word embeddings than the real numbers. In hindsight, the solution to the problem of real MPO word embeddings is obvious, but at the time, I thought that I must use complex or quaternionic matrices.
- I like the idea of making animations, but even in the real case where things are easy to visualize, the equal norm tight frames are non-unique and they may involve many dimensions. The non-uniqueness will make it impossible to interpret the equal norm tight frames; for the same reason, it is hard to interpret what is happening with neural networks since if you retrain a neural network with a different initialization or learning rate, you will end up with a different trained network, but MPO word embeddings have much more uniqueness properties that make them easier to interpret. I have made plenty of machine learning training animations and have posted these animations on YouTube and TikTok, but it seems like in most cases, the animation still needs to be accompanied by technical details; with just an animation, the viewers can see that something is happening with the machine learning model, but they need both the animation and technical details to interpret what exactly is happening. I am afraid that most viewers just stick with the animations without going into so many technical details. I therefore try to make the animations more satisfying than informative most of the time.
If you have any questions about the notation or definitions that I have used, you should ask about it in the mathematical posts that I have made and not here. Talking about it here is unhelpful, condescending, and it just shows that you did not even attempt to read my posts. That will not win you any favors with me or with anyone who cares about decency.
Karma is not only imperfect, but Karma has absolutely no relevance whatsoever because Karma can only be as good as the community here.
P.S. Asking a question about the notation does not even signify any lack of knowledge since a knowledgeable person may ask questions about the notation because the knowledgeable person thinks that the post should not assume that the reader has that background knowledge.
P.P.S. I got downvotes, so I got enough engagement on the mathematics. The problem is the community here thinks that we should solve problems with AI without using any math for some odd reason that I cannot figure out.
I am pointing out something wrong with the community here. The name of this site is LessWrong. On this site, it is better to acknowledge wrongdoing so that the people here do not fall into traps like FTX again. If you read the article, you would know that it is better to acknowledge wrongdoing or a community weakness than to double down.
I already did that. But it seems like the people here simply do not want to get into much mathematics regardless of how closely related to interpretability it is.
P.S. If anyone wants me to apply my techniques to GPT, I would much rather see the embedding spaces as more organized objects. I cannot deal very well with words that are represented as vectors of length 4096 very well. I would rather deal with words that are represented as 64 by 64 matrices (or with some other dimensions). If we want better interpretability, the data needs to be structured in a more organized fashion so that it is easier to apply interpretability tools to the data.
"Lesswrong has a convenient numerical proxy-metric of social status: site karma."-As long as I get massive downvotes for talking correctly about mathematics and using it to create interpretable AI systems, we should all regard karma as a joke. Karma can only be as good as the community here.
Let's compute some inner products and gradients.
Set up: Let denote either the field of real or the field of complex numbers. Suppose that are positive integers. Let be a sequence of positive integers with . Suppose that is an -matrix whenever . Then from the matrices , we can define a -tensor . I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the -distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors , namely the inner product of such tensors and the gradient of the inner product with respect to the matrices .
Notation: If are matrices, then let denote the superoperator defined by letting . Let .
Inner product: Here is the computation of the inner product of our tensors.
.
In particular, .
Gradient: Observe that . We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.
.
So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.
Suppose that is either the field of real numbers or the field of complex numbers. Suppose that are positive integers and is a sequence of positive integers with . Suppose that is an -matrix whenever . Then define a tensor .
If , and is a system of matrices that minimizes the value , then is a dimensionality reduction of , and we shall denote let denote the tensor of reduced dimension . We shall call a matrix table to tensor dimensionality reduction of type .
Observation 1: (Sparsity) If is sparse in the sense that most entries in the tensor are zero, then the tensor will tend to have plenty of zero entries, but as expected, will be less sparse than .
Observation 2: (Repeated entries) If is sparse and and the set has small cardinality, then the tensor will contain plenty of repeated non-zero entries.
Observation 3: (Tensor decomposition) Let be a tensor. Then we can often find a matrix table to tensor dimensionality reduction of type so that is its own matrix table to tensor dimensionality reduction.
Observation 4: (Rational reduction) Suppose that is sparse and the entries in are all integers. Then the value is often a positive integer in both the case when has only integer entries and in the case when has non-integer entries.
Observation 5: (Multiple lines) Let be a fixed positive even number. Suppose that is sparse and the entries in are all of the form for some integer and . Then the entries in are often exclusively of the form as well.
Observation 6: (Rational reductions) I have observed a sparse tensor all of whose entries are integers along with matrix table to tensor dimensionality reductions of where .
This is not an exclusive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.
From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.
So in my research into machine learning algorithms that I can use to evaluate small block ciphers for cryptocurrency technologies, I have just stumbled upon a dimensionality reduction for tensors in tensor products of inner product spaces that according to my computer experiments exists, is unique, and which reduces a real tensor to another real tensor even when the underlying field is the field of complex numbers. I would not be too surprised if someone else came up with this tensor dimensionality reduction before since it has a rather simple description and it is in a sense a canonical tensor dimensionality reduction when we consider tensors as homogeneous non-commutative polynomials. But even if this tensor dimensionality reduction is not new, this dimensionality reduction algorithm belongs to a broader class of new algorithms that I have been studying recently such as LSRDRs.
Suppose that is either the field of real numbers or the field of complex numbers. Let be finite dimensional inner product spaces over with dimensions respectively. Suppose that has basis . Given , we would sometimes want to approximate the tensor with a tensor that has less parameters. Suppose that is a sequence of natural numbers with . Suppose that is a matrix over the field for and . From the system of matrices , we obtain a tensor . If the system of matrices locally minimizes the distance , then the tensor is a dimensionality reduction of which we shall denote by .
Intuition: One can associate the tensor product with the set of all degree homogeneous non-commutative polynomials that consist of linear combinations of the monomials of the form . Given, our matrices , we can define a linear functional by setting . But by the Reisz representation theorem, the linear functional is dual to some tensor in . More specifically, is dual to . The tensors of the form are therefore the
Advantages:
- In my computer experiments, the reduced dimension tensor is often (but not always) unique in the sense that if we calculate the tensor twice, then we will get the same tensor. At least, the distribution of reduced dimension tensors will have low Renyi entropy. I personally consider the partial uniqueness of the reduced dimension tensor to be advantageous over total uniqueness since this partial uniqueness signals whether one should use this tensor dimensionality reduction in the first place. If the reduced tensor is far from being unique, then one should not use this tensor dimensionality reduction algorithm. If the reduced tensor is unique or at least has low Renyi entropy, then this dimensionality reduction works well for the tensor .
- This dimensionality reduction does not depend on the choice of orthonormal basis . If we chose a different basis for each , then the resulting tensor of reduced dimensionality will remain the same (the proof is given below).
- If is the field of complex numbers, but all the entries in the tensor happen to be real numbers, then all the entries in the tensor will also be real numbers.
- This dimensionality reduction algorithm is intuitive when tensors are considered as homogeneous non-commutative polynomials.
Disadvantages:
- This dimensionality reduction depends on a canonical cyclic ordering the inner product spaces .
- Other notions of dimensionality reduction for tensors such as the CP tensor dimensionality reduction and the Tucker decompositions are more well-established, and they are obviously attempted generalizations of the singular value decomposition to higher dimensions, so they may be more intuitive to some.
- The tensors of reduced dimensionality have a more complicated description than the tensors in the CP tensor dimensionality reduction.
Proposition: The set of tensors of the form does not depend on the choice of bases .
Proof: For each , let be an alternative basis for . Then suppose that for each . Then
. Q.E.D.
A failed generalization: An astute reader may have observed that if we drop the requirement that , then we get a linear functional defined by letting
. This is indeed a linear functional, and we can try to approximate using a the dual to , but this approach does not work as well.
Thanks for pointing that out. I have corrected the typo. I simply used the symbol for two different quantities, but now the probability is denoted by the symbol .
Every entry in a matrix counts for the -spectral radius similarity. Suppose that are real -matrices. Set . Define the -spectral radius similarity between and to be the number
. Then the -spectral radius similarity is always a real number in the interval , so one can think of the -spectral radius similarity as a generalization of the value where are real or complex vectors. It turns out experimentally that if are random real matrices, and each is obtained from by replacing each entry in with with probability , then the -spectral radius similarity between and will be about . If , then observe that as well.
Suppose now that are random real matrices and are the submatrices of respectively obtained by only looking at the first rows and columns of . Then the -spectral radius similarity between and will be about . We can therefore conclude that in some sense is a simplified version of that more efficiently captures the behavior of than does.
If are independent random matrices with standard Gaussian entries, then the -spectral radius similarity between and will be about with small variance. If are random Gaussian vectors of length , then will on average be about for some constant , but will have a high variance.
These are some simple observations that I have made about the spectral radius during my research for evaluating cryptographic functions for cryptocurrency technologies.
The problem of unlearning would be solved (or kind of solved) if we just used machine learning models that optimize fitness functions that always converged to the same local optimum regardless of the initial conditions (pseudodeterministic training) or at least has very few local optima. But this means that we will have to use something other than neural networks for this and instead use something that behaves much more mathematically. Here the difficulty is to construct pseudodeterministically trained machine learning models that can perform fancy tasks about as efficiently as neural networks. And, hopefully we will not have any issues with a partially retrained pseudodeterministically trained ML model remembering just enough of the bad thing to do bad stuff.
The cryptocurrency sector is completely and totally unable to see any merit in using cryptocurrency mining to solve a scientific problem regardless of the importance of the scientific problem or the complete lack of drawbacks from using such a scientific problem as their mining algorithm. Yes. They would rather just see Bitcoin mining waste as much resources as possible than put those resources to good use. Since the cryptocurrency sector lacks the characteristics that should be desirable, FTX should not have surprised anyone.
I think that all that happened here was the matrices just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.
Suppose that are natural numbers. Let . Let be a complex number whenever . Let be the fitness function defined by letting . Here, denotes the spectral radius of a matrix while denotes the Schatten -norm of .
Now suppose that is a tuple that maximizes . Let be the fitness function defined by letting . Then suppose that is a tuple that maximizes . Then we will likely be able to find an and a non-zero complex number where .
In this case, represents the training data while the matrices is our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model without any distortion to the data values.
I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.
I forgot to mention another source of difficulty in getting the energy efficiency of the computation down to Landauer's limit at the CMB temperature.
Recall that the Stefan Boltzmann equation states that the power being emitted from an object by thermal radiation is equal to . Here, stands for power, is the surface area of the object, is the emissivity of the object ( is a real number with ), is the temperature, and is the Stefan-Boltzmann constant. Here, .
Suppose therefore that we want a Dyson sphere with radius that maintains a temperature of 4 K which is slightly above the CMB temperature. To simplify the calculations, I am going to ignore the energy that the Dyson sphere receives from the CMB so that I obtain a lower bound for the size of our Dyson sphere. Let us assume that our Dyson sphere is a perfect emitter of thermal radiation so that .
Earth's surface has a temperature of about . In order to have a temperature of , our Dyson sphere needs to receive the energy per unit of area. This means that the Dyson sphere needs to have a radius of about astronomical units (recall that the distance from Earth to the sun is 1 astronomical unit).
Let us do more precise calculations to get a more exact radius of our Dyson sphere.
, so which is about 15 percent of a light-year. Since the nearest star is 4 light years away, by the time that we are able to construct a Dyson sphere with a radius that is about 15 percent of a light year, I think that we will be able to harness energy from other stars such as Alpha Centauri.
The fourth power in the Stefan Boltzmann equation makes it hard for cold objects to radiate heat.
This post uses the highly questionable assumption that we will be able to produce a Dyson sphere that can maintain a temperature at the level of the cosmic microwave background before we will be able to use energy efficient reversible computation to perform operations that cost much less than energy. And this post also makes the assumption that we will achieve computation at the level of about per bit deletion before we will be able to achieve reversible computation. And it gets difficult to overcome thermal noise at an energy level well above regardless of the type of hardware that one uses. At best, this post is an approximation for the computational power of a Dyson sphere that may be off by some orders of magnitude.
Let \(X,Y\) be topological spaces. Then a function \(f:X\rightarrow Y\) is continuous if and only if whenever \((x_d)_{d\in D}\) is a net that converges to the point \(x\), the net \((f(x_d))_{d\in D}\) also converges to the point \(f(x)\). This is not very hard to prove. This means that we do not have to discuss as to whether continuity should be defined in terms of open sets instead of limits because both notions apply to all topological spaces. If anything, one should define continuity in terms of closed sets instead of open sets since closed generalize slightly better to objects known as closure systems (which are like topological spaces, but we do not require the union of two closed sets to be closed). For example, the collection of all subgroups of a group is a closure system, but the complements of the subgroups of a group have little importance, so if we want the definition that makes sense in the most general context, closed sets behave better than open sets. And as a bonus, the definition of continuity works well when we are taking the inverse image of closed sets and when we are taking the closure of the image of a set.
With that being said, the good thing about continuity is that it has enough characterizations so that at least one of these characterizations is satisfying (and general topology texts should give all of these characterizations even in the context of closure systems so that the reader can obtain such satisfaction with the characterization of his or her choosing).
I have heard of filters and ultrafilters, but I have never heard of anyone calling any sort of filter a hyperfilter. Perhaps it is because the ultrafilters are used to make fields of hyperreal numbers, so we can blame this on the terminology. Similarly, the uniform spaces where the hyperspace is complete are called supercomplete instead of hypercomplete.
But the reason why we need to use a filter instead of a collection of sets is that we need to obtain an equivalence relation.
Suppose that is an index set and is a set with for . Then let be a collection of subsets of . Define a relation on by setting if and only if . Then in order for to be an equivalence relation, must be reflexive, symmetric, and transitive. Observe that is always symmetric, and is reflexive precisely when .
Proposition: The relation is transitive if and only if is a filter.
Proof:
Suppose that is a filter. Then whenever , we have
, so since
, we conclude that as well. Therefore, .
. Suppose now that . Then let let where denotes the characteristic function. Then and . Therefore,, so by transitivity, as well, hence .
Suppose now that and . Let and set .
Observe that and . Therefore, . Thus, by transitivity, we know that . Therefore, . We conclude that is closed under taking supersets. Therefore, is a filter.
Q.E.D.
Yes. We have 2=[(2,2,2,...)]. But we can compare 2 with (1,3,1,3,1,3,...) since (1,3,1,3,1,3,1,3,...)=1 (this happens when the set of all even natural numbers is in your ultrafilter) or (1,3,1,3,1,3,1,3,...)=3 (this happens when the set of all odd natural numbers is in your ultrafilter). Your partially ordered set is actually a linear ordering because whenever we have two sequences , one of the sets
is in your ultrafilter (you can think of an ultrafilter as a thing that selects one block out of every partition of the natural numbers into finitely many pieces), and if your ultrafilter contains
, then .
I trained a (plain) neural network on a couple of occasions to predict the output of the function where are bits and denotes the XOR operation. The neural network was hopelessly confused despite the fact that neural networks usually do not have any trouble memorizing large quantities of random information. This time the neural network could not even memorize the truth table for XOR. While the operation is linear over the field , it is quite non-linear over . The inability for a simple neural network to learn this function indicates that neural networks are better at learning when they are not required to stray too far away from linearity.
Neural networks with ReLU activation are the things you obtain when you combine two kinds of linearity, namely the standard linearity that we all should be familiar with and tropical linearity.
Give two operations defined by setting . Then the operations are associative, commutative, and they satisfy the distributivity property . We shall call the operations tropical operations on .
We can even perform matrix and vector operations by replacing the operations with their tropical counterparts . More explicitly, we can define the tropical matrix addition and multiplication operations by setting
and
.
Here, the ReLU operation is just , and if is the zero vector and is a real vector, then , so ReLU does not even rely on tropical matrix multiplication.
Of course, one can certainly construct and train neural networks using tropical matrix multiplication in the layers of the form where are weight matrices and are bias vectors, but I do not know of any experiments done with these sorts of neural networks, so I am uncertain of what advantages they offer.
Since ReLU neural networks are a combination of two kinds of linearity, one might expect for ReLU neural networks to behave nearly linearly. And it is not surprising that ReLU networks look more like the standard linear transformations than the tropical linear transformations since the standard linear transformations in a neural network are far more complex than the ReLU. ReLU just provides the bare minimum non-linearity for a neural network without doing anything fancy.
I think of tensors as homogeneous non-commutative polynomials. But I have found a way of reducing tensors that does not do anything for 2-tensors but which seems to work well for -tensors where . We can consider tensors as homogeneous non-commutative polynomials in a couple of different ways depending on whether we have tensors in or if we have . Let . Given a homogeneous non-commutative polynomial over the field , consider the fitness function defined by setting (there are generalizations of this fitness function) where denotes the spectral radius of the matrix . By locally maximizing , the matrices encode information about the tensor as long as has degree . This -spectral radius tensor dimensionality reduction seems to be well-behaved in the sense that if one runs this -spectral radius tensor dimensionality reduction multiple times, one will reach the same local maximum, so the notion of a -spectral radius tensor dimensionality reduction is pseudodeterministic. The L_{2,d}-spectral radius tensor dimensionality reduction seems to also be well-behaved in other aspects as well, and I hope that this dimensionality reduction becomes very useful for machine learning and AI safety.
Perhaps it is best to develop AI systems that we can prove theorems about in the first place. AI systems that we can prove theorems about are more likely to be interpretable anyways. Fortunately, there are quite a few theorems about maxima and minima of functions including uniqueness theorems including the following.
Theorem: (maximum principle) If is a compact set, and is an upper semicontinuous function that is subharmonic on the interior , then .
If is a bounded domain, and is a -function, then define the Dirichlet energy of as
.
Theorem: (Dirichlet principle) If is a compact subset of , then whenever is continuous and is a continuous function which is harmonic on and where , then is the -function that minimizes the Dirichlet energy subject to the condition that .
Theorem: (J Van Name: A version of the Cauchy-Schwarz inequality) Suppose that which have no invariant subspace and . Suppose furthermore that is the partial function where . Then and if and only if there is some non-zero complex number and invertible matrix with for . Here, denotes the spectral radius of a matrix .
Right now, there are some AI systems related to the above theorems. For example, if we are given a graph, then the smallest eigenvalues of the Laplacian of a graph form clusters in your graph, so spectral techniques can be used to analyze graphs, but I am working on developing spectral AI systems with more advanced capabilities. It will take some time for spectral AI systems to catch up with deep neural networks while retaining their mathematical properties (such as fitness functions with an almost unique local maximum), but it is plausible that the development of spectral AI systems is mostly a matter of engineering and hard work rather than a problem of possibility, but I am not sure whether spectral AI systems will be as energy efficient as neural networks once well developed.
A double exponential model seems very questionable. Is there any theoretical reason why you chose to fit your model with a double exponential? When fitting your model using a double exponential, did you take into consideration fundamental limits of computation? One cannot engineer transistors to be smaller than atoms, and we are approaching the limit to the size of transistors, so one should not expect very much of an increase in the performance of computational hardware. We can add more transistors to a chip by stacking layers (I don't know how this would be manufactured, but 3D space has a lot of room), but the important thing is efficiency, and one cannot make the transistors more efficient by stacking more layers. With more layers in 3D chips, most transistors will just be off most of the time, so 3D chips provide a limited improvement.
Landauer's principle states that to delete a bit of information in computing, one must spend at least energy where is Boltzmann's constant, is the temperature, and . Here, (Joules per Kelvin) which is not a lot of energy at room temperature. As the energy efficiency of computation approaches Landauer's limit, one runs into problems such as thermal noise. Realistically, one should expect to spend more than energy per bit deletion in order to overcome thermal noise (and this is ). If one tries to avoid Landauer's limit using reversible computation, then the process of computation becomes more complicated, so with reversible computation, one trades energy efficiency per bit operation with the number of operations computed, and the amount of space one uses in performing that computation. The progress in computational hardware capabilities will slow down as one progresses from classical computation to reversible computation. There are also ways of cutting the energy efficiency of deletion of information from per bit to something much closer to , but they seem like a complicated engineering challenge.
The Margolus-Levitin theorem states that it takes energy to go from a quantum state to an orthogonal quantum state (by flipping a bit, one transforms a state into an orthogonal state) where is Planck's constant ( (Joules times seconds)) and is the energy. There are other fundamental limits to the capabilities of computation.
I am not expecting any worldwide regulation on AI that prohibits people from using or training unaligned systems (I am just expecting a usual level of regulation). I am mainly hoping for spectral techniques to develop to the point where AI groups will want to use these spectral techniques (or some other method) more and more until they are competitive with neural networks at general tasks or at least complement the deficiencies of neural networks. I also hope that these spectral techniques will remain interpretable and aligned.
Right now, there are several kinds of tasks in which I would rather use spectral techniques than neural networks. I have been evaluating the cryptographic security of block ciphers with small message size and very small key size (for cryptocurrency research), and it seems like the spectral techniques that I have developed give consistent measures of security for such block ciphers (I am not done with the training yet) and these spectral techniques are better for cryptanalysis than neural networks. I have been able to use these spectral techniques for other problems such as the problem of finding the largest clique in a graph (this is not something that I would have expected before I did it), and right now these spectral techniques are the only way I know how to transform a non-commutative polynomial into something that other machine learning models can work with better.
Right now, I do not know how to use spectral techniques to replace deep neural networks. I do not know how to use spectral techniques to approximate a universal function and I do not know how to use spectral techniques to make machine learning models with many layers. I hope to be able to solve these problems of spectral techniques, but I agree that there will be a tradeoff between performance and interpretability. The goal is to make this tradeoff favor interpretable, aligned, and safe machine learning models.
I agree that interpretability research is risky, and that one should carefully consider whether it is worth it to perform this interpretability research. I propose that a safer alternative would be to develop machine learning models that are
- quite unrelated to the highest performing machine learning models (this is so that capability gains in these safer models do not translate very well to capability gains for the high performing models),
- as inherently interpretable as one can make these models,
- more limited in functionality than the highest performing machine learning models, but
- still functional enough to be useful in practice.
These characteristics are correlated with each other, so it is likely that one would get all of these characteristics at once. For example, one will most likely not get interpretability for free but instead trade interpretability for performance. Being unrelated to the highest performing machine learning models will also likely reduce the performance of the machine learning models.
It seems like spectral methods such as PageRank or the eigenvectors of the Laplacian of a graph satisfy some of these characteristics, but it would be good if there were more research into developing spectral methods further to increase their functionality.
Perhaps interpretability research has yielded such few results because we need to first develop inherently interpretable models.
I do not see any evidence that large language models are equipped to understand the structure behind prime numbers. But transformers along with other machine learning tools should be well-equipped to investigate other mathematical structures. In particular, I am thinking about the mathematical structures called Laver-like algebras that I have been researching on and off since about 2015.
I have developed an algorithm that is capable of producing new Laver-like algebras from old ones. From every Laver-like algebra, one can generate a sequence of non-commutative polynomials. From these non-commutative polynomials, one can recover the entire Laver-like algebra up to critical equivalence. Therefore, to use deep learning to investigate the structure of Laver-like algebras, we need a few things.
- We need a way of transforming the non-commutative polynomials into something that further layers of AI can work with. I suggest using my notion of an LSRDR that I developed for cryptographic purposes to transform these non-commutative polynomials into collections of matrices. These matrices are often (but not always) unique up-to-a constant scalar factor and orthogonal/unitary equivalence. But this is simply what I have come up with using my own methods. There are likely other ways that transform non-commutative polynomials into vectors that I have simply not thought of.
- We need a way of solving a machine learning problem from a sequence of vectors or matrices as input. This is where we can use transformers.
- We need a problem or problems to solve along with an algorithm for deciding which Laver-like algebras to go through next. One such problem would be to estimate the number of Laver-like algebras exist that satisfy certain conditions. Another such problem would be to determine whether an algebra arises from the rank-into-rank embeddings or not. Since transformers are good at working with a variety of problems, we may be able to train transformers to do a variety of tasks related to Laver-like algebras.
Let me know if you want more detailed information about this proposal.
While I believe that this is a sensible proposal, I do not believe there will be too much of a market (in the near future) for it for societal reasons. Our current society unfortunately does not have a sufficient appreciation for mathematics and mathematical theorems for this to have much of a market cap. To see why this is the case, we can compare this proposal to the proposal for cryptocurrency mining algorithms that are designed to advance science. I propose that a scientific cryptocurrency mining algorithm should attract a much larger market capitalization than a theorem-proof market, but since scientific cryptocurrencies do not have much of a market, one should not expect for shares of theorems to attract much of a market either.
A market for mathematical conjectures requires participants to not only actively understand the mathematics behind the conjectures, but to also understand these theorems in their complete formalism. Perhaps AI systems may be developed to translate the formalism into English, but such a system may be flawed and may be subject to hacking in the sense that the statement in English may not be exactly the same as the formal statement, and it may be really easy to prove the formal statement. Since such a market requires a substantial amount of effort from the users, it is unlikely that there will be many users who are willing to put in the effort into understanding the mathematics along with the formalism.
It is also likely for the market to favor simple propositions that are easy to understand such as the bounded gaps between primes conjecture. For example, Beal's conjecture states that there are no positive integers with and and where have no common factor. This conjecture was formulated by the billionaire Michael Beal. Beal attached a million dollar prize to this conjecture. This conjecture is a generalization of Fermat's last theorem. Beal's conjecture is a reasonably interesting conjecture, but it seems like Beal's conjecture is the very best mathematics problem that amateur mathematicians can come up with and fund with one's own money. This means that the market for mathematical theorems will be small and will mainly for theorems that anyone can understand such as the Twin Prime conjecture. It would be more satisfying if the market for mathematical theorems better represents the type of problems and questions that your typical mathematician were interested in.
Cryptocurrency mining algorithms can be used to solve scientific problems, but the users of the cryptocurrency do not even have to care very much about the underlying scientific problem. This is because a cryptocurrency is much more than its mining algorithm, and the details about the mining algorithm are relatively unimportant to the functionality of the cryptocurrency. If chosen correctly, the underlying scientific/mathematical problem does not harm or interfere very much with the cryptocurrency, so there is really no excuse for the cryptocurrency community to waste all those resources on cryptocurrencies on non-scientific problems because the cryptocurrency community does not have to. The cryptocurrency community needs to have a desire to not waste energy along with the philosophy that solving the most important scientific problem is profitable (this would be a self-fulfilling prophecy). The cryptocurrency community should also have a basic awareness of mining algorithms that are designed to solve a scientific problem (they do not need to know the specific details). This basic awareness should be part of the due diligence that every cryptocurrency enthusiast should partake in order to try to maximize profit, so this understanding should be natural for the cryptocurrency sector. Since there is still only one dominant minable cryptocurrency (namely Bitcoin), one only needs one scientific mining algorithm, so the cryptocurrency community does not even need to understand multiple scientific mining algorithms. They just need to know one. It is easier to know about one cryptocurrency mining algorithm than it is to know the market value of many theorems.
Right now Bitcoin has a market capitalization of about 500 billion dollars, and if Bitcoin was set up correctly, it could have had a mining algorithm that was designed to solve a very important scientific problem without the cryptocurrency community even thinking about this. Unfortunately, Bitcoin's mining algorithm was never designed to advance science, and at this point, the cryptocurrency community is not very interested (due to anti-intellectualism) in using cryptocurrency technologies to solve any scientific problem. The cryptocurrency community to a great extent has no concept or understanding of Bitcoin or what it means for a cryptocurrency mining algorithm to solve a scientific problem. And those who think they understand what it means for a cryptocurrency mining algorithm to be designed to solve a scientific problem are quite antagonistic to cryptocurrency mining being used for any scientific purpose. Since the cryptocurrency community hates using cryptocurrency mining for a scientific purpose so much, one should expect for any market for proofs of theorems to be very small and undeveloped.
Nobody should expect for there to be a sizable theorem/proof market when that requires ~100 times as much effort from users than a market for cryptocurrencies with scientific mining algorithms and when there is no sizable scientific mining algorithm market.
With everything being said, we make a tradeoff with cryptocurrencies with scientific mining algorithms. Cryptocurrency mining algorithms such as SHA-256 must satisfy stringent conditions, so there are limited options in designing a cryptocurrency with a mining algorithm to advance science, and it is hard to fork a cryptocurrency to replace its scientific problem.
I would rather see people work on making scientific cryptocurrency mining more of a reality than imagining a market that is unlikely to exist any time soon because scientific cryptocurrencies are more feasible with the current state of the world.
You claim that Terry Tao says that Benford's law lack's a satisfactory explanation. That is not correct. Terry Tao actually gave an explanation for Benford's law. And even if he didn't, if you are moderately familiar with mathematics, an explanation for Benford's law would be obvious or at least a standard undergraduate exercise.
Let be a random variable that is uniformly distributed on the interval for some . Let denote the first digit of . Then show that .
For all , let be a uniformly distributed random variable on the interval . Let denote the first digit of . Suppose that . Prove that .
Terry Tao in that blog post lists the characteristics that imply Zipf's law here:
"These laws govern the asymptotic distribution of many statistics which
- (i) take values as positive numbers;
- (ii) range over many different orders of magnitude;
- (iiii) arise from a complicated combination of largely independent factors (with different samples of arising from different independent factors); and
- (iv) have not been artificially rounded, truncated, or otherwise constrained in size."
Terry Tao merely stated that Benford's law cannot be proven the same way mathematical theorems are proven; "Being empirically observed phenomena rather than abstract mathematical facts, Benford’s law, Zipf’s law, and the Pareto distribution cannot be “proved” the same way a mathematical theorem can be proved. However, one can still support these laws mathematically in a number of ways, for instance showing how these laws are compatible with each other, and with other plausible hypotheses on the source of the data."
I made the Latex compile by adding a space. Let me know if there are any problems.
Is the Latex compiling here?