post by [deleted] · GW


Comments sorted by top scores.

comment by ZankerH · 2022-01-04T09:57:51.270Z · LW(p) · GW(p)

The issue with MNIST is that everything works on MNIST, even algorithms that utterly fail on a marginally more complicated task. It's a solved problem, and the fact that this algorithm solves it tells you nothing about it.

If the code is too rigid or too slow to be tested on larger or different tasks, I suggest F-MNIST (fashion MNIST), which uses the exact same data format, has the same number of categories and the same number of data points, but is known to be far more indicative of the true performance of modern machine learning approaches.


https://github.com/zalandoresearch/fashion-mnist

Replies from: lsusr
comment by lsusr · 2022-01-04T10:30:07.399Z · LW(p) · GW(p)

I like this idea. It seems to me like a fair test. I will run the code overnight with default settings and see what happens.

Replies from: lsusr
comment by lsusr · 2022-01-04T10:54:04.852Z · LW(p) · GW(p)

Initial results indicate the code performs poorly on F-MNIST. It is possible this is a hyperparameter-tuning issue but my default conclusion is that MNIST (created in 1998, before the invention of modern GPUs) is just too easy.

testing 20000: 0: 44.413  1: 43.77
testing 40000: 0: 23.702  1: 23.34
testing 60000: 0: 21.822  1: 20.92
testing 80000: 0: 38.627  1: 37.75
testing 100000: 0: 25.107  1: 24.85
testing 120000: 0: 28.893  1: 28.34
testing 140000: 0: 29.203  1: 29.50
testing 160000: 0: 40.437  1: 39.55
testing 180000: 0: 49.828  1: 48.59
testing 200000: 0: 39.510  1: 39.81
testing 220000: 0: 39.938  1: 40.10
testing 240000: 0: 32.390  1: 31.65
testing 260000: 0: 25.367  1: 24.51
testing 280000: 0: 29.198  1: 28.66
testing 300000: 0: 29.823  1: 28.91
testing 320000: 0: 35.178  1: 34.20
testing 340000: 0: 32.370  1: 31.94
testing 360000: 0: 29.083  1: 28.31
testing 380000: 0: 30.117  1: 30.05
testing 400000: 0: 39.125  1: 39.21

Replies from: crabman, lsusr, p.b.
comment by philip_b (crabman) · 2022-01-04T15:06:49.475Z · LW(p) · GW(p)

Btw, a multilayer perceptron (which is a permutation-invariant model) with 230,000 parameters and, AFAIK, no data augmentation used, can achieve 88.33% accuracy on FashionMNIST.
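For reference, an MLP's parameter count is easy to sanity-check. A hypothetical 784-256-128-10 architecture (the exact shape behind the 88.33% figure is not stated in the comment) lands in the same ballpark as 230,000:

```python
# Back-of-the-envelope parameter count for a fully connected MLP:
# each layer contributes (fan_in + 1) * fan_out weights and biases.
def mlp_params(layer_sizes):
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical 784-256-128-10 network for 28x28 FashionMNIST images:
n = mlp_params([784, 256, 128, 10])
assert n == 235_146   # roughly the 230k quoted above
```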

Replies from: D𝜋
comment by D𝜋 · 2022-01-06T08:09:42.664Z · LW(p) · GW(p)

I doubt that this is the best an MLP can achieve on F-MNIST.

I will put it this way: SONNs and MLPs do the same thing, in a different way. Therefore they should achieve the same accuracy. If this SONN can get near 90%, so should MLPs. 

It is likely that nobody has bothered to try 'without convolutions' because it is so old-fashioned.

Convolutions are for repeated locally aggregated correlations.

comment by lsusr · 2022-01-05T03:54:22.108Z · LW(p) · GW(p)

Update: It gets higher if you run it for long enough.

testing 59300000: 0: 58.615  1: 57.36
testing 59320000: 0: 50.902  1: 50.28
testing 59340000: 0: 68.415  1: 66.67
testing 59360000: 0: 71.813  1: 69.36
testing 59380000: 0: 70.275  1: 68.53
testing 59400000: 0: 71.577  1: 68.67
Replies from: D𝜋, lsusr
comment by D𝜋 · 2022-01-05T09:19:02.298Z · LW(p) · GW(p)

Update: 3 runs (2 random), 10 million steps. All three over 88.33 (average over the 3 runs at 9.5-10.5 million steps: 88.43). New SOTA? Please check and update.

Update 2: 89.85 at step 50 million with quantUpP = 3.2 and quantUpN = 39. It does perform very well. I will leave it at that. As said in my post, those are the two important parameters (no, it is not a universal super-intelligence in 600 lines of code). Be rational, and think about what the fact that this mechanism works so well means (I am talking to everybody, there).

I looked at it, the informed way.

It gets over 88% with very limited effort.

As I pointed out, the two datasets are similar in their technical description, but the data are 'reversed'.

MNIST is black dots on white background. F-MNIST is white dots on black background. The histograms are very different.
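A sketch of what the reversal amounts to (illustrative only, not part of the posted code): mapping v → 255 − v swaps the two background conventions, and applying it twice returns the original image.

```python
# Grey-shade inversion: maps one background convention onto the other.
def invert(pixels):
    return [255 - v for v in pixels]

row = [0, 64, 128, 255]
assert invert(row) == [255, 191, 127, 0]
assert invert(invert(row)) == row   # inversion is its own inverse
```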

I tried to make it work despite that, just with parameter changes, and it does.

Here are the changes to the code:

on line 555: quantUpP = 1.9 ;

on line 556: quantUpN = 24.7 ;

with rand(1000), as it is in the code, you already clear 86% at step 300,000 and 87% at step 600,000 and 88% at 3 Million.

I had made another small, irrelevant change in my full tests, so I am running the full tests again without it (the values/steps above are from that new series). It seems to be better again without it... maybe a new SOTA (update: touched 88.33% at step 4,800,000! ... and 88.5 at 6.8 million! "MLPs perform poorly when applied to data even slightly more complicated than MNIST")

I do not understand all the hype around MNIST. Once again, this is PI-MNIST, and that makes it very different (to put it simply: no geometry, so no convolution).

I would like anybody to give me a reference to some 'other method that worked on MNIST but did not make it further', that uses PI-MNIST and gets more than 98.4% on it.

And if anybody tries it on yet another dataset, could they please notify me so I look at it, before they make potentially damaging statements.

Replies from: maxime-riche
comment by Maxime Riché (maxime-riche) · 2022-01-05T10:35:31.664Z · LW(p) · GW(p)

Here, with 2 convolutional layers and fewer than 100k parameters, the accuracy is ~92%. https://github.com/zalandoresearch/fashion-mnist

SOTA on Fashion-MNIST is >96%. https://paperswithcode.com/sota/image-classification-on-fashion-mnist

Replies from: D𝜋
comment by D𝜋 · 2022-01-05T10:50:17.150Z · LW(p) · GW(p)

No convolution.

You are comparing apples and oranges.

I have shared the base because it has real scientific (and philosophical) value.

Geometry and the rest are separate, and of lesser scientific value; they are more technology.

Replies from: ZankerH
comment by ZankerH · 2022-01-05T12:52:42.948Z · LW(p) · GW(p)

Your result is virtually identical to the first-ranking unambiguously permutation-invariant method (MLP 256-128-100). HOG+SVM does even better, but it's unclear to me whether that meets your criteria.

Could you be more precise about what kinds of algorithms you consider it fair to compare against, and why?

Replies from: D𝜋
comment by D𝜋 · 2022-01-06T13:42:56.129Z · LW(p) · GW(p)

I am going after pure BP/SGD: neural networks (no SVMs), no convolutions,...

No pre-processing either. That is changing the dataset.

It is just a POC, to make a point: you do not need mathematics for AGI. Our brain does not.

I will publish a follow-up post soon.

Replies from: D𝜋
comment by D𝜋 · 2022-01-07T07:28:16.947Z · LW(p) · GW(p)

Also,

No regularisation. I wrote about that in the analysis.

Without max-norm (or maxout, ladder, VAT: all forms of regularisation), BP/SGD only achieves 98.75% (from the 2014 dropout paper).

Regularisation must come from outside the system (SO can be seen that way) or through local interactions (neighbours). Many papers clearly suggest that should improve the result.

That is yet to be done.

Replies from: crabman
comment by philip_b (crabman) · 2022-01-07T14:37:19.835Z · LW(p) · GW(p)

What is BP in BP/SGD?

So, as I see it, there are three possible fairness criteria defining what we can compare your model with.

  1. Virtually anything goes - convolutions, CNNs, pretraining on imagenet, ...
  2. Permutation-invariant models are allowed, everything else is disallowed. For instance, MLPs are ok, CNNs are forbidden, tensor decompositions are forbidden, SVMs are ok as long as the transformations used are permutation-invariant. Pre-processing is allowed as long as it's permutation-invariant.
  3. The restriction from criterion 2 applies. Also, the model must be biologically plausible, or, shall we say, similar to the brain. (Or maybe similar to how a potential brain of another creature might be? Not sure.) This rules out SGD, regularization that uses vector norms, etc. Strengthening neuron connections based on something that happens locally is allowed.

Personally, I know basically nothing about the landscape of models satisfying the criterion 3.

Replies from: D𝜋
comment by D𝜋 · 2022-01-07T15:43:52.824Z · LW(p) · GW(p)

BP is Back-Propagation.

We are completely missing the plot here. 

I had to use a dataset for my explorations and MNIST was simple; and I used PI-MNIST to show an 'impressive' result so that people have to look at it. I expected the 'PI' to be understood, and it is not. Note that I could readily answer the 'F-MNIST challenge'.

If I had just expressed an opinion on how to go about AI, the way I did in the roadmap, it would have been just, rightly, ignored. The point was to show it is not 'ridiculous' and the system fits with that roadmap.

I see that your last post is about complexity science [LW · GW]. This is an example of it. The domain of application is nature. Nature is complex, and maths has difficulties with complexity. The field of chaos theory puttered out in the 80s for that reason. If you want to know more about it, start with Turing's work on morphogenesis (read the conclusion), then Prigogine. In NNs, there is Kohonen.

Some things are theoretically correct, but practically useless. You know how to win the lotto, but nobody does it. Better something simple that works and can be reasoned about, even without a mathematical theory. AI is not quantum physics.

Maybe it could be said that intelligence is to cut through all the details to, then, reason using what is left, but the devil is in those details.

comment by lsusr · 2022-01-05T22:32:11.939Z · LW(p) · GW(p)

[Duplicate Comment.]

comment by p.b. · 2022-01-04T16:20:03.111Z · LW(p) · GW(p)

It would be surprising to me if the algorithm really performed this poorly on fashion mnist. F-MNIST is harder, but (intentionally) very similar to MNIST. 

CIFAR, maybe with limited categories, would be a logical "hard" test IF it can be made to work on F-MNIST.

On the other hand (without claiming that I understand the ins and outs of the algorithm), I could imagine that, out of the neuro-inspired playbook, it misses the winner-takes-all competition between neurons that allows modelling of multi-modal distributions and possibly allows easier distinction of non-linearly-separable datapoints.

Replies from: D𝜋
comment by D𝜋 · 2022-01-04T17:06:44.722Z · LW(p) · GW(p)

See my comment on reversing the shades on F-MNIST. I will check it later, but I see it gets up to 48% in the 'wrong' order, and that is surprisingly good. I worked on CIFAR, but that is another story. As-is it gives bad results, and you have to add other 'things'.

As you guessed, I belong to the neuro-inspired branch and most of my 'giants' belong there. I strongly expected, when I started my investigations, to use some of the works that I knew and appreciated along the lines you are mentioning, and I investigated some of them early on.

To my surprise, I did not need them to get to this result, so they are absent.

The two-neuronal-layer form of the neocortex is where they will be useful. This is only one layer.

Another (bad) reason is that they add to the GPU hell... which has limited my investigations. It is an identified source of potential improvement.

comment by Bucky · 2022-01-05T11:24:04.361Z · LW(p) · GW(p)

I think there's a mistake which is being repeated in a few comments both here and on D𝜋's post which needs emphasizing. Below is my understanding:

D𝜋 is attempting to create a general intelligence architecture. He is using image classification as a test for this general intelligence but his architecture is not optimized specifically for image identification.

Most attempts on MNIST use what we know about images (especially the importance of location of pixels) and design an architecture based on those facts. Convolutions are an especially obvious example of this. They are very effective at identifying images but the fact that we are inserting some of our knowledge of images into the algorithm precludes it from being a general intelligence methodology (without a lot of modification at least).

The point of using PI-MNIST (where the pixel locations are scrambled by a fixed random permutation) is that we can't use any of our own understanding of images to help with our model, so a model which is good at PI-MNIST is proving a more general intelligence than a model which is good at MNIST.

That is why D𝜋 keeps on emphasizing that this is PI-MNIST.
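The PI setting can be sketched as follows (illustrative names; one random permutation of the pixel positions, chosen once, is applied to every image, so values survive but spatial layout does not):

```python
import random

def make_permutation(n_pixels=784, seed=0):
    """One fixed permutation for the whole dataset."""
    order = list(range(n_pixels))
    random.Random(seed).shuffle(order)
    return order

def permute(image, order):
    return [image[i] for i in order]

order = make_permutation(n_pixels=8, seed=42)   # tiny 8-'pixel' example
image = [10, 20, 30, 40, 50, 60, 70, 80]
shuffled = permute(image, order)
# Same multiset of pixel values, different (but consistent) positions:
assert sorted(shuffled) == sorted(image)
assert permute(image, order) == shuffled        # same permutation every time
```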

Replies from: D𝜋, tailcalled
comment by D𝜋 · 2022-01-05T12:20:43.652Z · LW(p) · GW(p)

Spot on.

I hope your explanation will be better understood than mine. Thank you.

It 'so happens' that MNIST (but not PI) can also be used for basic geometry. That is why I selected it for my exploration (easy switch between the two modes).

comment by tailcalled · 2022-01-05T13:08:52.732Z · LW(p) · GW(p)

I think if one wants to test general intelligence, one should throw the algorithm at some problem that requires general intelligence. E.g. if it could reach SOTA on text prediction, that'd be impressive. But I think it would very badly fail at even approaching it, and I don't see any obvious way to improve it.

Replies from: Bucky
comment by Bucky · 2022-01-05T14:11:48.779Z · LW(p) · GW(p)

I suppose it depends how general one is aiming to be. If by general intelligence we mean "able to do what a human can do" then no, at this point the method isn't up to that standard.

If instead we mean "able to achieve SOTA on a difficult problem which it wasn't specifically designed to deal with" then PI-MNIST seems like a reasonable starting point.

Also, from a practical standpoint PI-MNIST seems reasonable for a personal research project.

I do think D𝜋's original post felt like it was overstating its case. From a later comment [LW(p) · GW(p)] it seems they see it more as a starting point to add further steps onto in order to achieve a more general intelligence (i.e. not just a scaling-up of the same thing). So instead of paradigms which are MLP + others or DBM + others, we would have S(O)NN + others.

comment by D𝜋 · 2022-01-04T15:26:41.862Z · LW(p) · GW(p)

I just discovered the 'pingback' feature on LessWrong...

I gave your description a first read. Most of it is correct. I will check in more detail.

I used the terms 'total' and 'groups' to make things simpler, but yours are better.

Four corrections:

1.

The potential of a neuron can be negative. It is the pure sum of all weights, positive and negative. There is no 'negative spiking' (it is one of the huge number of things I tried that did not bring any benefit). I think I remember trying to set a bottom limit at 0 (no negative potential) and that, as always, it did not make any real difference...

2.

'Our system thus has four receptors per pixel. Exactly one receptor activates per pixel per image' is incorrect.

The MNIST pixels are grey shades 0-255.

They are reduced down to 4 levels plus zero: 0, 1-63, 64-127, 128-191, >=192 (essentially, only the top 2 bits are kept). That is enough for MNIST. Many papers have noted that the depth can be reduced, and it is true.

Images are presented over 4 'cycles', filtered by those 4 limits. In the first cycle, only pixels with a value over 191 are presented; in the second, those over 127; in the third, over 63; in the last, over 0 (non-null).

In the code, the array 'wd' contains the 4 limits, and at each cycle the pixel values are tested to be greater than or equal to those limits.

Connections are established with matrix pixels. Over the 4 cycles, they are presented with 4 successive 0s or 1s.

If a connection is on a pixel with value 112, it will not be active on the first cycle (<192) nor on the second (<128), but it will be on the third and fourth (>63 and >0).

That is what allows the 'model averaging across cycles'.
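The cycle test described above can be sketched as follows (the 'wd' limits are taken from the comment; the function name is illustrative):

```python
# The array 'wd' holds the four cycle limits; a pixel is active in a cycle
# when its grey value is at least that cycle's limit.
wd = [192, 128, 64, 1]

def cycle_activity(pixel_value):
    return [pixel_value >= limit for limit in wd]

# The worked example from the comment: a pixel of value 112 is inactive in
# the first two cycles and active in the third and fourth.
assert cycle_activity(112) == [False, False, True, True]
assert cycle_activity(0) == [False, False, False, False]   # 0 never shown
assert cycle_activity(200) == [True, True, True, True]
```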

From there, you can understand a first, fatal, reason why F-MNIST cannot be processed as-is: the grey shades are reversed. In MNIST, the background is white, in F-MNIST, it is black. So the cycle limits would have to be reversed.

I will have a look at it.

3.

The ⊥ computation includes the ⊤ column.

Note that the 'highest' ⊥ selection can be easily implemented using population coding with inhibition of the ⊤ column.

I do not know whether the options I used are only valid for this dataset or whether they have wider validity across datasets, as I have only used this one. Maybe they do, and you won't have to figure it out each time.

4.

When a new connection is established, the initial weight is always the same. It is given as a fraction of the threshold by the variable 'divNew', which is the divisor. You can use random values instead; it does not make a difference. You can change it to another value, but the divisor has to be small enough that the number of connections of a neuron, multiplied by the number of cycles and the initial weight, exceeds the threshold, or the system will never 'boot', as no neuron would ever spike. So I use 1/10 of the threshold with 10 connections and 4 cycles, and it is fine.
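The 'boot' condition can be checked with a quick back-of-the-envelope calculation (the threshold value here is an assumed placeholder; divNew, the connection count, and the cycle count are from the comment):

```python
threshold = 1_000_000   # illustrative spiking threshold
divNew = 10             # initial weight = threshold / divNew, as described
n_connections = 10
n_cycles = 4

initial_weight = threshold // divNew
# A neuron must be able to reach its threshold from new connections alone:
assert n_connections * n_cycles * initial_weight > threshold   # 4x here
```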

comment by jessicata (jessica.liu.taylor) · 2022-01-04T21:15:53.297Z · LW(p) · GW(p)

Instead, D𝜋 uses a simple algorithm called a quantilizer.

Note that this is a different algorithm from the original quantilizers.

Replies from: D𝜋
comment by D𝜋 · 2022-01-05T09:51:03.425Z · LW(p) · GW(p)

That is correct.

I am referring to that paper as a vindication of the concept, but I do not use the quantiliser algorithm provided.

The one I use I devised on my own, a long time ago, with the thought experiment described, but it has since been mathematically studied. Actually, when I searched for it and found it the first time, it was in a much simpler version, but I cannot find that one again now... 

I have not been down to every detail of lsusr's rewrite yet, just the main corrections to the description of the mechanism. I had to do F-MNIST first.

Side note: the thought experiment describes a mechanical system. Why should it be called an algorithm when implemented in code? Because that makes it un-patentable? I am not sure a super-intelligent AI could understand human politics.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-01-05T17:24:21.438Z · LW(p) · GW(p)

When I saw D𝜋's original post one of my first thoughts was "Maybe I should ask Lsusr to look into this and try to replicate it." Well done!

Replies from: lsusr
comment by lsusr · 2022-01-05T23:04:47.255Z · LW(p) · GW(p)

Thanks. I also thought that Lsusr should look into this and try to replicate it.

comment by tailcalled · 2022-01-04T08:55:58.329Z · LW(p) · GW(p)

In the original article, D𝜋 claims:

When a neuron spikes, the connections that have contributed to it are reinforced. It is Hebb’s rule. In this system, we do the same.

But this doesn't seem to be present in your description? Unless I missed it.

Replies from: D𝜋, lsusr
comment by D𝜋 · 2022-01-04T15:31:17.315Z · LW(p) · GW(p)

It is not, per se, Hebb's rule. Hebb's rule is very general. I personally see this as belonging to it, that's all. I give attribution where I think it is deserved.

Replies from: D𝜋
comment by D𝜋 · 2022-01-06T13:58:35.939Z · LW(p) · GW(p)

... and it is in this description:

"The spiking network can adjust the weights of the active connections"

comment by lsusr · 2022-01-04T10:28:24.845Z · LW(p) · GW(p)

You didn't miss it. Your quote and the later bit about "The question to ask is not ‘how’ to learn, but ‘when’." seem to contradict each other. I think your quote is just a general allusion to Hebb's rule and that it's not meant to be taken as a literal system spec, but I could be wrong. I am confused by the original description.

Replies from: D𝜋
comment by D𝜋 · 2022-01-04T16:09:55.495Z · LW(p) · GW(p)

That actually brings us to the core of it.

The way I phrased that was, deliberately, ambiguous.

Since 1958, the question the field has been trying to answer is how to transfer the information we get when a sample is presented to the weights, so that next time it will perform better.

BP computes the difference between what would be expected and what is measured, then propagates it to all intermediary weights according to a set of mathematically derived rules (the generalised delta rule). A lot of work has gone into figuring out the best way to do that. This is what I called 'how' to learn.

In this system, the method used is just the simplest, most intuitive one possible: INC and DEC of the weights, depending on whether or not it is the correct answer.

The quantiliser then tells the system to only apply that simple rule under certain conditions (the Δ⊤ and Δ⊥ limits). That is 'when'.

You can use the delta rule instead of our basic update rule if you want (I tried). The result is no better and it is less stable, so you have to use small gradients. The problem, as I see it, is that the conditions under which Jessica Taylor's theorem is valid are no longer met, and you have to 'fix' that. I did not investigate that extensively.
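The 'how'/'when' split might be sketched like this (illustrative names, not the posted code): the update itself is a bare INC/DEC, and a quantiliser-style gate decides whether it fires at all.

```python
def maybe_update(weights, active, is_correct, gate_open, inc=1.0, dec=1.0):
    """Apply the simplest possible rule, but only 'when' the gate allows."""
    if not gate_open:                         # the quantiliser's 'when'
        return weights
    delta = inc if is_correct else -dec       # the trivial 'how': INC or DEC
    return [w + delta if a else w for w, a in zip(weights, active)]

w = maybe_update([0.5, 0.5, 0.5], active=[True, False, True],
                 is_correct=True, gate_open=True)
# Only the active connections move:
assert w == [1.5, 0.5, 1.5]
# A closed gate leaves everything untouched:
assert maybe_update(w, [True, True, True], True, gate_open=False) == w
```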

comment by Nicholas / Heather Kross (NicholasKross) · 2022-01-04T05:36:08.920Z · LW(p) · GW(p)

It'd be cool to see benchmarks (accuracy + performance) compared to MLP-based ANNs.

Replies from: D𝜋
comment by D𝜋 · 2022-01-06T13:54:58.161Z · LW(p) · GW(p)

It is not a toolbox you will be using tomorrow.

I applied it to F-MNIST, in a couple of hours after being challenged, to show that it is not just MNIST. I will not do it again; that is not the point.

It is a completely different approach to AGI, one that sounds so ridiculous that I had to demonstrate that it is not, by getting near SOTA on one widely used dataset (hence PI-MNIST) and finding relevant mathematical evidence.

comment by Kenoubi · 2022-01-17T20:15:34.008Z · LW(p) · GW(p)

One point which is never mentioned (as far as I can see) either in D𝜋's article or in this one, but which is present in the code, is the minimum threshold for learning. D𝜋's code only learns if the neuron's potential is already above a certain minimal value (300,000 in the code as posted). D𝜋 says the maximum threshold is optional, but in my experiments, I found the minimum threshold absolutely crucial for the network to ever get to high levels of accuracy. Seems like otherwise a neuron gets distracted by a bunch of junk updates that aren't really related to its particular purpose.

Replies from: D𝜋
comment by D𝜋 · 2022-01-18T20:23:06.348Z · LW(p) · GW(p)

Welcome aboard this IT ship, to boldly go where no one has gone before!

Indeed, I just wrote 'when it spikes' and, further on, 'the low threshold', and no more. I work in complete isolation, and some things are so obvious inside my brain that I do not consider that they are not obvious to others.

It is part of the 'when' aspect of learning, but uses an internal state of the neuron instead of external information from the quantilisers.

If there is little reaction to a sample in a neuron (spiking happens slowly, or not at all), it is meaningless and you should ignore it. If it comes too fast, it is already 'in' the system and there is no point in adding to it. You are right to say the first rule is more important than the second.

Originally, there was only one threshold instead of 3. When learning, the update would only take place if the threshold was reached after a minimum of two cycles (or 3, but then it converges unbearably slowly), and only for the connections that had been active at least twice. I 'compacted' it for use within one cycle (to make it look simpler), so it became a minimum of 50% of the threshold. I then adjusted (might as well) that value by scanning around, and added the upper threshold, more to limit the number of updates than to improve the accuracy (although it contributes a small bit). The best result is with 30% and 120%, whatever the size or the other parameters.
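The two-sided gate described above might be sketched as follows (the threshold value is an assumed placeholder; the 30% and 120% fractions are from the comment):

```python
def should_learn(potential, threshold, low=0.30, high=1.20):
    """Gate learning on the neuron's internal state: ignore samples the
    neuron barely reacts to, and samples it already handles too easily."""
    return low * threshold <= potential <= high * threshold

threshold = 1_000_000          # assumed scale
assert not should_learn(100_000, threshold)    # too little reaction: ignore
assert should_learn(500_000, threshold)        # meaningful reaction: learn
assert not should_learn(1_300_000, threshold)  # already 'in' the system
```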

Before writing this, I quickly checked on PI-F-MNIST. It is still ongoing, but it seems to hold true even on that dataset (BTW: use quantUpP = 3.4 and quantUpN = 40.8 to get to 90.2% with 792 neurons and 90.5% with 7920).

As you seem interested, feel free to contact me through private message. There is plenty more in my bag than can fit in a post or comment. I can provide you with more complete code (this one is triple distilled).

Thank you very much for your interest.

comment by Measure · 2022-01-04T14:51:21.175Z · LW(p) · GW(p)

For each ⊥i column, if Δ⊥i is in the top 0.66% of those seen before then we perform negative reinforcement.

Do we negatively reinforce False columns that are especially distant from the True column? That seems backward.

Replies from: D𝜋
comment by D𝜋 · 2022-01-04T15:40:29.492Z · LW(p) · GW(p)

'Those seen before' are values of Δ⊥i across all samples seen before, not within a sample.
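That clarification might be sketched as follows (illustrative class, not the posted code): each ⊥ column keeps its own history of Δ⊥ values across past samples, and flags the current value only when it lands in the top fraction of that history.

```python
import bisect

class ColumnQuantileGate:
    """Per-column running quantile test over past samples' Δ⊥ values."""
    def __init__(self, top_fraction=0.0066):
        self.history = []                 # sorted Δ⊥ values, past samples
        self.top_fraction = top_fraction

    def reinforce_negatively(self, delta):
        """True if `delta` is in the top fraction of past values; the
        value is then recorded for future samples."""
        rank = bisect.bisect_left(self.history, delta)  # values below delta
        n = len(self.history)
        in_top = n > 0 and (n - rank) / n < self.top_fraction
        bisect.insort(self.history, delta)
        return in_top

gate = ColumnQuantileGate(top_fraction=0.01)
for d in range(1000):                     # build a history of 0..999
    gate.reinforce_negatively(d)
assert gate.reinforce_negatively(995)      # top 1% of history: reinforce
assert not gate.reinforce_negatively(500)  # median value: do nothing
```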