Automatic programming, an example

post by Thomas · 2012-02-01T20:55:18.448Z · LW · GW · Legacy · 36 comments

Say, that we have the following observational data:

 

Planet Aphelion
000 km
Perihelion
000 km
Orbit time
days
Mercury 69,816 46,001 88
Venus 108,942 107,476 225
Earth 152,098 147,098 365
Mars 249,209 206,669 687
Jupiter 816,520 740,573 4,332
Saturn 1,513,325 1,353,572 10,760
Uranus 3,004,419 2,748,938 30,799
Neptune 4,553,946 4,452,940 60,190
Pluto 7,311,000 4,437,000 90,613

 

The minimal, the maximal distance between a planet and the Sun (both in thousands of kilometres) and the number of (Earth) days for one revolution around the Sun. Above is only the empirical data and no binding algorithm among the three quantities. The celestial mechanics rules which go by the name of the Kepler's laws. Can those rules be (re)invented by a computer program and how?

The following program code will be put into a simulator:

//declarations of the integer type variables
$DECLAREINT bad perihelion aphelion orbit guess dif temp zero temp1

//table with the known data in a simulator friendly format
$INVAR perihelion(46001) aphelion(69816) orbit(88)
$INVAR perihelion(107476) aphelion(108942) orbit(225)
$INVAR perihelion(147098) aphelion(152098) orbit(365)
$INVAR perihelion(206669) aphelion(249209) orbit(687)
$INVAR perihelion(740573) aphelion(816520) orbit(4332)
$INVAR perihelion(1353572) aphelion(1513325) orbit(10760)
$INVAR perihelion(2748938) aphelion(3004419) orbit(30799)
$INVAR perihelion(4452940) aphelion(4553946) orbit(60190)
$INVAR perihelion(4437000) aphelion(7311000) orbit(90613)

// variables orbit and bad can't be touched by the simulator
//to avoid a degeneration to a triviality
$RESVAR orbit bad

//do NOT use if clause, while clause do not set direct numbers ...
$RESCOM if while val_operation inc_dec

//bad is the variable, by which the whole program will be judged
//a big value of bad is bad. By this criteria programs will be wiped out
//from their virtual existence. A kind of anti-fitness
$PENVAL bad

//do show the following variables when simulating
$SHOWVAR bad,orbit,guess,dif

//penalize any command with 0 (nothing) and every line by 1 point
$WEIGHTS commands=0 lines=1

//minimize the whole program to 20 lines or less
$MINIMIZE lines 20

$BES
//the arena, where algorithms will be
//created and the fittest only will survive
$EES

//testing area where the simulator has no write access to
//here the bad (the penalized variable) is calculated
//bigger the difference between the known orbit and the variable guess
//worse is the evolved algorithm
dif=orbit-guess;
dif=abs(dif);
bad=dif;
temp=dif;
temp*=10000;
temp1=temp/orbit;
temp=temp1*temp1;
bad=bad+temp;
//end of the testing area

 

After several hours the following C code has been evolved inside of the $BES - $EES segment.

 

aphelion=perihelion+aphelion;
aphelion=aphelion+aphelion;
aphelion=aphelion+aphelion;
guess=12;
aphelion=aphelion>>guess;
temp=aphelion/guess;
aphelion=aphelion-temp;
dif=sqrt(aphelion);
aphelion=guess|aphelion;
aphelion=aphelion*dif;
aphelion=guess^aphelion;
guess=aphelion/guess;

 

What the simulator does? It bombards the arena segment with a random C commands. Usually it then just notices a syntax error and repairs everything to the last working version. If everything is syntactically good, the simulator interprets the program and checks if the mutated version causes any run-time error like division by zero, a memory leak and so on. In the case of such an error it returns to the last good version. Otherwise it checks the variable called "bad", if it is at least as small as it was ever before. In the case it is, a new version has just been created and it is stored.

The evolutionary pressure is working toward ever better code, which increasingly well guesses the orbit time of nine planets. In this case the "orbit" variable has been under the $RESVAR clause and then the "gues" variable has been tested against the "orbit" variable. Had been no "$RESVAR orbit" statement, a simple "guess=orbit;" would evolve quickly. Had been no "$RESVAR bad" statement a simple "bad=-1000000;" could derail the process.

Many thousands of algorithms are born and die every second on a standard Windows PC inside this simulator. Million or billion generations later, the digital evolution is still running, even if an excellent solution has been already found.

And how good approximation for the Kepler (Newton) celestial mechanics of the Solar system we have here?

This good for the nine planets where the code evolved:

Planet Error %
Mercury 0.00
Venus 0.44
Earth 0.27
Mars 0.29
Jupiter 0.16
Saturn 0.65
Uranus 0.10
Neptune 0.79
Pluto 1.08

 

And this good for the control group of a comet and six asteroids:

 

Asteroid/Comet Error %
Halley 1.05
Hebe 1.37
Astraea 1.99
Juno 3.19
Pallas 1.66
Vesta 2.49
Ceres 2.02

 

Could be even much better after another billion generations and maybe with even more $INVAR examples. Generally, you can pick any three columns from any integer type table you want. And see this way, how they are related algorithmically. Can be more than three columns also.

The name of the simulator (evoluator) is Critticall and it is available at http://www.critticall.com

36 comments

Comments sorted by top scores.

comment by wmorgan · 2012-02-01T22:10:13.652Z · LW(p) · GW(p)

The generated code is bizarre. I refactored it as well as I could, and it still doesn't make much sense:

aphelion = (aphelion + perihelion) >> 10;
aphelion = aphelion - (aphelion / 12);
guess = ( ( (aphelion | 12) * (int)sqrt(aphelion) ) ^ 12 ) / 12;

"To get the orbit time in days from the aphelion and perihelion in Kkm, first sum them and divide by 1024. Then from that, subtract one twelfth. Then, to the value, perform a bitwise OR with 0x0C, multiply by the square root, and bit-XOR 0x0C again. Finally, divide by 12, and that will give you the number of days."

Replies from: jimrandomh, RolfAndreassen, Thomas, Douglas_Knight
comment by jimrandomh · 2012-02-02T02:03:55.753Z · LW(p) · GW(p)

The three problems with the code are that the variable names are all lies, there's a bunch of redundant rescaling which isn't consolidated because it's done in integer math when it should be floating point, and there are a couple bits of overfitting (bitwise operators) that don't belong. If you convert to SSA and wipe out the misleading names, you get:

a1 = perihelion+aphelion  Real
a2 = a1+a1                  Rescaling
a3 = a2+a2                  Rescaling
g1 = 12                     Rescaling
a4 = a3>>g1                 Rescaling
t1 = a4/g1                  Rescaling
a5 = a4-t1                  Rescaling
d1 = sqrt(a5)             Real
a6 = g1|a5                  Overfit
a7 = a6*d1                Real
a8 = g1^a7                  Overfit
guess = a8/g1               Rescaling

If you replace the overfitting with pass-throughs (a6=a5, a8=a7), then pretend it's floating point so that you can consolidate all the rescaling into a single multiplicative constant, you get

guess = k * (perihelion+aphelion)*sqrt(perihelion+aphelion)

Which is Kepler's third law.

comment by RolfAndreassen · 2012-02-01T22:28:34.500Z · LW(p) · GW(p)

This reminds me of the discussion from last week of the code that a self-modifying AI might produce; I said then that I thought people were not thinking Weirdly enough. This is indeed an example of Weirdness. Obviously no human would come up with such a thing. Yet it works, more or less; although the hardcoded numbers make me suspect that its range of applicability may not be great. Testing it on some numbers typical of, say, solar orbits around the galactic core, might produce some surprising results.

It would also be interesting to compare this for speed, as well as accuracy, with a Kepler's-law implementation.

comment by Thomas · 2012-02-02T07:15:19.770Z · LW(p) · GW(p)

divide by 1024

Actually by 4096. And it is a rescaling as jimrandomh points out.

Replies from: FeepingCreature
comment by FeepingCreature · 2012-02-02T11:33:06.682Z · LW(p) · GW(p)

Am I crazy? A right shift by 10 is equivalent to a division by 2^10. 2^10 is 1024..

Replies from: Thomas
comment by Thomas · 2012-02-02T17:28:46.737Z · LW(p) · GW(p)

No. The posted code has a bit shift right for 12 places. The already optimized code by wmorgan has a bit shift for only 10 bits.

The metacommand $RESCOM if while val_operation inc_dec caused this. Having two constants (10 and 12) would be undesirable be cause of this "val_operation" and therefore only the constant 12 was used.

Replies from: wmorgan, jimrandomh
comment by wmorgan · 2012-02-02T19:18:41.000Z · LW(p) · GW(p)

This is the generated code segment:

aphelion=aphelion+aphelion;
aphelion=aphelion+aphelion;
guess=12;
aphelion=aphelion>>guess;

Those four lines together amount to a shift 10 bits to the right, i.e., division by 1024.

I think you understand what's going in the code. The point of my refactoring was to make something that was human-readable: something that I could describe in English. And the English for those four lines of code is "divide by 1024." That's what those four lines do.

Replies from: Thomas, Thomas
comment by Thomas · 2012-02-03T17:28:23.311Z · LW(p) · GW(p)

We can modify the above code to:

$DECLAREINT aphelion perihelion dif guess temp $RINVAR aphelion(1000,100000) perihelion(1000,100000) $RETVAR guess if (aphelion>guess; temp=aphelion/guess; aphelion=aphelion-temp; dif=sqrt(aphelion); //aphelion=guess|aphelion; aphelion=aphelion*dif; //aphelion=guess^aphelion; guess=aphelion/guess; $EES As you can see, there is no $RESCOM metacommand and the two "overfit" OR and XOR lines has also been commented or neutralized. Random aphelions and perihelions are between 1 million and 100 million km now. If aphelion is greater than perihelion they are swapped first. The intermediate variable "temp" is then reset to 0 before the "Kepler segment" begins. If it wasn't reset, the simulator would simply use it! Simulator comes out with this in the $BES-$EES section:

aphelion=perihelion+aphelion; aphelion>>=10; guess=12; temp=aphelion/guess; aphelion=aphelion-temp; perihelion=sqrt(aphelion); perihelion=perihelion*aphelion; guess=perihelion/guess; Less lines, but now two constants (10 and 12) (approximately) scale days with mega-meters here, where the Sun is this massive. Initially, it was only the 12 constant, which shifted, divided and was also a XOR argument to fit some more.

Both codes here, inside the $BES-$EES segments are exactly equivalent regarding the outputs and are a distant ancestor-descendant pair. Many million generations apart, but not very much different.

comment by Thomas · 2012-02-03T11:26:16.708Z · LW(p) · GW(p)

We can modify the above code to:

$DECLAREINT aphelion perihelion dif guess temp $RINVAR aphelion(1000,100000) perihelion(1000,100000) $RETVAR guess

if (aphelion<perihelion) { temp=perihelion; perihelion=aphelion; aphelion=temp; }

temp=0;
$BES aphelion=perihelion+aphelion; aphelion=aphelion+aphelion; aphelion=aphelion+aphelion; guess=12; aphelion=aphelion>>guess; temp=aphelion/guess; aphelion=aphelion-temp; dif=sqrt(aphelion); //aphelion=guess|aphelion; aphelion=aphelion*dif; //aphelion=guess^aphelion; guess=aphelion/guess; $EES

As you can see, there is no $RESCOM metacommand and the two "overfit" OR and XOR lines has also been commented or neutralized. Random aphelions and perihelions are between 1 million and 100 million km now. If aphelion is greater than perihelion they are swapped first. The intermediate variable "temp" is then reset to 0 before the "Kepler segment" begins. If it wasn't, the simulator would simply use it!

Simulator comes out with this in the $BES-$EES section:


aphelion=perihelion+aphelion;
aphelion>>=10;
guess=12;
temp=aphelion/guess;
aphelion=aphelion-temp;
perihelion=sqrt(aphelion);
perihelion=perihelion*aphelion;
guess=perihelion/guess;

Less lines, but now two constants (10 and 12) (approximately) scale days and mega-meters here, where the Sun is so massive. Before, it was only the constant of 12, which shifted, divided and was also a XOR argument to fit some more.

Both codes here, inside the $BES-$EES segment are exactly equivalent regarding the output and are a distant ancestor-descendant pair. Many million generations apart.

comment by jimrandomh · 2012-02-02T19:13:33.434Z · LW(p) · GW(p)

The extra two places of bit shifting cancel with two previous self-additions.

Replies from: Thomas
comment by Thomas · 2012-02-02T19:29:21.168Z · LW(p) · GW(p)

I know and I agree with this.

Then you have two different constants (10 and 12). One for the shifting and another for the division. It's nothing wrong with that, but the simulator was prevented to have more constants then absolutely necessary. So everything was done with the "12" and I was discussing that.

comment by Douglas_Knight · 2012-02-02T01:47:19.983Z · LW(p) · GW(p)

The XOR with 12 won't do much after dividing by 12. For small radii, OR with 12 (in units of about 10^6km) will have an effect. These two constants are probably just overfitting. Indeed, it nails Mercury, the smallest and thus most vulnerable to these effects.* Rounding** the square root is also probably overfitting or just noise. It will have a larger effect, but smooth across planets, so it is probably canceled out by the choice of other numbers. Ignoring those three effects, it is a constant times the 3/2 power of average of the axes. The deviation from Kepler's law is that it should ignore perihelion.*** But for un-eccentric orbits, there's no difference. Since the training data isn't eccentric, this failure is unsurprising. That is, the code is unsurprising; that the code is so accurate is surprising. That it correctly calculates the orbital period of Halley's comet, rather than underestimating by a factor of 2^(3/2) is implausible.***

* The control group is too homogeneous. If it contained something close in, overfitting for Mercury would have been penalized in the final evaluation.

** Are you sure it's rounding? [Edit: Yes: bitwise operations are strongly suggestive.]

*** These statements are wrong because I confused apehelion with the semi-major axis. So removing the bitwise operations yields exactly Kepler's law. If you switch from ints to doubles it becomes more accurate. But wmorgan has a constant error: it is divide by 4096, not 1024. This should make the rounding errors pretty bad for Mercury. Maybe the bitwise operations are to fix this, if they aren't noise. My C compiler does not reproduce the claimed error percentages, so I'm not going to pursue this.

comment by jmmcd · 2012-02-02T19:23:28.426Z · LW(p) · GW(p)

This is a fairly typical example of genetic programming, albeit written in a slightly "outsider" way.

No criticism, but readers should be given the standard name of the technique.

EDIT in particular it's GP symbolic regression. On the other hand it's atypical in that there's no crossover.

comment by wmorgan · 2012-02-01T22:13:40.730Z · LW(p) · GW(p)

Very interesting demonstration. Thanks for sharing this; it was fun to read through! I think I have a pretty good idea of how it works.

As a professional programmer:

That code it generated...is really, really shitty. It's unreadable, and for that reason, a human cannot look at the generated code and figure out "what's going on," i.e. Kepler's laws. Insofar as it works, it's much more reminiscent of 0x5f3759df, but that algorithm was optimizing for speed, not correctness or elegance.

I'm not surprised that the algorithm does worse on the control group, for the same reason that I'd question the assumption that it will do better on future generations. It could easily be over-fitting, partially because there is no selection pressure for an elegant solution. Empirically, elegant code does better in novel contexts.

Replies from: shminux
comment by shminux · 2012-02-02T01:23:27.776Z · LW(p) · GW(p)

"Evolution is dumb, but it works" -- I believe that showing that was the (implicit) goal, and it was amply demonstrated.

comment by siodine · 2012-02-01T22:11:49.020Z · LW(p) · GW(p)

Related.

Replies from: loveandthecoexistenc
comment by loveandthecoexistenc · 2012-02-04T01:29:32.291Z · LW(p) · GW(p)

Also related: mirror.fem-net.de/CCC/28C3/mp4-h264-HQ/28c3-4764-en-automatic_algorithm_invention_with_a_gpu_h264.mp4.torrent - another outsider perspective explanation of genetic programming.

Replies from: Anubhav, pedanterrific
comment by Anubhav · 2012-02-04T16:45:52.249Z · LW(p) · GW(p)

You need to escape the underscores.

comment by pedanterrific · 2012-02-04T18:26:13.148Z · LW(p) · GW(p)

Also related - another outsider perspective explanation of genetic programming.

comment by Incorrect · 2012-02-01T21:54:49.057Z · LW(p) · GW(p)

Usually it then just notices a syntax error and repairs everything to the last working version

That sounds like a terrible waste. Why not use a tree-based model of C that excludes as many invalid programs as practical?

Replies from: Thomas, shminux, jmmcd
comment by Thomas · 2012-02-02T07:36:21.037Z · LW(p) · GW(p)

That sounds like a terrible waste.

Technically speaking this is a very swift part, since every syntax error is detected before the code is rewritten and the mutations are just abandoned.

comment by shminux · 2012-02-02T01:21:02.407Z · LW(p) · GW(p)

Most random mutations are fatal, so it fits perfectly.

Replies from: RolfAndreassen
comment by RolfAndreassen · 2012-02-02T03:28:03.936Z · LW(p) · GW(p)

That depends on whether your purpose is to produce an analog with evolution, which we already have plenty of, or a new method for making working code.

comment by jmmcd · 2012-02-02T19:48:43.656Z · LW(p) · GW(p)

That is the more common approach. Grammatical evolution is one method.

comment by Houshalter · 2015-03-14T08:45:39.463Z · LW(p) · GW(p)

Eureqa is a much better tool for this. It come up with the solution Orbit = 0.01239*Aphelion in a few seconds. Some other solutions it found: http://i.imgur.com/tZwBedp.png?1

Replies from: Thomas
comment by Thomas · 2017-09-11T07:10:29.983Z · LW(p) · GW(p)

Kepler's law isn't O=constant*A. This is very wrong, silly even.

comment by Dmytry · 2012-02-06T20:05:46.866Z · LW(p) · GW(p)

The genetic programming may work better for creating a strong AI than to solve that kind of problem. The intelligence did evolve, heh, heh.

If you are trying to evolve a program that solves something of this kind, you end up with an entirely wrong algorithm, constants in which are going to evolve to make the best outcome there can be. The solution just gets stuck in local maximum. Like human eye that has retina backwards. It is more productive to try to iterate over possible solutions starting from the simplest (to implement occam's razor), using some separate process (hill climbing from several random points?) to tune the constants to their best.

comment by Luke_A_Somers · 2012-02-02T15:58:44.514Z · LW(p) · GW(p)

Given the overall lack of eccentricity among the planets, I suspect this would converge faster if you started over, switching the training and test groups.

comment by 911truther · 2012-02-01T23:39:24.754Z · LW(p) · GW(p)

This idea is based on a whole range of confusions and misunderstandings. Program code does not have the redundancy or flexibility of genetics, as you know with syntax errors it shatters if a single character is wrong - for this reason it's a mistake to use it as the carrier for the genetic paradaigm. Another mistake is extrapolating from a finite data source: you can't expect to get a correct program this way, the code cannot contain any algorithmic insight into the data for any reason other than pure chance. As you'll know the software crisis is still ongoing,and Critticall is just another example of people not understanding the true nature of programs and trying to get something for nothing.

Replies from: satt, army1987, Viliam_Bur
comment by satt · 2012-02-02T03:15:31.072Z · LW(p) · GW(p)

This is as good an opportunity as any to link this presentation and these cool papers about evolving patches for real-world C programs in minutes. In light of an actual working example of a computer program that fixes other programs' bugs, broad brush claims about how programs can't have algorithmic insight, won't lead to correct programs, and lack "the redundancy or flexibility of genetics" won't really wash.

Replies from: 911truther
comment by 911truther · 2012-02-02T16:24:58.665Z · LW(p) · GW(p)

This research (it is a single piece of research written up in 4 different ways) simply concerns taking a rough piece of wood (a program that is almost correct) and sanding down the edges (fixing a small number of test cases). I didn't say "programs can't have algorithmic insight", I said randomly generating problems by "evolutionary" means will not contain insight by any other means than coincidence. The research you linked doesn't contradict that because all it concerns is smoothing down rough edges. Degeneracy is one of the fundamental features of a genetic code that is required for the theory of evolution to apply so I don't know why you say that "doesn't wash", it's a fact.

Replies from: jmmcd
comment by jmmcd · 2012-02-02T19:45:42.732Z · LW(p) · GW(p)

Degeneracy is an important feature of DNA-based evolution (biological evolution as-it-is) but it's not fundamental to evolution as-it-could-be.

comment by A1987dM (army1987) · 2012-02-02T19:47:58.310Z · LW(p) · GW(p)

Program code does not have the redundancy or flexibility of genetics, as you know with syntax errors it shatters if a single character is wrong

And a single mutation in DNA in the wrong place can be disastrous, too.

comment by Viliam_Bur · 2012-02-02T12:16:58.640Z · LW(p) · GW(p)

Program code does not have the redundancy or flexibility of genetics

A well-written program does not have it, but many real-life programs do. In my programming environments I have usually warnings turned very sensitive, for example they complain about declared but never used variables/functions, and when I read other people's code, I see this often.

with syntax errors it shatters if a single character is wrong

Could we see a program with syntax error as a fatal mutation? Fortunately, these fatal mutations are easy to detect. (Alternatively, we could mutate the parse trees.)

Another mistake is extrapolating from a finite data source: you can't expect to get a correct program this way, the code cannot contain any algorithmic insight into the data for any reason other than pure chance.

Still I have seen humans do this and take their paychecks. Yes, I would expect such evolved program be full of bugs; but real-life applications also contain bugs and it does not prevent them from being useful. I admit I would hate to read the machine-generated code, or even worse fix the bugs or add new functionality. But this is like saying that reading DNA is, ahem, not very user-friendly. Sure, it is not; but it works.

Two more things to consider. First, unit tests -- the better unit tests you make, the higher chance is that a random program that passes them is correct. If test-driven development can reduce random human errors, it can also reduce computer-generated error, though of course the computer will make much more errors. Second, we are OK with randomized algorithms having some chance of error, as long at the chance is very very small.

So the prior probability of computer generating a correct algorithm is small, but non-zero. By making unit tests we increase the probability that the result will be correct. If we can increase this probability from epsilon to 1 - epsilon, we win.

Replies from: 911truther
comment by 911truther · 2012-02-02T16:35:19.566Z · LW(p) · GW(p)

Randomized algorithms are completely different to random program code: Randomized algorithms actually fit a precise specification while randomized algorithms just (try to) fit a range of test cases (hopefully I don't need to explain why a range of test cases is not an adequete specification). It's a misconception that unit tests help increase the probability of mistakes, A unit test cannot ever tell you that your program is free of errors. They're the complete wrong paradaigm for making correct programs.

You are trying to submit too fast. try again in 9 minutes. every time I post a comment.

Replies from: Viliam_Bur
comment by Viliam_Bur · 2012-02-03T09:41:58.345Z · LW(p) · GW(p)

Randomized algorithms are completely different to random program code

The distinction between data and code is mostly useful, but in some situations we intentionally cross the boundary. If you have a text editor, then the separation is valid -- the program is code, and the text documents are data. But if you have a compiler, you have a compiler program, which is code, and the compiled program, which is data, but on a different level it is also code. If you have a dynamically loaded library, while you load it, it is data, but then you start treating it as code by running it. If you have an interpreter, which is a code, you emulate the program, which on one level is data and simultaneously on another level is code. If you have a decompiler or debugger, which is code, it takes another code and treats it like data. A bootloader loads data and then starts them as a code. Etc.

This means that treating data as code, or code as data, is not necessarily a mistake or confusion of levels, and sometimes it is unavoidable. How else could you for example start a program, which originally exists only as a sequence of bytes in a file on a disk? You treat it as data when loading the file to a memory, and then you treat it as a code.

A unit test cannot ever tell you that your program is free of errors.

Some maths here: let X be a problem we want to solve, and let's suppose that the problem is algorithmically solvable. Further, let P be a set of all programs, PX a set of programs that correctly solve problem X, U is a unit test and PU is a set of programs that passes the unit test.

We will call U a correct unit test for problem X if each program from PX belongs to PU (a correct program always passes the unit test) and some programs in P-PX don't belong to PU (some incorrect programs don't pass the unit test). In other words, PX is a subset of PU, which is a strict subset of P.

A random code generator will provide a random distribution R of programs. Now the question is whether R ∩ PX is non-empty, in other words whether the generator is capable of generating a correct solution assuming it is incredibly lucky. This depends on the generator and the problem. For example if generator only creates random programs up to 100 lines of code, but the shortest possible code that solves the problem has 150 lines, then the generator cannot create a correct program. But if the generator is capable to create a program of any length, or more precisely capable of creating any program, then if the program is algorithmically solvable, the generator will generate the correct program with a non-zero probability. Just imagine that you wrote the correct program -- well, with some non-zero probability, the output of the random generator will be exactly the same text. The only problem is that p(x in PX | x in R) is very small. Sometimes so small that even if we could reliably automatically test whether a program belongs to PX, we just wouldn't have enough time to wait until first such program is generated. But let's assume that it is a simple enough program, so if we generate a few millions of programs, we can expect that some of them belong to PX.

Yeah, we have a couple assumptions here. Again, they are: (1) a correct program exists, and (2) some correct program has Kolmogorov complexity small enough (i.e. has a short code, if written optimally) that we can realistically wait until a random generator generates it. Actually, also (3) some correct program with small Kolmogorov complexity runs fast enough that it will pass the unit test in given time limit.

Assuming this all, we have a set of generated random programs R, some of them belong to PX, and we are trying to find them. If we replace the set R with R ∩ PU, we have a smaller set, and R ∩ PX is a subset of R ∩ PU, so we did not lose any solution. In probabilistic terms, p(x in R ∩ PX | x in R ∩ PU) is greater than p(x in PX | x in R), in human speech: if we choose a random program that passes the unit test, we have better chance than if we choose any random program.

Question is, how much does the test U reduce the set of all programs. If only 1% of programs passes the unit test, then p(x in R ∩ PX | x in R ∩ PU) = 100×p(x in R ∩ PX | x in R), so by doing the unit test we increased our luck 100 times... though the result may be still pretty close to zero. We can filter the results by another test, but now we need that only a few of programs that have passed the previous test can pass the next test. Which may be difficult; and also pretty difficult to estimate in advance. We can only measure it afterwards, how many programs that passed the first test were filtered by the second one. The idea is, if a generated program has chance 1:10^10 to be correct, if we can make a sequence of 5 unit tests where each one filters out 99% of programs that have passed the previous tests, we have a decent chance at the end.

Of course to make all this work, we need a decent generator, which will give programs reasonable prior probabilities, therefore it would be better to generate parse trees than source texts, etc. We could even invent a new language that could allow us to write shorter programs. For example it should have an instruction "forAllItemsInSet(SET) { X | CODE }" made of 4 symbols, instead of "for (X = 1; X < SET.size; X++) { CODE }" with 11 symbols, etc.