Comments

Comment by marks on Significance of Compression Rate Method · 2010-06-06T04:28:37.982Z · LW · GW

This isn't precisely what Daniel_Burfoot was talking about, but it's a related idea based on "sparse coding", and it has recently obtained good results in classification:

http://www.di.ens.fr/~fbach/icml2010a.pdf

Here the "theories" are hierarchical dictionaries (so a discrete hierarchy index set plus a set of vectors) which perform a compression (by creating reconstructions of the data). Although they weren't developed with this in mind, support vector machines also do this as well, since one finds a small number of "support vectors" that essentially allow you to compress the information about decision boundaries in classification problems (support vector machines are one of the very few things from machine learning that have had significant and successful impacts elsewhere since neural networks).

The hierarchical dictionaries learned do contain a "theory" of the visual world in a sense, although an important idea is that they do so in a way that is sensitive to the application at hand. Daniel_Burfoot leaves out much about how people actually go about implementing this line of thought.
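As a rough illustration of the non-hierarchical version of the idea, here is a minimal sketch using scikit-learn's DictionaryLearning (the hierarchical structure and classification results of the linked paper are not reproduced, and the random data merely stands in for image patches):

```python
# Minimal sparse-coding sketch: learn a dictionary D and sparse codes so
# that X is approximately codes @ D, i.e. each sample is reconstructed
# ("compressed") from a few dictionary atoms.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))            # stand-in for image patches

learner = DictionaryLearning(n_components=32, alpha=1.0,
                             transform_algorithm="lasso_lars",
                             random_state=0)
codes = learner.fit_transform(X)          # sparse codes, shape (200, 32)
D = learner.components_                   # dictionary atoms, shape (32, 64)

reconstruction = codes @ D
mse = np.mean((X - reconstruction) ** 2)
sparsity = np.mean(codes != 0)            # fraction of nonzero coefficients
print(f"reconstruction MSE: {mse:.3f}, nonzero code fraction: {sparsity:.2f}")
```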

Comment by marks on Significance of Compression Rate Method · 2010-06-06T04:13:29.746Z · LW · GW

[A text with some decent discussion on the topic](http://www.inference.phy.cam.ac.uk/mackay/itila/book.html). At least one group that has a shot at winning a major speech recognition benchmark competition uses information-theoretic ideas for the development of their speech recognizer. Another development has been the use of error-correcting codes to assist in multi-class classification problems ([google "error correcting codes machine learning"](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=error+correcting+codes+machine+learning)); arguably this has been the clearest example of a paradigm shift that comes from thinking about compression which had a big impact in machine learning. I don't know how many people think about these problems in terms of information theory questions (since I don't have much access to their thoughts), but I do know at least two very competent researchers who, although they never bring it outright into their papers, have an information-theory and compression-oriented way of posing and thinking about problems.

I often try to think of how humans process speech in terms of information theory (an approach inspired by a couple of great thinkers in the area), and thus I think it is useful for understanding and probing the questions of sensory perception.

There's also a whole literature on "sparse coding" (another compression-oriented idea originally developed by biologists but since ported over by computer vision and a few speech researchers) whose promise in machine learning may not have been realized yet, but I have seen at least a couple of somewhat impressive applications of related techniques appearing.

Comment by marks on Significance of Compression Rate Method · 2010-06-06T03:41:47.278Z · LW · GW

I have a minor disagreement, which I think supports your general point. There is definitely a type of compression going on in the algorithm; it's just that the key insight in the compression is not to just "minimize entropy" but rather to make the outputs of the encoder behave similarly to the observed data. Indeed, one of the major insights of information theory is that one wants the encoding scheme to capture the properties of the distribution over the messages (and hence over alphabets).

Namely, in Hinton's algorithm the outputs of the encoder are fed through a logistic function and then the cross-entropy is minimized (essentially the KL divergence). It seems that he's providing something more like a reparameterization of a probability mass function for pixel intensities, which is a logistic distribution when conditioned on the "deeper" nodes. Minimizing that KL divergence means that the distribution is made to be statistically indistinguishable from the distribution over the data intensities (since the KL divergence is the expected log-likelihood ratio, minimizing it means minimizing the power of the uniformly most powerful test).

Minimizing entropy blindly would mean the neural network nodes would give constant output, which is very compressive but utterly useless.
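A small numpy sketch of that point (with hypothetical pixel data, not Hinton's actual network): the cross-entropy reconstruction loss is small when the outputs match the data and large for a constant, low-entropy output.

```python
# Sketch: cross-entropy between observed pixel intensities and a
# reconstruction.  A reconstruction matching the data scores well; a
# constant ("minimum entropy") output scores badly.
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.uniform(size=1000)                 # observed intensities in [0, 1]

def cross_entropy(targets, reconstruction):
    eps = 1e-12
    return -np.mean(targets * np.log(reconstruction + eps)
                    + (1 - targets) * np.log(1 - reconstruction + eps))

matched = cross_entropy(pixels, pixels)                        # reconstruction == data
constant = cross_entropy(pixels, np.full_like(pixels, 0.01))   # near-constant output

print(f"matching reconstruction: {matched:.3f}")
print(f"constant output:         {constant:.3f}")              # much larger
```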

Comment by marks on Cultivating our own gardens · 2010-06-02T18:28:29.608Z · LW · GW

> This attacks a straw-man utilitarianism, in which you need to compute precise results and get the one correct answer. Functions can be approximated; this objection isn't even a problem.

Not every function can be approximated efficiently, though. I see the scope of morality as addressing human activity, and human activity is itself a function space. In this case the "moral gradient" that the consequentialist is computing is based on a functional defined over a function space. There are plenty of function spaces and functionals which are very hard to approximate efficiently (the Bayes predictors for speech recognition and machine vision fall into this category), and naive approaches often fail miserably.

I think the critique of utility functions is not that they don't provide meaning, but that they don't necessarily capture the meaning which we would like. The incoherence argument is that there is no utility function which can represent the thing we want to represent. I don't buy this argument mostly because I've never seen a clear presentation of what it is that we would preferably represent, but many people do (and a lot of these people study decision-making and behavior whereas I study speech signals). I think it is fair to point out that there is only a very limited biological theory of "utility" and generally we estimate "utility" phenomenologically by studying what decisions people make (we build a model of utility and try to refine it so that it fits the data). There is a potential that no utility model is actually going to be a good predictor (i.e. that there is some systematic bias). So, I put a lot of weight on the opinions of decision experts in this regard: some think utility is coherent and some don't.

The deontologist's rules seem to do pretty well: many of them are sitting in law books right now. They form the basis for much of the morality that parents teach their children. Most utilitarians follow most of them all the time, anyway.

My personal view is to do what I think most people do: accept many hard constraints on one's behavior and attempt to optimize over estimates of projections of a moral gradient along a few dimensions of decision-space. I.e., I try to think about how my research may be able to benefit people, I try to help out my family and friends, and I try to support things good for animals and the environment. These are areas where I feel more certain that I have some sense of where some sort of moral objective function points.

Comment by marks on Cultivating our own gardens · 2010-06-02T02:39:08.515Z · LW · GW

I would like you to elaborate on the incoherence of deontology so I can test out how my optimization perspective on morality can handle the objections.

Comment by marks on Cultivating our own gardens · 2010-06-02T02:37:46.465Z · LW · GW

To be clear, I see the deontologist optimization problem as being a pure "feasibility" problem: one has hard constraints and zero gradient (or approximately zero gradient) on the moral objective function given all decisions that one can make.

Of the many, many critiques of utilitarianism, some argue that it's not sensible to actually talk about a "gradient" or marginal improvement in moral objective functions. These range from arguments based on computational constraints (there's no way you could ever reasonably compute a moral objective function, because the consequences of any activity are much too complicated) to critiques arguing that the utilitarian notion of "utility" is ill-defined and incoherent (hence the moral objective function has no meaning). These sorts of arguments argue against the possibility of soft constraints and moral objective functions with gradients.

The deontological optimization problem, on the other hand, is not susceptible to such critiques because the objective function is constant, and the satisfaction of constraints is a binary event.

I would also argue that the most hard-core utilitarian practically acts pretty similarly to a deontologist. The reason is that we only consider a tiny subspace of all possible decisions, our estimate of the moral gradient will be highly inaccurate along most possible decision axes (I buy the computational-constraint critique), and it's not clear that we have enough information about human experience in order to compute those gradients. So, practically speaking, we only consider a small number of different ways to live our lives (hence we optimize over a limited range of axes), and the directions we optimize over are non-random for the most part. Think about how most activists, and most individuals who perform any sort of advocacy, focus on a single issue.

Also consider the fact that most people don't murder or commit certain horrendous crimes. These single-issue-thinking, law-abiding types may not think of themselves as deontologists, but a deontologist would behave very similarly to them, since neither attempts to estimate moral gradients over decisions and both treat many moral rules as binary events.

The utilitarian and the deontologist are distinguished in practice in that the utilitarian computes a noisy estimate of the moral gradient along a few axes of their potential decision-space, while everywhere else we think of hard constraints and no gradients on the moral objective. The pure utilitarian is at best a theoretical concept that has no potential basis in reality.

Comment by marks on Cultivating our own gardens · 2010-06-01T17:43:52.525Z · LW · GW

I would argue that deriving principles using the categorical imperative is a very difficult optimization problem and that there is a very meaningful sense in which one is a deontologist and not a utilitarian. If one is a deontologist then one needs to solve a series of constraint-satisfaction problems with hard constraints (i.e. they cannot be violated). In the Kantian approach, given a situation, one first derives via moral thinking the constraints under which one must act in that situation, and then one acts in accordance with those constraints.

This is very closely related to combinatorial optimization problems. I would argue that often there is a "moral dual" (in the sense of a dual program) where those constraints are no longer treated as absolute: you can assign different costs to each violation and then find a most moral strategy. I think very often we have something akin to strong duality, where the utilitarian dual is equivalent to the deontological problem, but it's an important distinction to remember that the deontologist has hard constraints and zero gradient on their objective functions (by some interpretations).

The utilitarian performs a search over a continuous space for the greatest expected utility, while the deontologist (in an extreme case) has a discrete set of choices, from which the immoral ones are successively weeded out.
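A toy sketch of that contrast (the options, utilities, and violation costs below are invented purely for illustration):

```python
# Toy contrast on a made-up decision problem.  Each option has an invented
# utility and a set of hard-rule violations.
options = {
    "lie_to_protect": {"utility": 8.0, "violations": {"lying"}},
    "stay_silent":    {"utility": 3.0, "violations": set()},
    "tell_the_truth": {"utility": 1.0, "violations": set()},
}

# Deontologist: a pure feasibility problem -- weed out anything violating a
# hard constraint; no gradient, no ranking among the survivors.
permissible = {name for name, o in options.items() if not o["violations"]}

# Utilitarian "dual": soft constraints with a finite cost per violation,
# then maximize the penalized objective.
violation_cost = 4.0
def penalized_utility(o):
    return o["utility"] - violation_cost * len(o["violations"])

best = max(options, key=lambda name: penalized_utility(options[name]))

print("deontologically permissible:", permissible)
print("utilitarian choice:", best)
```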

Both are optimization procedures, and can be shown to produce very similar output behavior but the approach and philosophy are very different. The predictions of the behavior of the deontologist and the utilitarian can become quite different under the sorts of situations that moral philosophers love to come up with.

Comment by marks on Cultivating our own gardens · 2010-06-01T17:05:36.540Z · LW · GW

I agree with the beginning of your comment. I would add that the authors may believe they are attacking utilitarianism, when in fact they are commenting on the proper methods for implementing utilitarianism.

I disagree that attacking utilitarianism involves arguing for different optimization theory. If a utilitarian believed that the free market was more efficient at producing utility then the utilitarian would support it: it doesn't matter by what means that free market, say, achieved that greater utility.

Rather, attacking utilitarianism involves arguing that we should optimize for something else: for instance something like the categorical imperative. A famous example of this is Kant's argument that one should never lie (since it could never be willed to be a universal law, according to him), and the utilitarian philosopher loves to retort that lying is essential if one is hiding a Jewish family from the Nazis. But Kant would be unmoved (if you believe his writings); all that would matter are these universal principles.

Comment by marks on Diseased thinking: dissolving questions about disease · 2010-06-01T14:52:57.949Z · LW · GW

Bear in mind that having more fat means that the brain gets starved of [glucose](http://www.loni.ucla.edu/~thompson/ObesityBrain2009.pdf) and that blood sugar levels have [impacts on the brain generally](http://ajpregu.physiology.org/cgi/content/abstract/276/5/R1223). Some research has indicated that the amount of sugar available to the brain has a relationship with self-control. A moderately obese person may have fat cells that steal so much glucose from their brain that their brain is incapable of mustering the will to get them to stop eating poorly. Additionally, the marginal fat person is likely fat because of increased sugar consumption (sugar being the main food whose intake has increased since the origins of the obesity epidemic in the 1970s); in particular there has been a great increase in the consumption of fructose, which is capable of raising insulin levels (which signal to the body to start storing energy as fat) while at the same time not activating leptin (which makes you feel full). Thus, people are consuming a substance that may be kicking their bodies into full gear to produce more fat, which leaves them with no energy or will to perform any exercise.

The individuals most affected by the obesity epidemic are the poor, and recall that some of the cheapest sources of calories available on the market are foods like fructose and processed meats. While there is a component of volition regardless, if the body works as the evidence suggests, they may have a diet that is pushing them quite hard towards being obese, sedentary, and unable to do anything about it.

Think about it this way: if you constantly whack me over the head you can probably get me to do all sorts of things that I wouldn't normally do, but it wouldn't be right to call my behavior in that situation "voluntary". Fat people may be in a similar situation.

Comment by marks on Cultivating our own gardens · 2010-06-01T04:11:36.545Z · LW · GW

I think that this post has something to say about political philosophy. The problem as I see it is that we want to understand how our local decision-making affects the global picture and what constraints we should put on our local decisions. This is extremely important because, arguably, people make a lot of local decisions that make us globally worse off, such as pollution ("externalities" in econo-speak). I don't buy the author's belief that we should ignore these global constraints: they are clearly important--indeed it's the fear of the potential global outcomes of careless local decision-making that arguably led to the creation of this website.

However, just like computers, we have a lot of trouble integrating the global constraints into our decision-making (which is necessarily a local operation), and we probably have a great deal of bias in our estimates of what is the morally best set of choices for us to make. Just like the algorithm, we would like to find some way to lessen the computational burden on us in order to achieve these moral ends.

There is an approach in economics to understanding social norms, advocated by Herbert Gintis [PDF], that is able to analyze these sorts of scenarios. The essential idea is this: agents can engage in multiple correlated equilibria (a generalization of Nash equilibria) made possible by various social norms. These correlated equilibria are, in a sense, patched together by a social norm from the "rational" (self-interested, locally expected-utility-maximizing) agents' decisions. Human rights could definitely be understood in this light (I think: I haven't actually worked out the model).
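To make "correlated equilibrium" concrete, here is a small numeric check in the textbook game of Chicken (the standard example, not Gintis's model): a norm recommends a joint action, and neither player gains by unilaterally ignoring the recommendation.

```python
# Numeric check of a correlated equilibrium in the game of Chicken.  A
# "norm" (mediator) recommends a joint action; no player gains by
# unilaterally deviating from their recommendation.
actions = ["Dare", "Chicken"]
payoff = {  # (row action, col action) -> (row payoff, col payoff)
    ("Dare", "Dare"): (0, 0),
    ("Dare", "Chicken"): (7, 2),
    ("Chicken", "Dare"): (2, 7),
    ("Chicken", "Chicken"): (6, 6),
}
# The norm: never both dare; the other three joint actions each w.p. 1/3.
norm = {("Dare", "Chicken"): 1/3, ("Chicken", "Dare"): 1/3,
        ("Chicken", "Chicken"): 1/3, ("Dare", "Dare"): 0.0}

def obeying_is_best(player):
    for rec in actions:                      # recommendation to this player
        cond = {j: p for j, p in norm.items() if j[player] == rec}
        total = sum(cond.values())
        if total == 0:
            continue
        def expected(play):
            ev = 0.0
            for joint, p in cond.items():
                deviated = list(joint)
                deviated[player] = play
                ev += (p / total) * payoff[tuple(deviated)][player]
            return ev
        if any(expected(d) > expected(rec) + 1e-12 for d in actions):
            return False
    return True

print("correlated equilibrium:", obeying_is_best(0) and obeying_is_best(1))
```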

Similar reasoning may also be used to understand certain types of laws and government policies. It is via these institutions (norms, human organizations, etc.) that we may efficiently impose global constraints on people's local decision-making. The karma system on Less Wrong, for instance, probably changes the way people decide whether to comment.

There is probably a computer science-economics crossover paper here that would describe how institutions can lower the computational burden on individuals in their decision-making, so that when individuals make decisions in these simpler domains we can be sure that we will still be globally better off.

One word of caution is that this is precisely the rationale behind "command economies", and these didn't work out so well during the 20th century. So choosing the "patching together" institution well is absolutely essential.

Comment by marks on Cultivating our own gardens · 2010-06-01T03:49:28.123Z · LW · GW

I think there is definitely potential to the idea, but I don't think you pushed the analogy quite far enough. I can see an analogy between what is presented here and both human rights and Kantian moral philosophy.

Essentially, we can think of human rights as what many people believe to be essential bare-minimum conditions on human treatment; i.e., in the class of all "good and just" worlds everybody's human rights will be respected. Here human rights correspond to the "local rigidity" condition of the subgraph. In general, too, human rights are only meaningful for the people one immediately interacts with in one's social network.

This does simplify the question of just government and moral action in the world (as political philosophers are so desirous of using such arguments). I don't think, however, that the local conditions for human existence are as easy to specify as in the case of a sensor network graph.

In some sense there is a tradition largely inspired by Kant that attempts to do the moral equivalent of what you are talking about: use global regularity conditions (on morals) to describe local conditions (on morals: say the ability to will a moral decision to a universal law). Kant generally just assumed that these local conditions would achieve the necessary global requirements for morality (perhaps this is what he meant by a Kingdom of Ends). For Kant the local conditions on your decision-making were necessary and sufficient conditions for the global moral decision-making.

In your discussion (and in the approach of the paper), however, the local conditions placed (on morals or on each patch) are not sufficient to achieve the global conditions (for morality, or on the embedding). So it's a weakening of the approach advanced by Kant. The idea seems to be that once some aspects (but not all) of the local conditions have been worked out one can then piece together the local decision rules into something cohesive.

Edit: I rambled, so I put my other idea into another comment.

Comment by marks on Significance of Compression Rate Method · 2010-05-30T22:21:00.476Z · LW · GW

All the sciences mentioned above definitely do rely on controlled experimentation. But their central empirical questions are not amenable to being directly studied by controlled experimentation. We don't have multiple earths or natural histories upon which we can draw inference about the origins of species.

There is a world of difference between saying "I have observed speciation under these laboratory conditions" and "speciation explains observed biodiversity". These are distinct types of inferences. This of course does not mean that people who perform inference on natural history don't use controlled experiments: indeed, they should draw on as much knowledge as possible about the mechanisms of the world in order to construct plausible theories of the past. But they can't run the world multiple times under different conditions to test their theories of the past in the way that we can test speciation.

Comment by marks on Significance of Compression Rate Method · 2010-05-30T22:09:34.881Z · LW · GW

I think we are talking past each other. I agree that those are experiments in a broad and colloquial use of the term. They aren't "controlled" experiments, which is the term I wanted to clarify (since I know a little bit about it). This means that they do not allow you to randomly assign treatments to experimental units, which generally means that the risk of bias is greater (hence the statistical analysis must be done with care and the conclusions drawn should face greater scrutiny).

Pick up any textbook on statistical design or statistical analysis of experiments and the framework I gave will be what's in there for "controlled experimentation". There are other types of experiments, but these suffer from the problem that it can be difficult to sort out hidden causes. Suppose we want to know if the presence of A causes C (say, eating meat causes heart disease). In an observational study we find units having trait A and units not having it (so find meat-eaters and vegetarians) and we then wait to observe response C. If we observe a response C in experimental units possessing trait A, it's hard to know if A causes C or if there is some third trait B (present in some of the units) which causes both A and C.

In the case of a controlled experiment, A is now a treatment and not a trait of the units (so in this case you would randomly assign a carnivorous or vegetarian diet to people); thus we can randomly assign A to the units (and the randomization means that not every unit having hidden trait B will be given treatment A). In this case we might observe that A and C have no relation, whereas in the observational study we might see one. (For instance, people who choose to be vegetarian may be more focused on health.)
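A quick simulation of that confounding story (all of the numbers are invented): a hidden trait B drives both the self-selected "exposure" A and the response C, so the observational contrast is biased while the randomized one is not.

```python
# Simulation of confounding: B causes both A (self-selected) and C, and A
# has no causal effect on C at all.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
B = rng.binomial(1, 0.5, size=n)            # hidden trait (e.g. health focus)

# Observational study: A depends on B; C depends only on B.
A_obs = rng.binomial(1, np.where(B == 1, 0.2, 0.8))
C_obs = rng.binomial(1, np.where(B == 1, 0.1, 0.3))
obs_effect = C_obs[A_obs == 1].mean() - C_obs[A_obs == 0].mean()

# Controlled experiment: A is randomly assigned, independent of B.
A_rct = rng.binomial(1, 0.5, size=n)
C_rct = rng.binomial(1, np.where(B == 1, 0.1, 0.3))
rct_effect = C_rct[A_rct == 1].mean() - C_rct[A_rct == 0].mean()

print(f"observational 'effect' of A on C: {obs_effect:+.3f}")  # spurious
print(f"randomized estimate:              {rct_effect:+.3f}")  # near zero
```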

An example of how econometricians have dealt with "selection bias", or the fact that observational studies fail to have certain nice properties of controlled experiments, is here.

Comment by marks on Significance of Compression Rate Method · 2010-05-30T20:21:57.938Z · LW · GW

I think it's standard in the literature: "The word experiment is used in a quite precise sense to mean an investigation where the system under study is under the control of the investigator. This means that the individuals or material investigated, the nature of the treatments or manipulations under study and the measurement procedures used are all settled, in their important features at least, by the investigator." The theory of the design of experiments

To be sure there are geological experiments where one, say, takes rock samples and subjects various samples to a variety of treatments, in order to simulate potential natural processes. But there is another chunk of the science which is meant to describe the Earth's geological history and for a controlled experiment on that you would need to control the natural forces of the Earth and to have multiple Earths.

The reason why one needs to control an experiment (this is a point elaborated on at length in Cox and Reid) is in order to prevent bias. Take the hypothesis of continental drift. We have loads of "suspicious coincidences" that suggest continental drift (such as similar fossils on different landmasses, certain kinds of variations in the magnetic properties of the seafloor, the fact that the seafloor rocks are much younger than land rocks, and earthquake patterns/fault lines). Critically, however, we don't have an example of an earth that doesn't have continental drift. It is probably the case that some piece of "evidence" currently used to support the theory of continental drift will turn out to be a spurious correlation. It's very difficult to test for these because of the lack of control. The fact that we are almost certainly on a continental-drifting world biases us towards thinking that some geological phenomenon is caused by drift even when it is not.

Comment by marks on Significance of Compression Rate Method · 2010-05-30T15:13:16.591Z · LW · GW

Those sciences are based on observations. Controlled experimentation requires that you have some set of experimental units to which you randomly assign treatments. With geology, for instance, you are trying to figure out the structure of the Earth's crust (mostly). There are no real treatments that you apply; instead you observe the "treatments" that have been applied by the earth to the earth. I.e., you can't decide which area will have a volcano or an earthquake; you can't choose to change the direction of a plate or change the configuration of the plates; you can't change the chemical composition of the rock at large scale; etc.

All one can do is carefully collect measurements, build models of them, and attempt to create a cohesive picture that explains the phenomena. Control implies that you can do more than just collect measurements.

Comment by marks on Be a Visiting Fellow at the Singularity Institute · 2010-05-25T10:05:03.956Z · LW · GW

Bear in mind that the people who used steam engines to make money didn't make it by selling the engines: rather, the engines were useful in producing other goods. I don't think that the creators of a cheap substitute for human labor (GAI could be one such example) would be looking to sell it necessarily. They could simply want to develop such a tool in order to produce a wide array of goods at low cost.

I may think that I'm clever enough, for example, to keep it in a box and ask it for stock market predictions now and again. :)

As for the "no free lunch" business, while its true that any real-world GAI could not efficiently solve every induction problem, it wouldn't need to either for it to be quite fearsome. Indeed being able to efficiently solve at least the same set of induction problems that humans solve (particularly if its in silicon and the hardware is relatively cheap) is sufficient to pose a big threat (and be potentially quite useful economically).

Also, there is a non-zero possibility that there already exists a GAI and its creators decided the safest, most lucrative, and most beneficial thing to do is set the GAI on designing drugs, thereby avoiding giving the GAI too much information about the world. The creators could have then set up a biotech company that just so happens to produce a few good drugs now and again. It's kind of like how automated trading came from computer scientists and not the currently employed traders. I do think it's unlikely that somebody working in medical research is going to develop GAI, least of all because of the job threat. The creators of a GAI are probably going to be full-time professionals who are working on the project.

Comment by marks on Link: Strong Inference · 2010-05-23T23:05:20.264Z · LW · GW

Go to 1:00 minute here

"Building the best possible programs" is what he says.

Comment by marks on Link: Strong Inference · 2010-05-23T17:09:29.629Z · LW · GW

It actually comes from Peter Norvig's definition that AI is simply good software, a comment that Robin Hanson made, and the general theme of Shane Legg's definitions, which frame intelligence in terms of ways of achieving particular goals.

I would also emphasize that the foundations of statistics can (and probably should) be framed in terms of decision theory (see DeGroot, "Optimal Statistical Decisions", for what I think is the best book on the topic; as a further note, the decision-theoretic perspective is neither frequentist nor Bayesian: those two approaches can both be understood through decision theory). The notion of an AI as being like an automated statistician captures at least the spirit of how I think about what I'm working on, and this requires fundamentally economic thinking (in terms of the tradeoffs) as well as notions of utility.
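As a tiny sketch of the decision-theoretic framing (an illustration of my own, not anything from DeGroot): the optimal point estimate is whatever minimizes posterior expected loss, so the loss function, not the paradigm, dictates the answer.

```python
# Statistics as decision theory: choose the action (point estimate) that
# minimizes posterior expected loss.  Squared error favors the posterior
# mean; absolute error favors the posterior median.
import numpy as np

rng = np.random.default_rng(0)
posterior = rng.gamma(shape=2.0, scale=1.5, size=20_000)   # simulated, skewed posterior

candidates = np.linspace(0.0, 10.0, 501)

def bayes_action(loss):
    risks = [np.mean(loss(posterior, a)) for a in candidates]
    return candidates[int(np.argmin(risks))]

sq = bayes_action(lambda theta, a: (theta - a) ** 2)
ab = bayes_action(lambda theta, a: np.abs(theta - a))

print(f"squared-error action  {sq:.2f}  (posterior mean   {posterior.mean():.2f})")
print(f"absolute-error action {ab:.2f}  (posterior median {np.median(posterior):.2f})")
```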

Comment by marks on Link: Strong Inference · 2010-05-23T17:00:18.708Z · LW · GW

The fact that there are so many definitions and no consensus is precisely the unclarity. Shane Legg has done us all a great favor by collecting those definitions together. With that said, his definition is certainly not the standard in the field and many people still believe their separate definitions.

I think his definitions often lack an understanding of the statistical aspects of intelligence, and as such they don't give much insight into the part of AI that I and others work on.

Comment by marks on Link: Strong Inference · 2010-05-23T03:48:17.573Z · LW · GW

I think there is a science of intelligence which (in my opinion) is closely related to computation, biology, and production functions (in the economic sense). The difficulty is that there is much debate as to what constitutes intelligence: there aren't any easily definable results in the field of intelligence nor are there clear definitions.

There is also the engineering side: this is to create an intelligence. The engineering is driven by a vague sense of what an AI should be, and one builds theories to construct concrete subproblems and give a framework for developing solutions.

Either way this is very different from astrophysics, where one is attempting to, say, explain the motions of the heavenly spheres, which have a regularity, simplicity, and clarity to them that is lacking in any formulation of the AI problem.

I would say that AI researchers do formulate theories about how to solve particular engineering problems for AI systems, and then they test them out by programming them (hopefully). I suppose I count, and that's certainly what I and my colleagues do. Most papers in my fields of interest (machine learning and speech recognition) usually include an "experiments" section. I think that when you know a bit more about the actual problems AI people are solving you'll find that quite a bit of progress has been achieved since the 1960s.

Comment by marks on Chicago Meetup · 2010-05-22T05:02:03.732Z · LW · GW

I'd meet on June 6 (tentatively). South side is preferable if there are other people down here.

Comment by marks on Tips and Tricks for Answering Hard Questions · 2010-01-18T19:33:49.034Z · LW · GW

Thanks for the link assistance.

I agree that my mathematics example is insufficient to prove the general claim: "One will master only a small number of skills". I suppose a proper argument would require an in-depth study of people who solve hard problems.

I think the essential point of my claim is that there is high variance with respect to the subset of the population that can solve a given difficult problem. This seems to be true in most of the sciences and engineering to the best of my knowledge (though I know mathematics best). The theory that I believe explains why this variation occurs is that the subset of people who can solve a given problem use unconscious heuristics borne out of the hard work they put into previous problems over many years.

Admittedly, the problems I am thinking about are kind of like NP problems: it seems difficult to find a solution, but once a solution is found we can know it when we see it. There tends to be a large number of such problems that can be solved by only a small number of people. And the group of people that can solve them varies a lot from problem to problem.

There are also many hard problems for which it is hard to say what a good solution is (e.g. it seems difficult to evaluate different economic policies), or for which the "goodness" of a solution varies a lot with different value systems (e.g. abortion policy). It does seem that in these instances politicians claim they can give good answers to all the problems, as do management consulting companies. Public intellectuals and pundits also seem to think they can give good answers to lots of questions as well. I suppose that if they are right then my claim is wrong. I argue that such individuals and organizations claim to be able to solve many problems, but since it's hard to verify the quality of the solutions we should take the claim with a grain of salt. We know that individuals who can solve lots of problems would have a lot of status, so there is a clear incentive to claim to be able to solve problems that one cannot actually solve if verifying the solution is sufficiently costly.

I also think there is a good reason to think that even for those problems whose solutions are difficult to evaluate we should expect only a small number of people to actually give a good solution. The reason relates to a point made by Robin Hanson (and myself in another comment) which is that in solving a problem you should try to solve many at once. A good solution to a problem should give insight to many problems. Conversely, to understand and recognize a good solution to a given hard problem one should understand what it says about many other problems. The space of problems is too vast for any human being to know but a small portion, so I expect that people who are able to solve a given problem should only be those aware of many related problems and that most people will not be aware of the related problems. Given that in our civilization different people are exposed to different problems (no matter in which field they are employed) we should expect high variance of who can solve which hard problems.

Comment by marks on Tips and Tricks for Answering Hard Questions · 2010-01-18T03:03:53.352Z · LW · GW

Asking other people who have solved a similar problem to evaluate your answer is a very powerful and simple strategy to follow.

Also, most evidence I have seen is that you can only learn how to do a small number of things well. So if you are solving something outside of your area of expertise (which probably includes most problems you'll encounter during your life) then there is probably somebody out there who can give a much better answer than you (although the cost to find such a person may be too great).

Post Note: The fact that you can only learn a few things really well seems to be true with mathematics: as in here. More generally, mastering a topic seems to take ten years or so [PDF] (see Edit below).

Edit: The software does not seem to allow for links that have parentheses, so you would need to copy the whole link--including the ".pdf" at the end--in order to actually pull up the document.

Edit Jan 18: Hex-escaped the parentheses so it should work better.

Comment by marks on Tips and Tricks for Answering Hard Questions · 2010-01-18T02:29:51.510Z · LW · GW

Expanding on the go meta point:

Solve many hard problems at once

Whatever solution you give to a hard problem should give insight or be consistent with answers given to other hard problems. This is similar in spirit to: "http://lesswrong.com/lw/1kn/two_truths_and_a_lie/" and a point made by Robin Hanson (Youtube link: the point is at 3:31) "...the first thing to do with puzzles is [to] try to resist the temptation to explain them one at a time. I think the right, disciplined way to deal puzzles is to collect a bunch of them: lay them all out on the table and find a small number of hypotheses that can explain a large number of puzzles at once."

His point as I understand it was that people often narrowly focus on a limited number of health-related puzzles and that we could produce better policy if we attempted to attack many puzzles at once (consider things such as fear of death, the need to show we care, status-regulation, and human social dynamics: particularly signaling loyalty).

Edit: I had originally meant to point out that solving several problems is a meta-thought about solutions to problems: i.e. they should relate to solutions to other problems

Comment by marks on High Status and Stupidity: Why? · 2010-01-12T17:42:45.411Z · LW · GW

From: You and Your Research

When you are famous it is hard to work on small problems. This is what did Shannon in. After information theory, what do you do for an encore? The great scientists often make this error. They fail to continue to plant the little acorns from which the mighty oak trees grow. They try to get the big thing right off. And that isn't the way things go. So that is another reason why you find that when you get early recognition it seems to sterilize you.

Here is another mechanism by which status could make you "stupid", although I'm interpreting stupid in a different sense: as in making one less productive than one otherwise might be. That said, I think the critique could be more general.

It's generally only worth talking about things that we can make progress in understanding, so if you have an inflated sense of what you can accomplish then you might try to think about and discuss things that you cannot advance. So you end up wasting more of your mental effort, and you fall behind in other areas that would have been a better use of your talents.

Comment by marks on Two Truths and a Lie · 2009-12-28T20:02:31.259Z · LW · GW

I think that it should be tested on our currently known theories, but I do think it will probably perform quite well. This is on the basis that it's analogically similar to cross-validation in the way that Occam's Razor is similar to the information criteria (Akaike, Bayes, Minimum Description Length, etc.) used in statistics.

I think that, in some sense, it's the porting over of a statistical idea to the evaluation of general hypotheses.

Comment by marks on Two Truths and a Lie · 2009-12-28T07:45:12.485Z · LW · GW

I think this is cross-validation for tests. There have been several posts on Occam's Razor as a way to find correct theories, but this is the first I have seen on cross-validation.

In machine learning and statistics, a researcher often is trying to find a good predictor for some data, and they often have some "training data" which they can use to select the predictor from a class of potential predictors. Often one has more than one predictor that performs well on the training data, so the question is how else one can choose an appropriate predictor.

One way to handle the problem is to use only a class of "simple predictors" (I'm fudging details!) and then use the best one: that's Occam's razor. Theorists like this approach and usually attach the word "information" to it. The other, "practitioner" approach is to use a bigger class of predictors where you tune some of the parameters on one part of the data and tune other parameters (often hyper-parameters, if you know the jargon) on a separate part of the data. That's the cross-validation approach.

There are some results on the asymptotic equivalence of the two approaches. But what's cool about this post is that I think it offers a way to apply cross-validation to an area where I have never heard it discussed (I think, in part, because it's the method of the practitioner and not so much the theorist--there are exceptions of course!)
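A small sketch of the two approaches side by side on a toy model-selection problem (polynomial degree on simulated data; the penalty and the particular split below are arbitrary choices):

```python
# Toy contrast: pick a polynomial degree by an information criterion
# ("Occam" style) versus by held-out validation (cross-validation style).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=120)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

train, held_out = slice(0, 80), slice(80, 120)
degrees = range(1, 10)

def mse(deg, fit_idx, eval_idx):
    coeffs = np.polyfit(x[fit_idx], y[fit_idx], deg)
    return np.mean((y[eval_idx] - np.polyval(coeffs, x[eval_idx])) ** 2)

n = x.size
# Information-criterion style: fit on everything, penalize model complexity.
aic = {d: n * np.log(mse(d, slice(None), slice(None))) + 2 * (d + 1)
       for d in degrees}
# Cross-validation style: tune the degree on data held out from fitting.
holdout = {d: mse(d, train, held_out) for d in degrees}

print("AIC picks degree:     ", min(aic, key=aic.get))
print("held-out picks degree:", min(holdout, key=holdout.get))
```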

Comment by marks on Bloggingheads: Yudkowsky and Aaronson talk about AI and Many-worlds · 2009-08-18T15:27:55.925Z · LW · GW

I would like to see more discussion on the timing of artificial super intelligence (or human level intelligence). I really want to understand the mechanics of your disagreement.

Comment by marks on Bayesian Flame · 2009-08-05T06:00:25.696Z · LW · GW

One issue with, say, taking a normal distribution and letting the variance go to infinity (which is the improper prior I normally use) is that the posterior distribution is going to have a finite mean, which may not be a desired property of the resulting distribution.

You're right that there's no essential reason to relate things back to the reals, I was just using that to illustrate the difficulty.

I was thinking about this a little over the last few days, and it occurred to me that one model for what you are discussing might actually be an infinite graphical model. The entries of the infinite bidirectional sequence here are the values of Bernoulli-distributed random variables. Probably the most interesting case for you would be a Markov random field, as the stochastic 'patterns' you were discussing may be described in terms of dependencies between random variables.

Here are three papers I read a little while back on the topic of (and related to) something called an Indian Buffet process: (http://www.cs.utah.edu/~hal/docs/daume08ihfrm.pdf) (http://cocosci.berkeley.edu/tom/papers/ibptr.pdf) (http://www.cs.man.ac.uk/~mtitsias/papers/nips07.pdf)

These may not quite be what you are looking for since they deal with a bound on the extent of the interactions; you probably want to think about probability distributions over binary matrices with an infinite number of rows and columns (which would correspond to an adjacency matrix over an infinite graph).

Comment by marks on Bayesian Flame · 2009-08-05T05:42:21.267Z · LW · GW

No problem.

Improper priors are generally only considered in the case of continuous distributions, so 'sum' is probably not the right term; 'integrate' is usually used.

I used the term 'weight' to signify an integral because of how I usually intuit probability measures. Say you have a random variable X that takes values on the real line; the probability that it takes a value in some subset S of the real line would be the integral over S with respect to the given probability measure.

There's a good discussion of this way of viewing probability distributions in the wikipedia article. There's also a fantastic textbook on the subject that really has made a world of difference for me mathematically.

Comment by marks on Open Thread: August 2009 · 2009-08-01T19:09:14.343Z · LW · GW

I think you're making an important point about the uncertainty of what impact our actions will have. However, I think the right way to go about handling this issue is to put a bound on which impacts of our actions are likely to be significant.

As an extreme example, I think I have seen much evidence that clapping my hands once right now will have essentially no impact on the people living in Tripoli. Very likely clapping my hands will only affect myself (as no one is presently around) and probably in no huge way.

I have not done a formal statistical model to assess the significance, but I can probably state that the significance is relatively low. If we can analyze which events are causally significant for others, then we would certainly make the moral inference problem much simpler.

Comment by marks on Open Thread: August 2009 · 2009-08-01T16:33:44.081Z · LW · GW

There's another issue too, which is that it is extraordinarily complicated to assess what the ultimate outcome of a particular behavior is. I think this opens up a statistical question of what kinds of behaviors are "significant", in the sense that, if you are choosing between A and B, is it possible to distinguish A from B or are they approximately the same.

In some cases they won't be, but I think that in very many they will.

Comment by marks on Bayesian Flame · 2009-07-29T04:11:17.605Z · LW · GW

What topology are you putting on this set?

I made the point about the real numbers because it shows that putting a non-informative prior on the infinite bidirectional sequences should be at least as hard as for the real numbers (which is non-trivial).

Usually a regularity is defined in terms of a particular computational model, so if you picked Turing machines (or the variant that works with bidirectional infinite tape, which is basically the same class as infinite tape in one direction), then you could instead begin constructing your prior in terms of Turing machines. I don't know if that helps any.

Comment by marks on Bayesian Flame · 2009-07-29T04:02:44.532Z · LW · GW

You can actually simulate a tremendous number of distributions (and theoretically any, to an arbitrary degree of accuracy) by applying an approximate inverse CDF to a standard uniform random variable; see here for example. So the space of distributions from which you could select to do your test is potentially infinite. We can then think of your selection of a probability distribution as being a random experiment and model your selection process using a probability distribution.
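A minimal sketch of that inverse-CDF trick, using a target whose inverse CDF is available in closed form (an exponential; the rate is an arbitrary choice):

```python
# Inverse-transform sampling: push standard-uniform draws through the
# inverse CDF of the target distribution, here Exponential(rate).
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)          # standard uniform draws

rate = 2.0
samples = -np.log(1.0 - u) / rate      # closed-form inverse CDF

print(f"sample mean {samples.mean():.3f} (theory {1 / rate:.3f})")
print(f"sample var  {samples.var():.3f} (theory {1 / rate**2:.3f})")
```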

The issue is that since the outcome space is the space of all computable probability distributions, Bayesians will have consistency problems (another good paper on the topic is here), i.e. the posterior distribution won't converge to the true distribution. So in this particular setup I think Bayesian methods are inferior unless one could devise a good prior over distributions. I suppose that if I knew you didn't know how to sample from arbitrary probability distributions, and I put that in my prior, then I might be able to use Bayesian methods to successfully estimate the probability distribution (the discussion of the Bayesian who knew you personally was meant to be tongue-in-cheek).

In the frequentist case there is a known procedure due to Parzen from the 1960s.

All of these are asymptotic results, however, and your experiment seems to be focused on very small samples. To the best of my knowledge there aren't many results in this case except under special conditions. I would state that without more constraints on the experimental design I don't think you'll get very interesting results. That said, I am actually really in favor of such evaluations, because people in statistics and machine learning, for a variety of reasons, don't do them, or don't do them on a broad enough scale. Anyway, if you actually are interested in such things you may want to start looking here, since statistics and machine learning both have the tools to properly design such experiments.
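For concreteness, here is a sketch of the kind of frequentist procedure the Parzen reference points to, a kernel (Parzen-window) density estimate; the bandwidth and the sample are invented:

```python
# Parzen-window (kernel) density estimate: average a Gaussian bump centered
# at each observation.  The bandwidth below is just a guess.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=0.5, size=25)    # a small sample

def parzen_density(x, data, bandwidth=0.3):
    z = (x[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

grid = np.linspace(-1.0, 3.0, 9)
for g, d in zip(grid, parzen_density(grid, data)):
    print(f"x = {g:+.2f}   estimated density = {d:.3f}")
```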

Comment by marks on Bayesian Flame · 2009-07-28T15:49:46.132Z · LW · GW

In finite-dimensional parameter spaces, sure, this makes perfect sense. But suppose that we are considering a stochastic process X1, X2, X3, ... where Xn follows a distribution Pn over the integers. Now put a prior on the distribution and suppose that, unbeknownst to you, Pn is the distribution that puts 1/2 probability weight on -n and 1/2 probability weight on n. If the prior on the stochastic process does not put increasing weight on integers with large absolute value, then in the limit the prior puts zero probability weight on the true distribution (and may start behaving strangely quite early on in the process).

Another case is that the true probability model may be too complicated to write down or computationally infeasible to do so (say a Gaussian mixture with 10^(10) mixture components, which is certainly reasonable in a modern high-dimensional database), so one may only consider probability distributions that approximate the true distribution and put zero weight on the true model, i.e. it would be sensible in that case to have a prior that may put zero weight on the true model and you would search only for an approximation.

Comment by marks on Bayesian Flame · 2009-07-28T07:23:17.578Z · LW · GW

There's a difficulty with your experimental setup in that you implicitly are invoking a probability distribution over probability distributions (since you represent a random choice of a distribution). The results are going to be highly dependent upon how you construct your distribution over distributions. If your outcome space for probability distributions is infinite (which is what I would expect), and you sampled from a broad enough class of distributions then a sampling of 25 data points is not enough data to say anything substantive.

A friend of yours who knows what distributions you're going to select from, though, could incorporate that knowledge into a prior and then use that to win.

So, I predict that for your setup there exists a Bayesian who would be able to consistently win.

But if you gave much more data and you sampled from a rich enough set of probability distributions that priors would become hard to specify, a frequentist procedure would probably win out.

Comment by marks on Bayesian Flame · 2009-07-28T07:06:50.406Z · LW · GW

I think what Shalizi means is that a Bayesian model is never "wrong", in the sense that it is a true description of the current state of the ideal Bayesian agent's knowledge. I.e., if A says an event X has probability p, and B says X has probability q, then they aren't lying even if p!=q. And the ideal Bayesian agent updates that knowledge perfectly by Bayes' rule (where knowledge is defined as probability distributions of states of the world). In this case, if A and B talk with each other then they should probably update, of course.

In frequentist statistics the paradigm is that one searches for the 'true' model by looking through a space of 'false' models. In this case if A says X has probability p and B says X has probability q != p then at least one of them is wrong.

Comment by marks on Bayesian Flame · 2009-07-28T06:40:29.620Z · LW · GW

I suppose it depends what you want to do. First, I would point out that the set is in bijection with the real numbers (think of two simple injections and then use Cantor–Bernstein–Schroeder), so you can use any prior over the real numbers. The fact that you want to look at infinite sequences of 0s and 1s seems to imply that you are considering a specific type of problem that would demand a very particular meaning of 'non-informative prior'. What I mean by that is that any non-informative prior usually incorporates some kind of invariance: e.g. a uniform prior on [0,1] for a Bernoulli distribution is invariant with respect to the true value being anywhere in the interval.

Comment by marks on Bayesian Flame · 2009-07-28T06:33:00.159Z · LW · GW

This isn't always the case if the prior puts zero probability weight on the true model. This can be avoided on finite outcome spaces, but for infinite outcome spaces no matter how much evidence you have you may not overcome the prior.

Comment by marks on Bayesian Flame · 2009-07-28T06:27:26.400Z · LW · GW

I've had some training in Bayesian and frequentist statistics, and I think I know enough to say that it would be difficult to give a "simple" and satisfying example. The reason is that if one is dealing with finite-dimensional statistical models (where the parameter space of the model is finite-dimensional) and one has chosen a prior for those parameters that puts non-zero weight on the true values, then the Bernstein-von Mises theorem guarantees that the Bayesian posterior distribution and the maximum likelihood estimate converge to the same probability distribution (although you may need to use improper priors). This covers cases where we consider finite outcomes, such as tossing a coin or rolling a die.
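A toy illustration of that agreement in the simplest case, a coin with a uniform Beta(1, 1) prior (an illustration only, not a proof of the theorem):

```python
# For a coin with a Beta(1, 1) prior, the posterior mean and the maximum
# likelihood estimate converge to each other (and to the true bias) as data
# accumulate.
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3

for n in (10, 100, 10_000):
    heads = rng.binomial(1, true_p, size=n).sum()
    mle = heads / n                             # frequentist estimate
    posterior_mean = (heads + 1) / (n + 2)      # mean of Beta(1 + heads, 1 + tails)
    print(f"n = {n:>6}   MLE = {mle:.4f}   posterior mean = {posterior_mean:.4f}")
```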

I apologize if that's too much jargon, but for really simple models that are easy to specify you tend to get the same answer. Bayesian statistics starts to behave differently from frequentist statistics in noticeable ways when you consider infinite outcome spaces. An example here might be where you are considering probability distributions over curves (this arises in my research on speech recognition). In this case, even if you have a seemingly sensible prior, you can end up, in the limit of infinite data, with a posterior distribution that is different from the true distribution.

In practice, if I am learning a Gaussian Mixture Model for speech curves and I don't have much data, then Bayesian procedures tend to be a bit more robust and frequentist procedures end up over-fitting (or being somewhat random). When I start getting more data, frequentist methods tend to be algorithmically more tractable and get better results. So I'll end up with faster computation time and, say, on the task of phoneme recognition I'll make fewer errors.

I'm sorry if I haven't explained it well; the difference in performance wasn't really evident to me until I spent some time actually using them in machine learning. Unfortunately, most of the disadvantages of Bayesian approaches aren't evident for simple statistical problems, but they become all too evident in the case of complex statistical models.

Comment by marks on AndrewH's observation and opportunity costs · 2009-07-25T17:38:07.730Z · LW · GW

I am uneasy with that sentiment, although I'm having a hard time putting my finger on exactly why. But this is how I see it: there are vastly more people in the world than I could possibly ever help, and some of them are so poor and downtrodden that they spend most of their money on food since they can't afford luxuries such as drugs. Eventually, I might give money to the drug user if I had solved all the other problems first, but I would prefer my money to be spent on something more essential for survival before I turn to subsidizing people's luxury spending.

Imposing my values on somebody seems to more aptly describe a situation where I use authority to compel the drug user to not use drugs.

Comment by marks on AndrewH's observation and opportunity costs · 2009-07-25T17:06:17.081Z · LW · GW

Would a simple solution to this be to, say, plan a date each year to give away some quantity of money? You could keep a record of all the times you gave money to a beggar, or you could use a simple model to estimate how much you probably would have given, and then send that amount to a worthwhile charity.

When I get more money that's what I plan on doing.

Comment by marks on Sayeth the Girl · 2009-07-20T02:11:43.594Z · LW · GW

> Also, I'd like to note that the post here included nigh-Yudkowskian levels of cross-linking to other material on LW. When we're talking about "conversation norms on LW", how is that not solid data?

The evidence presented is a number of anecdotes from LW conversation. A full analysis of LW would need to categorize different types of offending comments, and discuss their frequency and what role they play in LW discussion. Even better would be to identify who does them, etc.

Although I do find it plausible that LW should enact a policy of altering present discussions of gender, I certainly will not say the evidence presented is "overwhelming".