Posts

Thoughts on the good regulator theorem 2022-08-11T12:08:28.897Z
JonasMoss's Shortform 2022-03-02T14:07:24.916Z
Ordinary and unordinary decision theory 2022-03-02T11:39:30.522Z

Comments

Comment by JonasMoss on On infinite ethics · 2022-04-19T18:35:57.197Z · LW · GW

The number of elements in $A$ won't change when removing every other element from it. The cardinality of $A$ is countable, and when you remove every other element, it is still countable, and indistinguishable from $A$. If you're unconvinced, ask yourself how many elements $A$ with every other element removed contains. The set is certainly not larger than $A$, so it's at most countable. But it's certainly not finite either. Thus you're dealing with a set of countably many 0s. As there is only one such multiset, $A$ equals $A$ with every other element removed.

That there is only one such multiset follows from the definition of a multiset as a set of pairs $(x, \kappa_x)$, where $x$ is an element and $\kappa_x$ is its cardinality. It would also be true if we define multisets using sets containing all the pairs $(x, i)$, one pair for each copy $i$ of the element $x$ -- provided we ignore the identity of each pair. I believe this is where our disagreement lies. I ignore identities, working only with sets. I think you want to keep the identities intact. If we keep the identities, $A$ with every other element removed is not equal to $A$, and my argument (as it stands) fails.

Comment by JonasMoss on On infinite ethics · 2022-04-19T13:59:09.874Z · LW · GW

I don't understand what you mean. The upgraded individuals are better off than the non-upgraded individuals, with everything else staying the same, so it is an application of Pareto.

Now, I can understand the intuition that (a) and (b) aren't directly comparable due to the identities of the individuals. That's what I meant by the caveat "(Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)"

Comment by JonasMoss on On infinite ethics · 2022-04-19T12:25:08.608Z · LW · GW

Pareto: If two worlds (w1 and w2) contain the same people, and w1 is better for an infinite number of them, and at least as good for all of them, then w1 is better than w2.

As far as I can see, the Pareto principle is not just incompatible with the agent-neutrality principle, it's incompatible with set theory itself. (Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)

Let's take a look at, for instance, $A \sqcup B$ vs $A \sqcup C$, where $A$ is the multiset containing countably many 0s, $B$ the multiset containing countably many 1s, $C$ the multiset containing countably many 2s, and $\sqcup$ is the disjoint union. Now consider the following scenarios:

(a) Start out with $A \sqcup B$ and multiply every utility by 2 to get $A \sqcup C$. Since infinitely many people are better off and no one is worse off, $A \sqcup C \succ A \sqcup B$.

(b) Start out with $A \sqcup B$ and take every other of the 0-utilities from $A$ and change them to 1. Since a copy of $A$ is still left over, this operation leaves us with $A \sqcup B \sqcup B = A \sqcup B$. Again, since infinitely many are better off and no one worse off, $A \sqcup B \succ A \sqcup B$.

In conclusion, both $A \sqcup C \succ A \sqcup B$ and $A \sqcup B \succ A \sqcup B$, a contradiction: if worlds are just multisets of utilities, Pareto makes a world strictly better than itself.
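
The multiset bookkeeping behind (b) can be spelled out, writing $\{x : \kappa\}$ for the multiset containing $x$ with multiplicity $\kappa$. Disjoint union adds multiplicities, and $\aleph_0 + \aleph_0 = \aleph_0$, so

$$B \sqcup B = \{1 : \aleph_0\} \sqcup \{1 : \aleph_0\} = \{1 : \aleph_0 + \aleph_0\} = \{1 : \aleph_0\} = B,$$

and likewise $A$ with every other element removed is $\{0 : \aleph_0\} = A$.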

Comment by JonasMoss on On infinite ethics · 2022-04-19T12:24:33.098Z · LW · GW

Comment by JonasMoss on A Bayesian Aggregation Paradox · 2022-03-08T09:38:21.890Z · LW · GW

Okay, thanks for the clarification! Let's see if I understand your setup correctly. Suppose we have the probability measures $P$ and $Q$, where $Q$ is the probability measure of the expert. Moreover, we have an outcome $A$ of interest.

In your post, you use the expert's probability of $A$ given $E$, where $E$ is an unknown outcome known only to the expert. To use Bayes' rule, we must make the assumption that your conditionals agree with the expert's, i.e., $P(A \mid E) = Q(A \mid E)$. This assumption doesn't sound right to me, but I suppose some strange assumption is necessary for this simple framework. In this model, I agree with your calculations.

"Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information."

I'm not sure. When we're looking directly at the probability of an event (instead of the probability of the probability of an event), things get much simpler than I thought.

Let's see what happens to the likelihood when you aggregate from the expert's point of view. Letting $B = A_1 \cup A_2$ be the aggregated outcome, with $A_1$ and $A_2$ disjoint, we need to calculate the expert's likelihoods $Q(E \mid B)$ and $Q(E \mid B^c)$. In this case,

$$Q(E \mid B) = \frac{Q(E \mid A_1)\,Q(A_1) + Q(E \mid A_2)\,Q(A_2)}{Q(A_1) + Q(A_2)},$$

which is essentially your calculation, but from the expert's point of view. The likelihood depends on $Q(A_1)$ and $Q(A_2)$, the prior of the expert, which is unknown to you. That shouldn't come as a surprise, as he needs to use his prior in order to combine the probabilities of the events $A_1$ and $A_2$.

But the calculations are exactly the same from your point of view, leading to

$$P(E \mid B) = \frac{P(E \mid A_1)\,P(A_1) + P(E \mid A_2)\,P(A_2)}{P(A_1) + P(A_2)}.$$

Now, suppose we want to ensure, in general, that $P(E \mid B) = Q(E \mid B)$, which is what I believe you want to do, and which seems pretty natural to do, at least since we're allowed to assume that $P(E \mid A) = Q(E \mid A)$ for all simple events $A$. To ensure this, we will probably have to require that your priors are the same as the expert's. In other words, your joint distributions are equal, or $P = Q$.

Do you agree with this summary?

Comment by JonasMoss on Harms and possibilities of schooling · 2022-03-07T20:34:51.277Z · LW · GW

Do you have a link to the research about the effect of a Bachelor of Education degree?

Comment by JonasMoss on A Bayesian Aggregation Paradox · 2022-03-07T20:30:12.841Z · LW · GW

I find the beginning of this post somewhat strange, and I'm not sure your post proves what you claim it does. You start out discussing what appears to be a combination of two forecasts, but present it as Bayesian updating. Recall that Bayes' theorem says $p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)$. To use this theorem, you need both an $x$ (your data / evidence) and a $\theta$ (your parameter). Using “posterior $\propto$ prior $\times$ likelihood” (with prior $p(\theta)$ and likelihood $p(x \mid \theta)$), you're talking as if your expert's probabilities equal the likelihood $p(x \mid \theta)$ – but is that true in any sense? A likelihood isn't just something you multiply with your prior; it is a conditional pmf or pdf with a different outcome than your prior.

I can see two interpretations of what you're doing at the beginning of your post:

  1. You're combining two forecasts. That is, with $x$ being the outcome, you have your own pmf $p(x)$ and the expert's $q(x)$, then combine them using the normalized product $p(x)q(x)$. That's fair enough, but I suppose the geometric pool $\sqrt{p(x)q(x)}$ (suitably normalized), or maybe $p(x)^{1-w}q(x)^{w}$ for some weight $w$, would be a better way to do it.
  2. It might be possible to interpret your calculations as a proper application of Bayes' rule, but that requires stretching it. Suppose $p$ is your subjective probability vector for the outcomes $x_1, \ldots, x_n$ and $q$ is the subjective probability vector for the outcomes supplied by an expert (the value of $q$ is unknown to us beforehand). To use Bayes' rule, we will have to say that the evidence vector is $P(q \mid x_i)$, the probability of observing an expert judgment of $q$ given that $x_i$ is true. I'm not sure we ever observe such quantities directly, and it is pretty clear from your post that you're talking about $q$ in the sense used above, not $P(q \mid x_i)$.

Assuming interpretation 1, the rest of your calculations are not that interesting, as you're using a method of knowledge pooling no one advocates.

Assuming interpretation 2, the rest of your calculations are probably incorrect. I don't think there is a unique way to go from the pair $(p, q)$ over the original outcomes to, let's say, a corresponding pair over the aggregated outcomes, where $q$ is the expert's probability vector over $x_1, \ldots, x_n$ and $p$ is your probability vector over the same outcomes.
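
For what it's worth, here is what interpretation 1 looks like in a minimal numerical sketch (the probability vectors and the weight are made up for illustration): multiplying the two forecasts and renormalizing, next to a weighted geometric pool.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # your pmf over three outcomes (made-up numbers)
q = np.array([0.2, 0.2, 0.6])  # the expert's pmf over the same outcomes

# Interpretation 1: multiply the forecasts and renormalize.
product_pool = p * q / np.sum(p * q)

# A weighted geometric pool for comparison; w is an arbitrary weight on the expert.
w = 0.5
geometric_pool = p ** (1 - w) * q ** w
geometric_pool /= geometric_pool.sum()

print(product_pool)    # pooled forecast under interpretation 1
print(geometric_pool)  # pooled forecast under the geometric rule
```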

Comment by JonasMoss on Harms and possibilities of schooling · 2022-03-07T19:04:01.282Z · LW · GW

Children became grown-ups 200 years ago too. I don't think we need to teach them anything at all, much less anything in particular.

According to this SSC post, kids can easily catch up in math even if they aren't taught any math at all in the first 5 years of school.

In the Benezet experiment, a school district taught no math at all before 6th grade (around age 10-11). Then in sixth grade, they started teaching math, and by the end of the year, the students were just as good at math as traditionally-educated children with five years of preceding math education.

That would probably work for reading too, I guess. (Reading appears to require more purpose-built brain circuitry than math. At least I got that impression from reading Henrich's WEIRD. I don't have any references though.)

Comment by JonasMoss on Magna Alta Doctrina · 2022-03-07T17:52:43.984Z · LW · GW

I found this post interesting, especially the first part, but extremely difficult to understand (yeah, that hard). I believe some of the analogies might be valuable, but it's simply too hard for me to confirm / disconfirm most of them. Here are some (but far from all!) examples:

1. About local optimizers. I didn't understand this section at all! Are you claiming that gradient descent isn't a local optimizer? Or are you claiming that neural networks can implement mesa-optimizers? Or something else?

2. The analogy to Bayesian reasoning feels forced and unrelated to your other points in the Bayes section. Moreover, Bayesian statistics typically doesn't work (it's inconsistent) when you ignore the normalizing constant. And in the case of neural networks, what is your prior? Unless you're thinking about approximate priors using weight decay, most neural networks do not employ priors on their parameters.

3. In your linear model, you seem to interpret the maximum likelihood estimator of the parameters as a Bayesian estimator. Am I on the right track here?

4. Building on your linear toy model, it is natural to understand the weight decay parameters as priors, as that is what they are. (In an exact sense: with L2 weight decay you're looking at ridge regression, which is linear regression with normal priors on the parameters; L1 weight decay corresponds to Laplace priors, etc. See the sketch below.) But you don't do that. In what sense is it true that "the bayesian prior could be encoded purely in the initial weight distribution"? What's more, it seems to me you're thinking about the learning rate as your prior. I think this has something to do with your interpretation of the linear model's maximum likelihood estimator as a Bayesian procedure...?
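
A minimal sketch of the ridge-regression point, in case it is useful (the data are simulated and the numbers arbitrary): the L2-penalized least-squares solution coincides with the MAP estimate under independent normal priors on the coefficients, with penalty $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 1.0, 2.0  # noise sd and prior sd, chosen arbitrarily
y = X @ beta_true + sigma * rng.normal(size=n)

lam = sigma**2 / tau**2  # ridge penalty implied by the normal prior

# Ridge / L2 weight decay: argmin ||y - Xb||^2 + lam * ||b||^2.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP under y ~ N(Xb, sigma^2 I) and b ~ N(0, tau^2 I): same normal equations.
beta_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(d) / tau**2, X.T @ y / sigma**2)

assert np.allclose(beta_ridge, beta_map)
print(beta_ridge)
```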

Comment by JonasMoss on Ordinary and unordinary decision theory · 2022-03-04T20:57:06.755Z · LW · GW

I disagree. Sometimes the payoffs themselves also change when you change your action space (in the informal description of the problem). That is the point of the last example, where precommitment changes the possible payoffs, not only restricts the action space.

Comment by JonasMoss on Ordinary and unordinary decision theory · 2022-03-03T09:29:58.708Z · LW · GW

Paradoxical decision problems are paradoxical in the colloquial sense (such as Hilbert's hotel or Bertrand's paradox), not the literal sense (such as "this sentence is false"). Paradoxicality is in the eye of the beholder. Some people think Newcomb's problem is paradoxical, some don't. I agree with you and don't find it paradoxical.

Comment by JonasMoss on JonasMoss's Shortform · 2022-03-02T16:26:29.702Z · LW · GW

Ah! Edited version: "there's no *obvious* distribution $p(a, x)$" (which could have been "natural distribution" or "canonical distribution"). The point is that you need more information than what should be sufficient (the effect of the action) to do evidential decision theory.

Comment by JonasMoss on JonasMoss's Shortform · 2022-03-02T14:07:25.192Z · LW · GW

Evidential decision theory boggles my mind.

I have some sympathy for causal decision theory, especially when the causal description matches reality. But evidential decision theory is 100% bonkers.

The most common argument against evidential decision theory is that it does not care about the consequences of your action. It cares about correlation (broadly speaking), not causality, and acts as if both were the same. This argument is sufficient to thoroughly discredit evidential decision theory, but philosophers keep giving it screen time.

Even if we lived in a world where correlation and causality were always the same (if that is possible), evidential decision theory would be wrong. Why? Because evidential decision theory requires distributions over actions and outcomes.

When you're acting in a decision problem, your action will often, or even usually, be unique. No one has ever done that kind of action before. Consequently, there's no obvious distribution $p(a, x)$ over the action $a$ and outcome $x$. But evidential decision theory requires such a distribution to function! Now you'll have to bootstrap your way to a distribution $p(a, x)$, flexing your philosophical creativity muscles. I suppose you could make the conditional $p(x \mid a)$ a point mass at the actual outcome of doing action $a$, at least when that outcome is deterministic. But why? You'll just introduce probabilities where none are needed.
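
To make the bookkeeping concrete, here is a toy sketch (the joint distribution and utilities are invented): evidential decision theory ranks actions by $E[U \mid A = a]$, which requires the whole joint distribution $p(a, x)$, including the marginal over your own actions.

```python
import numpy as np

# Invented joint distribution p(a, x): rows are actions, columns are outcomes.
p_joint = np.array([[0.36, 0.04],
                    [0.12, 0.48]])
utility = np.array([0.0, 1.0])  # utility of each outcome (made up)

p_action = p_joint.sum(axis=1)             # marginal p(a), also needed
p_x_given_a = p_joint / p_action[:, None]  # conditional p(x | a)

# EDT: pick the action with the largest conditional expected utility E[U | A = a].
edt_values = p_x_given_a @ utility
print(edt_values, int(np.argmax(edt_values)))
```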

Comment by JonasMoss on Do, Then Think · 2022-02-23T19:18:04.302Z · LW · GW

Just like this classic!  https://slatestarcodex.com/2014/03/24/should-you-reverse-any-advice-you-hear/

Comment by JonasMoss on Theses on Sleep · 2022-02-23T13:41:51.084Z · LW · GW

About that paper.

The p-values relevant for testosterone are on the weak side, with one of them at 0.049 (which screams p-hacking) and another at 0.02 (also really shitty). A reasonable back-of-the-envelope method to correct for p-hacking and publication bias is to multiply the p-values by 20 (the reasoning is not super-involved: think about what happens to the truncated normal distribution in the case of complete publication bias); in that case, none of the testosterone-related p-values in said paper are significant. I feel comfortable ignoring it.
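
Applying that back-of-the-envelope correction to the two p-values mentioned (the factor 20 is just the rough multiplier from above, nothing more precise):

```python
p_values = [0.049, 0.02]  # the testosterone-related p-values
corrected = [min(1.0, 20 * p) for p in p_values]
print(corrected)  # [0.98, 0.4] -- nowhere near significance after the correction
```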

Comment by JonasMoss on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-19T07:35:12.366Z · LW · GW

It's a game, just a trivial one. Snakes and Ladders is also a game, and its payoff matrix is similar to this one, just with a little bit of randomness involved.

My intuition says that this game not only has maximal alignment, but is the only game (up to equivalence) with maximal alignment for any pair of strategies $(p, q)$. No matter what player 1 and player 2 do, the world is as good as it could be.

The case can be compared to $R^2$ when the variance of the dependent variable is 0. How much of the variance in the dependent variable does the independent variable explain in this case? I'd say it's all of it.

Comment by JonasMoss on Variables Don't Represent The Physical World (And That's OK) · 2021-06-18T10:10:17.119Z · LW · GW

This reminds me of the propensity of social scientists to drop inference when studying the entire population, claiming that confidence intervals do not make any sense when we have every single existing data point. But confidence intervals do make sense even then, as the entire observed population isn't equal to the theoretical population. The observed population does not give us exact knowledge about any properties of the data generating mechanism, except in edge cases. 

(Not that confidence intervals are very useful when looking at linear regressions with millions of data points anyway, but make sure to have your justification right.)

Comment by JonasMoss on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-18T08:41:31.723Z · LW · GW

I believe the upper right-hand corner of the alignment matrix shouldn't be 1; even if both players are acting in each other's best interest, they are not acting in their own best interest. And alignment is about having both at the same time. The configuration of the Prisoner's dilemma makes it impossible to make both players maximally satisfied at the same time, so I believe it cannot have maximal alignment for any strategy.

Anyhow, your concept of alignment might involve altruism only, which is fair enough. In that case, Vanessa Kosoy has a proposal similar to mine, but not working with sums, which probably does exactly what you are looking for.

Getting the alignment of the upper right-hand corner of the Prisoner's dilemma matrix to be 1 may be possible if we redefine $M$ to $\max_{p, q} p^\top (A + B) q$, the best attainable payoff sum. But then zero-sum games will have maximal instead of minimal alignment! (This is one reason why I defined $M$ as the best payoff sum attainable when the players' payoffs are independent.)
 

(Btw, the coefficient isn't symmetric; it's only symmetric for symmetric games. No alignment coefficient depending on the strategies can be symmetric, as the vectors can have different lengths.)

Comment by JonasMoss on Open problem: how can we quantify player alignment in 2x2 normal-form games? · 2021-06-17T13:18:00.757Z · LW · GW

Alright, here comes a pretty detailed proposal! The idea is to find out if the sum of expected utility for both players is “small” or “large” using the appropriate normalizers.

First, let's define some quantities. (I'm not overly familiar with game theory, and my notation and terminology are probably non-standard. Please correct me if that's the case!)

  •  $A$: the payoff matrix for player 1.
  •  $B$: the payoff matrix for player 2.
  •  $p$ and $q$: the mixed strategies for players 1 and 2. These are probability vectors, i.e., vectors of non-negative numbers summing to 1.

Then the expected payoff for player 1 is the bilinear form $p^\top A q$ and the expected payoff for player 2 is $p^\top B q$. The sum of payoffs is $p^\top (A + B) q$.

But we're not done defining stuff yet. I interpret alignment to be about welfare. Or how large the sum of utilities is when compared to the best-case scenario and the worst-case scenario. To make an alignment coefficient out of this idea, we will need

  •  $m$: This is the lower bound to the sum of payoffs, $m = \min_{p, q} p^\top (A + B) q$, where $p, q$ are probability vectors. Evidently, $m$ equals the smallest entry of $A + B$.
  •  $M$: The upper bound to the sum of payoffs in the counterfactual situation where the payoff to player 1 is not affected by the actions of player 2, and vice versa. Then $M = \max_{ij} A_{ij} + \max_{ij} B_{ij}$. Now we find that $m \leq p^\top (A + B) q \leq M$.

Now define the alignment coefficient of the strategies $(p, q)$ in the game defined by the payoff matrices $(A, B)$ as

$$\alpha(p, q) = \frac{p^\top (A + B) q - m}{M - m}.$$

The intuition is that alignment quantifies how the expected payoff sum $p^\top (A + B) q$ compares to the best possible payoff sum $M$ attainable when the payoffs are independent. If they are equal, we have perfect alignment ($\alpha = 1$). On the other hand, if $p^\top (A + B) q = m$, the expected payoff sum is as bad as it could possibly be, and we have minimal alignment ($\alpha = 0$).

The only problem is that $M = m$ makes the denominator equal to 0; but in this case, $p^\top (A + B) q = m$ as well, which I believe means that defining $\alpha = 1$ is correct. (It's also true that the numerator is 0 in this case, but I don't think this matters too much. The players get the best possible outcome no matter how they play, which deserves $\alpha = 1$.) This is an extreme edge case, as it only holds for the special payoff matrices $A$ ($B$) that contain the same element $a$ ($b$) in every cell.
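
In code, the definition looks like this (my own small sketch of the formulas above, with a matching-pennies game as a quick check of the zero-sum case):

```python
import numpy as np

def alignment(A, B, p, q):
    """Alignment coefficient of strategies (p, q) for payoff matrices A and B."""
    A, B, p, q = map(np.asarray, (A, B, p, q))
    total = p @ (A + B) @ q   # expected payoff sum p^T (A + B) q
    m = (A + B).min()         # worst possible payoff sum
    M = A.max() + B.max()     # best sum if the payoffs were independent
    if M == m:                # constant-payoff edge case: alpha defined as 1
        return 1.0
    return (total - m) / (M - m)

# Matching pennies (zero-sum): alignment is 0 for every strategy pair.
A = np.array([[1, -1], [-1, 1]])
B = -A
print(alignment(A, B, [0.5, 0.5], [0.5, 0.5]))  # 0.0
print(alignment(A, B, [1.0, 0.0], [0.0, 1.0]))  # 0.0
```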

Let's look at some properties:

  • A pure coordination game has at least one maximal alignment equilibrium, i.e., $\alpha(p, q) = 1$ for some $(p, q)$. All of these are necessarily Nash equilibria.
  • A zero-sum game (that isn't game-theoretically equivalent to the 0 matrix) has $\alpha(p, q) = 0$ for every pair of strategies $(p, q)$. This is because $p^\top (A + B) q = 0 = m$ for every $(p, q)$. The total payoff is always the worst possible.
  • The alignment coefficient is invariant under positive linear transformations of the payoffs in a specific sense, i.e., $\alpha_{cA + a\mathbf{1},\, cB + b\mathbf{1}}(p, q) = \alpha_{A, B}(p, q)$ for every $c > 0$, where $\mathbf{1}$ is the matrix consisting of only 1s.

Now let's take a look at a variant of the Prisoner's dilemma with joint payoff matrix

$$(A, B) = \begin{pmatrix} (3, 3) & (0, 5) \\ (5, 0) & (1, 1) \end{pmatrix},$$

where the first strategy is cooperate and the second is defect. Then $m = \min(A + B) = 2$ and $M = \max A + \max B = 10$.

The alignment coefficient at (cooperate, cooperate) is

$$\alpha = \frac{(3 + 3) - 2}{10 - 2} = \frac{1}{2}.$$

Assuming pure strategies, we find the following matrix of alignment, where entry $(i, j)$ is the alignment when player 1 plays $i$ with certainty and player 2 plays $j$ with certainty:

$$\begin{pmatrix} 1/2 & 3/8 \\ 3/8 & 0 \end{pmatrix}.$$

Since (defect, defect) is the only Nash equilibrium, the “alignment at rationality” is 0. By taking convex combinations, the range of alignment coefficients is $[0, 1/2]$.
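
As a quick check of these numbers, using the payoff matrices above:

```python
import numpy as np

A = np.array([[3, 0], [5, 1]])  # row player: cooperate first, then defect
B = np.array([[3, 5], [0, 1]])  # column player
m = (A + B).min()               # 2
M = A.max() + B.max()           # 10

pure = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # C and D
align = [[(p @ (A + B) @ q - m) / (M - m) for q in pure] for p in pure]
print(np.array(align))          # [[0.5, 0.375], [0.375, 0.]]
```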

Some further comments:

  • Any general alignment coefficient probably has to be a function of the strategies $(p, q)$, as we need to allow them to vary when doing game theory.
    • Specialized coefficients would only report the alignment at Nash equilibria, maybe the maximal Nash equilibrium.
    • One may report the maximal alignment without caring about equilibrium points, but then the strategies do not have to be in equilibrium, which I am uneasy with. The maximal alignment for the Prisoner's dilemma is 1/2, but does this matter? Not if we want to quantify the tendency for rational actors to maximize their total utility, at least.
  • Using e.g. the correlation between the payoffs is not a good idea, as it implicitly assumes the uniform distribution on the pure strategy profiles. And why would you do that?