## Posts

## Comments

**AlexMennen**on Jimrandomh's Shortform · 2021-07-21T07:00:39.884Z · LW · GW

Examples 3 and 1 are justified by the VNM theorm and Harsanyi's utilitarian theorem, respectively. I agree that 2 and 4 are wrong.

**AlexMennen**on Troll Bridge · 2021-07-18T20:27:21.530Z · LW · GW

Not good enough. If I know that the bridge isn't safe, and that I'm not going to cross it, then I know that the bridge is safe iff I'm going to cross it, but not crossing it is still the correct decision.

**AlexMennen**on Troll Bridge · 2021-07-12T22:58:03.833Z · LW · GW

Suppose the bridge is safe iff there's a proof that the bridge is safe.

Then you can prove the bridge is safe without any reference to your own actions.

**AlexMennen**on Your Dog is Even Smarter Than You Think · 2021-05-01T17:57:29.344Z · LW · GW

This sort of thing seems to suggest that EY's claims in this post about the scale of the relative intelligence differences between chimps, a village idiot, and Einstein is incorrect. The difference in intelligence between village idiot and Einstein may be comparable to the difference in intelligence between some nonhuman animals and a human village idiot. Which is a priori surprising, given that human brains are very structurally similar to each other in comparison to nonhuman animal brains.

**AlexMennen**on Dutch-Booking CDT: Revised Argument · 2021-04-11T22:12:33.654Z · LW · GW

I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don't think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price if , even if is not the optimal action (let's say is the optimal action). When the time comes to take an action, the agent's best bet is (prime meaning sell the contract for price ). The way I described the set-up, the agent doesn't choose between and , because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn't a Dutch book, because the case where the agent's top choice goes through is the case in which the contract is worthless, and the contract's value is derived from other cases.

We could modify the epsilon exploration assumption so that the agent also chooses between and even while its top choice is . That is, there's a lower bound on the probability with which the agent takes an action in , but even if that bound is achieved, the agent still has some flexibility in distributing probability between and . In this case, contrary to your argument, the agent will prefer rather than , i.e., it will not get Dutch booked. This is because the agent is still choosing as the only action with high probability, and refers to the expected consequence of the agent choosing as its intended action, so the agent cannot use when calculating which of or is better to pick as its next choice if its attempt to implement intended action fails.

Another source of uncertainty that the agent could have about its actions is if it believes it could gain information in the future, but before it has to make a decision, and this information could be relevant to which decision it makes. Say that and are the agent's expectations at time of the utility that taking action would cause it to get, and the utility it would get conditional on taking action , respectively. Suppose the bookie offers the deal at time , and the agent must act at time . If the possibility of gaining future knowledge is the only source of the agent's uncertainty about its own decisions, then at time , it knows what action it is taking, and is undefined on actions not taken. and should both be well-defined, but they could be different. The problem description should disambiguate between them. Suppose that every time you say and in the description of the contract, this means and , respectively. The agent purchases the contract, and then, when it comes time to act, it evaluates consequences by , not , so the argument for why the agent will inevitably resell the contract fails. If the appearing in the description of the contract instead means (since the agent doesn't know what that is yet, this means the contract references what the agent will believe in the future, rather than stating numerical payoffs), then the agent won't purchase it in the first place because it will know that the contract will only have value if seems to be suboptimal at time and it takes action anyway, which it knows won't happen, and hence the contract is worthless.

**AlexMennen**on "Infra-Bayesianism with Vanessa Kosoy" – Watch/Discuss Party · 2021-03-28T20:49:18.835Z · LW · GW

The Nirvana trick seems like a cheap hack, and I'm curious if there's a way to see it as good reasoning.

One response to this was that predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance, which is a sensible thing to say as a prediction that that circumstance is impossible.

**AlexMennen**on What's So Bad About Ad-Hoc Mathematical Definitions? · 2021-03-18T00:25:13.189Z · LW · GW

That's exactly what I was trying to say, not a disagreement with it. The only step where I claimed all reasonable ways of measuring spreadout-ness agree was on the result you get after summing up a large number of iid random variables, not the random variables that were being summed up.

**AlexMennen**on What's So Bad About Ad-Hoc Mathematical Definitions? · 2021-03-16T18:29:20.764Z · LW · GW

The "or any other measure of spreadout-ness" can be dropped here

What I meant is that, if you restrict attention to normal distributions with a fixed mean, then any reasonable measure of how spread out it is (including any of the E[|x-mean|^p]) will be a sufficient statistic, because any such measure, in order to be reasonable, must increase as variance increases (for normal distributions), so this function can be inverted to recover the variance. In other words, any other such measure will indeed be isomorphic to variance when restricted to normal distributions.

The value of m minimizing E[|X-m|] should change if I decrease the minimum X-value a lot, while leaving everything else constant

This does not change the minimizer of E[|X-m|] because it increases E[|X-m|] by the same amount for every m>min(X).

In general, you can't decrease E[|X-m|] by moving m from median to median-d for d>0 because, for xmedian (half the distribution), you increase |X-m| by d, and for the other half, you decrease |X-m| by at most d.

**AlexMennen**on What's So Bad About Ad-Hoc Mathematical Definitions? · 2021-03-16T05:03:03.549Z · LW · GW

Variance has more motivation than just that it's a measure of how spread out the distribution is. Variance has the property that if two random variables are independent, then the variance of their sum is the sum of their variances. By the central limit theorem, if you add up a sufficiently large number of independent and identically distributed random variables, the distribution you get is well-approximated by a distribution that depends only on mean and variance (or any other measure of spreadout-ness). Since it is the variance of the distributions you were adding together that determines this, variance is exactly the thing you care about if you want to know the degree of spreadout-ness of a sum of a large number of independent variables from the distribution. If you take any measure of how spread out a distribution is that doesn't carry the same information as the variance, then it will fail to predict how spread out the sum of a large number of independent copies of the distribution is, by any measure.

Edit: On the subject of other possible measures of features of probability distributions, one could also make the same complaint about mean as a measure of the middle of a distribution, when there are possible alternatives like median. Again, a similar sort of argument can be used to identify mean as the best one in some circumstances. But if I were to define a measure of how spread out a distribution is as E[|X-m|] for some m, I would use m=median rather than m=mean. This is because m=median minimizes this expected absolute value (in fact, median can be defined this way), so this measures the minimal average distance every point in the distribution has to travel in order for them to all meet at one point (the median is the most efficient point for them to meet).

**AlexMennen**on I'm still mystified by the Born rule · 2021-03-09T07:33:27.379Z · LW · GW

I was just thinking back to this, and it occurred to me that one possible reason to be unsatisfied with the arguments I presented here is that I started off with this notion of a crossing-over point as p continuously increases. But then when you asked "ok, but why is the crossing-over point 2?", I was like "uh, consider that it might be an integer, and then do a bunch of very discrete-looking arguments that end up showing there's something special about 2", which doesn't connect very well with the "crossover point when p continuously varies" picture. If indeed this seemed unsatisfying to you, then perhaps you'll like this more:

If we have a norm on a vector space, then it induces a norm on its dual space, given by . If a linear map preserves a norm, then its adjoint preserves the induced norm on the dual space.

Claim: The Lp norm on column vectors induces, as its dual, the Lq norm on row vectors, where p and q satisfy .

Thus if a matrix preserves Lp norm, then its adjoint preserves Lq norm. When p=2, we get that its adjoint preserves the same norm. This sort of gives you a natural way of seeing 2 as halfway between 1 and infinity, and giving, for every p, a corresponding q that is equally far away from the middle in the other direction, in the appropriate sense.

Proof of claim: Given p and q such that , and a row vector with Lq norm 1, let , so that . Then let (with the same sign as ). The column vector has Lp norm 1. . This shows that the dual-Lp norm of is at least 1. Standard constrained optimization techniques will verify that this maximizes subject to the constraint that has Lp norm 1, and thus that the dual-Lp norm of is exactly 1.

Corollary: If a matrix preserves Lp norm for any p2, then it is a permutation matrix (up to flipping the signs of some of its entries).

Proof: Let q be such that . The columns of the matrix each have Lp norm 1, so the whole matrix has Lp norm (since the entries from each of the n columns contribute 1 to the sum). By the same reasoning about its adjoint, the matrix has Lq norm . Assume wlog p<q. Lq norm is Lp norm for q>p, with equality only on scalar multiples of basis vectors. So if any column of the matrix isn't a basis vector (up to sign), then its Lq norm is less than 1; meanwhile, all the columns have Lq norm at most 1, so this would mean that the Lq norm of the whole matrix is strictly less than , contradicting the argument about its adjoint.

**AlexMennen**on I'm still mystified by the Born rule · 2021-03-05T08:47:23.066Z · LW · GW

Also, I'm curious what you think the connection is between the "L2 is connected to bilinear forms" and "L2 is the only Lp metric invariant under nontrivial change of basis", if it's easy to state.

This was what I was trying to vaguely gesture towards with the derivation of the "transpose = inverse" characterization of L2-preserving matrices; the idea was that the argument was a natural sort of thing to try, so if it works to get us a characterization of the Lp-preserving matrices for exactly one value of p, then that's probably the one that has a different space of Lp-preserving matrices than the rest. But perhaps this is too sketchy and mysterian. Let's try a dimension-counting argument.

Linear transformations and bilinear forms can both be represented with matrices. Linear transformations act on the space of bilinear forms by applying the linear transformation to both inputs before plugging them into the bilinear form. If the matrix represents a linear transformation and the matrix represents a bilinear form, then the matrix representing the bilinear form you get from this action is . But whatever, the point is, so far we have an -dimensional group acting on an -dimensional space. But quadratic forms (like the square of the L2 norm) can be represented by *symmetric* matrices, the space of which is -dimensional, and if is symmetric, then so is . So now we have an -dimensional group acting on a -dimensional space, so the stabilizer of any given element must be at least dimensional. As it turns out, this is exactly the dimensionality of the space of orthogonal matrices, but the important thing is that this is nonzero, which explains why the space of orthogonal matrices must not be discrete.

Now let's see what happens if we try to adapt this argument to Lp and p-linear forms for some p2.

With p=1, a linear transformation preserving a linear functional corresponds to a matrix preserving a row vector in the sense that . You can do a dimension-counting argument and find that there are tons of these matrices for any given row vector, but it doesn't do you any good because 1 isn't even so preserving the linear functional doesn't mean you preserve L1 norm.

Let's try p=4, then. A 4-linear form can be represented by an hypermatrix, the space of which is -dimensional. Again, we can restrict attention to the symmetric ones, which are preserved by the action of linear maps. But the space of symmetric hypermatrices is -dimensional, still much more than . This means that our linear maps can use up all of their degrees of freedom moving a symmetric 4-linear form around to different 4-linear forms without even getting close to filling up the whole space, and never gets forced to use its surplus degrees of freedom with linear maps that stabilize a 4-linear form, so it doesn't give us linear maps stabilizing L4 norm.

**AlexMennen**on I'm still mystified by the Born rule · 2021-03-05T03:50:40.014Z · LW · GW

A related thing that's special about the L2 norm is that there's a bilinear form such that |v| carries the same information as .

"Ok, so what? Can't do you the same thing with any integer n, with an n-linear form?" you might reasonably ask. First of all, not quite, it only works for the even integers, because otherwise you need to use absolute value*, which isn't linear.

But the bilinear forms really are the special ones, roughly speaking because they are a similar type of object to linear transformations. By currying, a bilinear form on V is a linear map , where is the space of linear maps . Now the condition of a linear transformation preserving a bilinear form can just be written in terms of chaining linear maps together. A linear map has an adjoint given by for , and a linear map preserves a bilinear form iff . When using coordinates in an orthonormal basis, the bilinear form is represented by the identity matrix, so if is represented by the matrix , this becomes , which is where the usual definition of an orthogonal matrix comes from. For quadrilinear forms etc, you can't really do anything like this. So it's L2 for which you get a way of characterizing "norm-preserving" in a nice clean linear-algebraic-in-character way, so it makes sense that that would be the one to have a different space of norm-preserving maps than the others.

I also subtly brushed past something that makes L2 a particularly special norm, although I guess it's not clear if it helps. A nondegenerate bilinear form is the same thing as an isomorphism between and . If is always positive, then taking its square root gives you a norm, and that norm is L2 (though it may be disguised if you weren't using an orthonormal basis); and if it isn't always positive, then you don't get a norm out of it at all. So L2 is unique among all possible norms in that it induces and comes from an identification between your vector space and its dual.

*This assumes your vector space is over for simplicity. If it's over , then you can't get multilinearity no matter what you do, and the way this argument has to go is that you can get close enough by taking the complex conjugate of exactly half of the inputs, and then you get multilinearity from there. Speaking of , this reminds me that I was inappropriately assuming your vector space was over in my previous comment. Over , you can multiply basis vectors by any scalar of absolute value 1, not just +1 and -1. This is broader that the norm-preserving changes of basis you can do over to exactly the extent explicable by the fact that you're sneaking in a little bit of L2 via the definition of the absolute value of a complex number.

**AlexMennen**on I'm still mystified by the Born rule · 2021-03-04T21:25:05.137Z · LW · GW

is the L2 norm preferred b/c it's the only norm that's invariant under orthonormal change of basis, or is the whole idea of orthonormality somehow baking in the fact that we're going to square and sqrt everything in sight (and if so how)

The L2 norm is the only Lp norm that can be preserved by *any* non-trivial change of basis (the trivial ones: permuting basis elements and multiplying some of them by -1). This follows from the fact that, for p2, the basis elements are their negatives can be identified just from the Lp norm and the addition and scalar multiplication operations of the vector space. To intuitively gesture at why this is so, let's look at L1 and L.

In L1, the norm of the sum of two vectors is the sum of their norms iff for each coordinate, both vectors have components of the same sign; otherwise, they cancel in some coordinate, and the norm of the sum is smaller than the sum of the norms. 0 counts as the same sign as everything, so the more zeros a vector has in its coordinates, the more other vectors it will have the maximum possible norm of sum with. The basis vectors and their negations are thus distinguished as those unit vectors u for which the set {v : |u+v| = |u|+|v|} is maximal. Since the alternative to |u+v| = |u|+|v| is |u+v| < |u|+|v|, the basis vectors can be thought of as having maximal tendency for their sums with other vectors to have large norm.

In L, on the other hand, as long as you're keeping the largest coordinate fixed, changing the other coordinates costs nothing in terms of the norm of the vector, but making those other coordinates larger still creates more opportunities to change the norm of other vectors when you add them together. So if you're looking for a unit vector u that minimizes {v : |u+v| |v|}, u is a basis vector or the negation of one. The basis vectors have minimal tendency for their sums with other vectors to have large norm.

As p increases, the tendency for basis vectors to have large sums with other vectors decreases (as compared to the tendency for arbitrary vectors to have large sums with other vectors). There must be a cross-over point where whether or not a vector is a basis vector ceases to be predictive of the norm of its sum with an arbitrary other vector, and we lose the ability to figure out which vectors are basis vectors only at that point, which is p=2.

So if you're trying to guess what sort of norm some vector space naturally carries (let's say you're given, as a hint, that it's an Lp norm for some p), L2 should start out as a pretty salient option, along with, and arguably ahead of, L1 and L. As soon as you hear anything about there being multiple different bases that seem to have equal footing (as is saliently the case in QM), that settles it: L2 is the only option.

**AlexMennen**on I'm still mystified by the Born rule · 2021-03-04T08:09:58.342Z · LW · GW

I disagree that using the latter to generate a sensory stream from a quantum state yields reasonable predictions -- eg, taken literally I think you're still zeroing out all but a measure-zero subset of the position basis

The observation you got from your sample is information. Information is entropy, and entropy is locally finite. So I don't think it's possible for the states consistent with the observation you got from your sample to have measure zero.

**AlexMennen**on Utility Maximization = Description Length Minimization · 2021-02-22T08:12:02.462Z · LW · GW

I don't see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

**AlexMennen**on Superintelligence via whole brain emulation · 2020-09-21T19:58:17.190Z · LW · GW

All that is indeed possible, but not guaranteed. The reason I was speculating that better brain imaging wouldn't be especially useful for machine learning in the absence of better neuron models is that I'd assume that the optimization pressure that went into the architecture of brains was fairly heavily tailored to the specific behavior of the neurons that those brains are made of, and wouldn't be especially useful relative to other neural network design techniques that humans come up with when used with artificial neurons that behave quite differently. But sure, I shouldn't be too confident of this. In particular, the idea of training ML systems to imitate brain activation patterns, rather than copying brain architecture directly, is a possible way around this that I hadn't considered.

**AlexMennen**on Superintelligence via whole brain emulation · 2020-09-19T23:44:27.313Z · LW · GW

No. Scanning everything and then waiting until we have a good enough neuron model might work fine; it's just that the scan wouldn't give you a brain emulation until your neuron model is good enough.

**AlexMennen**on An overview of 11 proposals for building safe advanced AI · 2020-06-19T05:10:16.939Z · LW · GW

For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

**AlexMennen**on Intuitive Lagrangian Mechanics · 2020-06-15T07:28:46.674Z · LW · GW

I'm confused about the motivation for in terms of time dilation in general relativity. I was under the impression that general relativity doesn't even have a notion of gravitational potential, so I'm not sure what this would mean. And in Newtonian physics, potential energy is only defined up to an added constant. For to represent any sort of ratio (including proper time/coordinate time), V would have to be well-defined, not just up to an arbitrary added constant.

I also had trouble figuring out the relationship between the Euler-Lagrange equation and extremizing S. The Euler-Lagrange equation looks to me like just a kind of funny way of stating Newton's second law of motion, and I don't see why it should be equivalent to extremizing action. Perhaps this would be obvious if I knew some calculus of variations?

**AlexMennen**on Relaxed adversarial training for inner alignment · 2020-06-09T08:22:08.359Z · LW · GW

I'm concerned about Goodhart's law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results the end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

**AlexMennen**on An overview of 11 proposals for building safe advanced AI · 2020-06-08T06:17:38.239Z · LW · GW

Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you've already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it's just affecting the efficiency with which putting resources towards training leads to good performance. Separating competitiveness into training and performance competitiveness would make sense if there's a fixed amount of training that must be done to achieve any reasonable performance at all, but past that, more training is not effective at producing better performance. My impression is that this isn't usually what happens.

**AlexMennen**on Scott Garrabrant's problem on recovering Brouwer as a corollary of Lawvere · 2020-05-04T18:33:53.650Z · LW · GW

Let α be the least countable ordinal such that there is no polynomial-time computable recursive well-ordering of length α.

, which makes the claim you made about it vacuous.

Proof: Let be any computable well-ordering of . Let be the number of steps it takes to compute whether or not . Let (notice I'm using the standard ordering on , so this is the maximum of a finite set, and is thus well-defined). is computable in time. Let be a bijective pairing function such that both the pairing function and its inverse are computable in polynomial time. Now let be the well-ordering of given by , if is not for any , and if neither nor is of the form for any . Then is computable in polynomial time, and the order type of is plus the order type of , which is just the same as the order type of if that order type is at least .

The fact that you said you think is makes me suspect you were thinking of the least countable ordinal such that there is no recursive well-ordering of length that can be proven to be a recursive well-ordering in a natural theory of arithmetic such that, for every computable function, there's a program computing that function that the given theory can prove is total iff there's a program computing that function in polynomial time.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-20T01:41:36.830Z · LW · GW

This makes Savage a better comparison point, since the Savage axioms are more similar to the VNM framework while also trying to construct probability and utility together with one representation theorem.

Sure, I guess I just always talk about VNM instead of Savage because I never bothered to learn how Savage's version works. Perhaps I should.

As a representation theorem, this makes VNM weaker and JB stronger: VNM requires stronger assumptions (it requires that the preference structure include information about all these probability-distribution comparisons), where JB only requires preference comparison of events which the agent sees as real possibilities.

This might be true if we were idealized agents who do Bayesian updating perfectly without any computational limitations, but as it is, it seems to me that the assumption that there is a fixed prior is unreasonably demanding. People sometimes update probabilities based purely on further thought, rather than empirical evidence, and a framework in which there is a fixed prior which gets conditioned on events, and banishes discussion of any other probability distributions, would seem to have some trouble handling this.

Doesn't pointless topology allow forsomedistinctions which aren't meaningful in pointful topology, though?

Sure, for instance, there are many distinct locales that have no points (only one of which is the empty locale), whereas there is only one ordinary topological space with no points.

Isn't the approach you mention pretty close to JB? You're not modeling the VNM/Savage thing of arbitrary gambles; you're just assigning values (and probabilities) to events, like in JB.

Assuming you're referring to "So a similar thing here would be to treat a utility function as a function from some lattice of subsets of (the Borel subsets, for instance) to the lattice of events", no. In JB, the set of events is the domain of the utility function, and in what I said, it is the codomain.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-20T01:18:38.288Z · LW · GW

In the Savage framework, an outcome already encodes everything you care about.

Yes, but if you don't know which outcome is the true one, so you're considering a probability distribution over outcomes instead of a single outcome, then it still makes sense to speak of the probability that the true outcome has some feature. This is what I meant.

So the computation which seems to be suggested by Savage is to think of these maximally-specified outcomes, assigning them probability and utility, and then combining those to get expected utility. This seems to be very demanding: it requires imagining these very detailed scenarios.

You do not need to be able to imagine every possible outcome individually in order to think of functions on or probability distributions over the set of outcomes, any more than I need to be able to imagine each individual real number in order to understand the function or the standard normal distribution.

It seems that you're going by an analogy like Jeffrey-Bolker : VNM :: events : outcomes, which is partially right, but leaves out an important sense in which the correct analogy is Jeffrey-Bolker : VNM :: events : probability distributions, since although utility is defined on outcomes, the function that is actually evaluated is expected utility, which is defined on probability distributions (this being a distinction that does not exist in Jeffrey-Bolker, but does exist in my conception of real-world human decision making).

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-12T03:06:53.639Z · LW · GW

I agree that the considerations you mentioned in your example are not changes in values, and didn't mean to imply that that sort of thing is a change in values. Instead, I just meant that such shifts in expectations are changes in probability distributions, rather than changes in events, since I think of such things in terms of how likely each of the possible outcomes are, rather than just which outcomes are possible and which are ruled out.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-10T04:55:56.801Z · LW · GW

It seems to me that the Jeffrey-Bolker framework is a poor match for what's going on in peoples' heads when they make value judgements, compared to the VNM framework. If I think about how good the consequences of an action are, I try to think about what I expect to happen if I take that action (ie the outcome), and I think about how likely that outcome is to have various properties that I care about, since I don't know exactly what the outcome will be with certainty. This isn't to say that I literally consider probability distributions in my mind, since I typically use qualitative descriptions of probability rather than numbers in [0,1], and when I do use numbers, they are very rough, but this does seem like a sort of fuzzy, computationally limited version of a probability distribution. Similarly, my estimations of how good various outcomes are are often qualitative, rather than numerical, and again this seems like a fuzzy, computationally limited version of utility function. In order to determine the utility of the event "I take action A", I need to consider how good and how likely various consequences are, and take the expectation of the 'how good' with respect to the 'how likely'. The Jeffrey-Bolker framework seems to be asking me to pretend none of that ever happened.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-09T04:09:37.498Z · LW · GW

Say I have a computer that will simulate an arbitrary Turing machine T, and will award me one utilon when that machine halts, and do nothing for me until that happens. With some clever cryptocurrency scheme, this is a scenario I could actually build today.

No, you can't do that today. You could produce a contraption that will deposit 1 BTC into a certain bitcoin wallet if and when some computer program halts, but this won't do the wallet's owner much good if they die before the program halts. If you reflect on what it means to award someone a utilon, rather than a bitcoin, I maintain that it isn't obvious that this is even possible in theory.

Why in the world would one expect a utility function over an uncountable domain to be computable?

There is a notion of computability in the continuous setting.

As far as I can see, the motivation for requiring a utility function to be computable is that this would make optimization for said utility function to be a great deal easier.

This seems like a strawman to me. A better motivation would be that agents that actually exist are computable, and a utility function is determined by judgements rendered by the agent, which is incapable of thinking uncomputable thoughts.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-09T03:09:18.120Z · LW · GW

I think we're going to have to back up a bit. Call the space of outcomes and the space of Turing machines . It sounds like you're talking about two functions, and . I was thinking of as the utility function we were talking about, but it seems you were thinking of .

You suggested should be computable but should not be. It seems to me that should certainly be computable (with the caveat that it might be a partial function, rather than a total function), as computation is the only thing Turing machines do, and that if non-halting is included in a space of outcomes (so that is total), it should be represented as some sort of limit of partial information, rather than represented explicitly, so that is continuous.

In any case, a slight generalization of Rice's theorem tells us that any computable function from Turing machines to reals that depends only of the machine's semantics must be constant, so I suppose I'm forced to agree that, if we want a utility function that is defined on all Turing machines and depends only on their semantics, then at least one of or should be uncomputable. But I guess I have to ask why we would want to assign utilities to Turing machines.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-08T23:48:48.572Z · LW · GW

It's not clear to me what this means in the context of a utility function.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-08T05:08:40.698Z · LW · GW

I'm not sure what it would mean for a real-valued function to be enumerable. You could call a function enumerable if there's a program that takes as input and enumerates the rationals that are less than , but I don't think this is what you want, since presumably if a Turing machine halting can generate a positive amount of utility that doesn't depend on the number of steps taken before halting, then it could generate a negative amount of utility by halting as well.

I think accepting the type of reasoning you give suggests that limit-computability is enough (ie there's a program that takes and produces a sequence of rationals that converges to , with no guarantees on the rate of convergence). Though I don't agree that it's obvious we should accept such utility functions as valid.

**AlexMennen**on An Orthodox Case Against Utility Functions · 2020-04-08T02:27:47.306Z · LW · GW

we need not assume there are "worlds" at all. ...In mathematics, it brings to mind pointless topology.

I don't think the motivation for this is quite the same as the motivation for pointless topology, which is designed to mimic classical topology in a way that Jeffrey-Bolker-style decision theory does not mimic VNM-style decision theory. In pointless topology, a continuous function of locales is a function from the lattice of open sets of to the lattice of open sets of . So a similar thing here would be to treat a utility function as a function from some lattice of subsets of (the Borel subsets, for instance) to the lattice of events.

My understanding of the Jeffrey-Bolker framework is that its primary difference from the VNM framework is not its pointlessness, but the fact that it comes with a prior probability distribution over outcomes, which can only be updated by conditioning on events (i.e. updating on evidence that has probability 1 in some worlds and probability 0 in the rest). VNM does not start out with a prior, and allows any probability distribution over outcomes to be compared to any other, and Jeffrey-Bolker only allows comparison of probability distributions obtained by conditioning the prior on an event. Of course, this interpretation requires a fair amount of reading between the lines, since the Jeffrey-Bolker axioms make no explicit mention of any probability distribution, but I don't see any other reasonable way to interpret them, since if asked which of two events is better, I will often be unable to answer without further information, since the events may contain worlds of widely varying utility. Associating an event with a fixed prior conditioned on the event gives me this additional information needed to answer the question, and I don't see how any others could work. Starting with a prior that gets conditioned on events that correspond to the agent's actions seems to build in evidential decision theory as an assumption, which makes me suspicious of it.

In the Jeffrey-Bolker treatment, a world is just a maximally specific event: an event which describes everything completely. But there is no requirement that maximally-specific events exist.

This can be resolved by defining worlds to be minimal non-zero elements of the completion of the Boolean algebra of events, rather than a minimal non-zero event. This is what you seemed to be implicitly doing later with the infinite bitstrings example, where the events were clopen subsets of Cantor space (i.e. sets of infinite bitstrings such that membership in the set only depends on finitely many bits), and this Boolean algebra has no minimal non-zero elements (maximally-specific events), but the minimal non-zero elements of its completion correspond to infinite bitstrings, as desired.

**AlexMennen**on Might humans not be the most intelligent animals? · 2019-12-24T23:26:43.753Z · LW · GW

I guess what I was trying to say is (although I think I've partially figured out what you meant; see next paragraph), cultural evolution is a process that acquires adaptations slowly-ish and transmits previously-acquired adaptations to new organisms quickly, while biological evolution is a process that acquires adaptations very slowly and transmits previously-acquired adaptations to new organisms quickly. You seem to be comparing the rate at which cultural evolution acquires adaptations to the rate at which biological evolution transmits previously-acquired adaptations to new organisms, and concluding that cultural evolution is slower.

Re-reading the part of your post where you talked about AI takeoff speeds, you argue (which I hadn't understood before) that the rise of humans was fast on an evolutionary timescale, and slow on a cultural timescale, so that if it was due to an evolutionary change, it must involve a small change that had a large effect on capabilities, so that a large change will occur very suddenly if we mimic evolution quickly, while if it was due to a cultural change, it was probably a large change, so mimicking culture quickly won't produce a large effect on capabilities unless it is extremely quick.

This clarifies things, but I don't agree with the claim. I think slow changes in the intelligence of a species is compatible with fast changes in its capabilities even if the changes are mainly in raw innovative ability rather than cultural learning. Innovations can increase ability to innovate, causing a positive feedback loop. A species could have high enough cultural learning ability for innovations to be transmitted over many generations without having the innovative ability to ever get the innovations that will kick off this loop. Then, when they start slowing gaining innovative ability, the innovations accumulated into cultural knowledge gradually increase, until they reach the feedback loop and the rate of innovation becomes more determined by changes in pre-existing innovations than by changes in raw innovative ability. There doesn't even have to be any evolutionary changes in the period in which innovation rate starts to get dramatic.

If you don't buy this story, then it's not clear why the changes being in cultural learning ability rather than in raw innovative ability would remove the need for a discontinuity. After all, our cultural learning ability went from not giving us much advantage over other animals to "accumulating decisive technological dominance in an evolutionary eyeblink" in an evolutionary eyeblink (quotation marks added for ease of parsing). Does this mean our ability to learn from culture must have greatly increased from a small change? You argue in the post that there's no clear candidate for what such a discontinuity in cultural learning ability could look like, but this seems just as true to me for raw innovative ability.

Perhaps you could argue that it doesn't matter if there's a sharp discontinuity in cultural learning ability because you can't learn from a culture faster than the culture learns things to teach you. In this case, yes, perhaps I would say that AI-driven culture could make advancements that look discontinuous on a human scale. Though I'm not entirely sure what that would look like, and I admit it does sound kind of soft-takeoffy.

**AlexMennen**on Might humans not be the most intelligent animals? · 2019-12-24T20:46:50.182Z · LW · GW

The abilities we obtained from architectural changes to our brains also came from a slow, accumulated process, taking even longer than cultural evolution does.

**AlexMennen**on Might humans not be the most intelligent animals? · 2019-12-24T18:58:10.860Z · LW · GW

There's more than one thing that you could mean by raw innovative capacity separate from cultural processing ability. First, you could mean someone's ability to innovate on their own without any direct help from others on the task at hand, but where they're allowed to use skills that they previously acquired from their culture. Second, you could mean someone's counterfactual ability to innovate on their own if they hadn't learned from culture. You seem to be conflating these somewhat, though mostly focusing on the second?

The second is underspecified, as you'd need to decide what counterfactual upbringing you're assuming. If you compare the cognitive performance of a human raised by bears to the cognitive performance of a bear in the same circumstances, this is unfair to the human, since the bear is raised in circumstances that it is adapted for and the human is not, just like comparing the cognitive performance of a bear raised by humans to that of a human in the same circumstances would be unfair to the bear. Though a human raised by non-humans would still make a more interesting comparison to non-human animals than Genie would, since Genie's environment is even less conducive to human development (I bet most animals wouldn't cognitively develop very well if they were kept immobilized in a locked room until maturity either).

I think this makes the second notion less interesting than the first, as there's a somewhat arbitrary dependence on the counterfactual environment. I guess the first notion is more relevant when trying to reason specifically on genetics as opposed to other factors that influence traits, but the second seems more relevant in other contexts, since it usually doesn't matter to what extent someone's abilities were determined by genetics or environmental factors.

I didn't really follow your argument for the relevance of this question to AI development. Why should raw innovation ability be more susceptible to discontinuous jumps than cultural processing ability? Until I understand the supposed relevance to AI better, it's hard for me to say which of the two notions is more relevant for this purpose.

I'd be very surprised if any existing non-human animals are ahead of humans by the first notion, and there's a clear reason in this case why performance would correlate strongly with social learning ability: social learning will have helped people in the past develop skills that they keep in the present. Even for the second notion, though it's a bit hard to say without pinning down the counterfactual more closely, I'd still expect humans to outperform all other animals in some reasonable compromise environment that helps both develop but doesn't involve them being taught things that the non-humans can't follow. I think there are still reasons to expect social learning ability and raw innovative capability to be correlated even in this sense, because higher general intelligence will help for both; original discovery and understanding things that are taught to you by others both require some of the same cognitive tools.

**AlexMennen**on AlexMennen's Shortform · 2019-12-08T04:51:10.077Z · LW · GW

Theorem: Fuzzy beliefs (as in https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v ) form a continuous DCPO. (At least I'm pretty sure this is true. I've only given proof sketches so far)

The relevant definitions:

A fuzzy belief over a set is a concave function such that (where is the space of probability distributions on ). Fuzzy beliefs are partially ordered by . The inequalities reverse because we want to think of "more specific"/"less fuzzy" beliefs as "greater", and these are the functions with lower values; the most specific/least fuzzy beliefs are ordinary probability distributions, which are represented as the concave hull of the function assigning 1 to that probability distribution and 0 to all others; these should be the maximal fuzzy beliefs. Note that, because of the order-reversal, the supremum of a set of functions refers to their pointwise infimum.

A DCPO (directed-complete partial order) is a partial order in which every directed subset has a supremum.

In a DCPO, define to mean that for every directed set with , such that . A DCPO is continuous if for every , .

Lemma: Fuzzy beliefs are a DCPO.

Proof sketch: Given a directed set , is convex, and . Each of the sets in that intersection are non-empty, hence so are finite intersections of them since is directed, and hence so is the whole intersection since is compact.

Lemma: iff is contained in the interior of and for every such that , .

Proof sketch: If , then , so by compactness of and directedness of , there should be such that . Similarly, for each such that , there should be such that . By compactness, there should be some finite subset of such that any upper bound for all of them is at least .

Lemma: .

Proof: clear?

**AlexMennen**on TurnTrout's shortform feed · 2019-11-26T23:37:25.091Z · LW · GW

The part about derivatives might have seemed a little odd. After all, you might think, is a discrete set, so what does it mean to take derivatives of functions on it. One answer to this is to just differentiate symbolically using polynomial differentiation rules. But I think a better answer is to remember that we're using a different metric than usual, and isn't discrete at all! Indeed, for any number , , so no points are isolated, and we can define differentiation of functions on in exactly the usual way with limits.

**AlexMennen**on TurnTrout's shortform feed · 2019-11-26T23:28:49.532Z · LW · GW

The theorem: where is relatively prime to an odd prime and , is a square mod iff is a square mod and is even.

The real meat of the theorem is the case (i.e. a square mod that isn't a multiple of is also a square mod . Deriving the general case from there should be fairly straightforward, so let's focus on this special case.

Why is it true? This question has a surprising answer: Newton's method for finding roots of functions. Specifically, we want to find a root of , except in instead of .

To adapt Newton's method to work in this situation, we'll need the p-adic absolute value on : for relatively prime to . This has lots of properties that you should expect of an "absolute value": it's positive ( with only when ), multiplicative (), symmetric (), and satisfies a triangle inequality (; in fact, we get more in this case: ). Because of positivity, symmetry, and the triangle inequality, the p-adic absolute value induces a metric (in fact, ultrametric, because of the strong version of the triangle inequality) . To visualize this distance function, draw giant circles, and sort integers into circles based on their value mod . Then draw smaller circles inside each of those giant circles, and sort the integers in the big circle into the smaller circles based on their value mod . Then draw even smaller circles inside each of those, and sort based on value mod , and so on. The distance between two numbers corresponds to the size of the smallest circle encompassing both of them. Note that, in this metric, converges to .

Now on to Newton's method: if is a square mod , let be one of its square roots mod . ; that is, is somewhat close to being a root of with respect to the p-adic absolute value. , so ; that is, is steep near . This is good, because starting close to a root and the slope of the function being steep enough are things that helps Newton's method converge; in general, it might bounce around chaotically instead. Specifically, It turns out that, in this case, is exactly the right sense of being close enough to a root with steep enough slope for Newton's method to work.

Now, Newton's method says that, from , you should go to . is invertible mod , so we can do this. Now here's the kicker: , so . That is, is closer to being a root of than is. Now we can just iterate this process until we reach with , and we've found our square root of mod .

Exercise: Do the same thing with cube roots. Then with roots of arbitrary polynomials.

**AlexMennen**on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-02T19:51:47.196Z · LW · GW

The impressive part is getting reinforcement learning to work at all in such a vast state space

It seems to me that that is AGI progress? The real world has an even vaster state space, after all. Getting things to work in vast state spaces is a necessary pre-condition to AGI.

**AlexMennen**on What are we assuming about utility functions? · 2019-10-03T16:54:44.047Z · LW · GW

Ok, I see what you mean about independence of irrelevant alternatives only being a real coherence condition when the probabilities are objective (or otherwise known to be equal because they come from the same source, even if there isn't an objective way of saying what their common probability is).

But I disagree that this makes VNM only applicable to settings in which all sources of uncertainty have objectively correct probabilities. As I said in my previous comment, you only need there to exist some source of objective probabilities, and you can then use preferences over lotteries involving objective probabilities and preferences over related lotteries involving other sources of uncertainty to determine what probability the agent must assign for those other sources of uncertainty.

Re: the difference between VNM and Bayesian expected utility maximization, I take it from the word "Bayesian" that the way you're supposed to choose between actions does involve first coming up with probabilities of each outcome resulting from each action, and from "expected utility maximization", that these probabilities are to be used in exactly the way the VNM theorem says they should be. Since the VNM theorem does not make any assumptions about where the probabilities came from, these still sound essentially the same, except with Bayesian expected utility maximization being framed to emphasize that you have to get the probabilities somehow first.

**AlexMennen**on What are we assuming about utility functions? · 2019-10-03T06:10:34.388Z · LW · GW

I think you're underestimating VNM here.

only two of those four are relevant to coherence. The main problem is that the axioms relevant to coherence (acyclicity and completeness) do not say anything at all about probability

It seems to me that the independence axiom is a coherence condition, unless I misunderstand what you mean by coherence?

correctly point out problems with VNM

I'm curious what problems you have in mind, since I don't think VNM has problems that don't apply to similar coherence theorems.

VNM utility stipulates that agents have preferences over "lotteries" with known, objective probabilities of each outcome. The probabilities are assumed to be objectively known from the start. The Bayesian coherence theorems do not assume probabilities from the start; they derive probabilities from the coherence criteria, and those probabilities are specific to the agent.

One can construct lotteries with probabilities that are pretty well understood (e.g. flipping coins that we have accumulated a lot of evidence are fair), and you can restrict attention to lotteries only involving uncertainty coming from such sources. One may then get probabilities for other, less well-understood sources of uncertainty by comparing preferences involving such uncertainty to preferences involving easy-to-quantify uncertainty (e.g. if A is preferred to B, and you're indifferent between 60%A+40%B and "A if X, B if not-X", then you assign probability 60% to X. Perhaps not quite as philosophically satisfying as deriving probabilities from scratch, but this doesn't seem like a fatal flaw in VNM to me.

I do not expect agent-like systems in the wild to be pushed toward VNM expected utility maximization. I expect them to be pushed toward Bayesian expected utility maximization.

I understood those as being synonyms. What's the difference?

**AlexMennen**on [AN #66]: Decomposing robustness into capability robustness and alignment robustness · 2019-10-02T00:58:31.657Z · LW · GW

I do, however, believe that the single step cooperate-defect game which they use to come up with their factors seems like a very simple model for what will be a very complex system of interactions. For example, AI development will take place over time, and it is likely that the same companies will continue to interact with one another. Iterated games have very different dynamics, and I hope that future work will explore how this would affect their current recommendations, and whether it would yield new approaches to incentivizing cooperation.

It may be difficult for companies to get accurate information about how careful their competitors are being about AI safety. An iterated game in which players never learn what the other players did on previous rounds is the same as a one-shot game. This points to a sixth factor that increases chance of cooperation on safety: high transparency, so that companies may verify their competitors' cooperation on safety. This is closely related to high trust.

**AlexMennen**on A Critique of Functional Decision Theory · 2019-09-16T03:40:17.491Z · LW · GW

I object to the framing of the bomb scenario on the grounds that low probabilities of high stakes are a source of cognitive bias that trip people up for reasons having nothing to do with FDT. Consider the following decision problem: "There is a button. If you press the button, you will be given $100. Also, pressing the button has a very small (one in a trillion trillion) chance of causing you to burn to death." Most people would not touch that button. Using the same payoffs and probabilies in a scenario to challenge FDT thus exploits cognitive bias to make FDT look bad. A better scenario would be to replace the bomb with something that will fine you $1000 (and, if you want, also increase the chance of of error).

But then, it seems to me, that FDT has lost much of its initial motivation: the case for one-boxing in Newcomb’s problem didn’t seem to stem from whether the Predictor was running a simulation of me, or just using some other way to predict what I’d do.

I think the crucial difference here is how easily you can cause the predictor to be wrong. In the case where the predictor simulates you, if you two-box, then the predictor expects you to two-box. In the case where the predictor uses your nationality to predict your behavior, Scots usually one-box, and you're Scottish, if you two-box, then the predictor will still expect you to one-box because you're Scottish.

But now suppose that the pathway by which S causes there to be money in the opaque box or not is that another agent looks at S...

I didn't think that was supposed to matter at all? I haven't actually read the FDT paper, and have mostly just been operating under the assumption that FDT is basically the same as UDT, but UDT didn't build in any dependency on external agents, and I hadn't heard about any such dependency being introduced in FDT; it would surprise me if it did.

**AlexMennen**on A Critique of Functional Decision Theory · 2019-09-16T02:15:11.538Z · LW · GW

I don't know if I'm a simulation or a real person.

A possible response to this argument is that the predictor may be able to accurately predict the agent without explicitly simulating them. A possible counter-response to this is to posit that any sufficiently accurate model of a conscious agent is necessarily conscious itself, whether the model takes the form of an explicit simulation or not.

**AlexMennen**on Troll Bridge · 2019-09-15T16:34:59.969Z · LW · GW

I think the counterfactuals used by the agent are the correct counterfactuals for someone else to use while reasoning about the agent from the outside, but not the correct counterfactuals for the agent to use while deciding what to do. After all, knowing the agent's source code, if you see it start to cross the bridge, it is correct to infer that it's reasoning is inconsistent, and you should expect to see the troll blow up the bridge. But while deciding what to do, the agent should be able to reason about purely causal effects of its counterfactual behavior, screening out other logical implications.

Also, counterfactuals which predict that the bridge blows up seem to be saying that the agent can control whether PA is consistent or inconsistent.

Disagree that that's what's happening. The link between the consistency of the reasoning system and the behavior of the agent is because the consistency of the reasoning system controls the agent's behavior, rather than the other way around. Since the agent is selecting outcomes based on their consequences, it does make sense to speak of the agent choosing actions to some extent, but I think speaking of logical implications of the agent's actions on the consistency of formal systems as "controlling" the consistency of the formal system seems like an inappropriate attribution of agency to me.

**AlexMennen**on A Primer on Matrix Calculus, Part 2: Jacobians and other fun · 2019-08-19T16:44:00.348Z · LW · GW

I suppose why that's not why we're minimizing determinant, but rather frobenius norm.

Yes, although another reason is that the determinant is only defined if the input and output spaces have the same dimension, which they typically don't.

**AlexMennen**on A Primer on Matrix Calculus, Part 3: The Chain Rule · 2019-08-18T19:46:36.750Z · LW · GW

First, a vector can be seen as a list of numbers, and a matrix can be seen as an ordered list of vectors. An ordered list of matrices is... a tensor of order 3. Well not exactly. Apparently some people are actually disappointed with the term tensor because a tensor means something very specific in mathematics already and isn'tjustan ordered list of matrices. But whatever, that's the term we're using for this blog post at least.

It's true that tensors are something more specific than multidimensional arrays of numbers, but Jacobians of functions between tensor spaces (that being what you're using the multidimensional arrays for here) are, in fact, tensors.

**AlexMennen**on A Primer on Matrix Calculus, Part 2: Jacobians and other fun · 2019-08-18T04:58:29.491Z · LW · GW

What this means is for the Jacobian is that the determinant tells us how much space is being squished or expanded in theneighborhoodaround a point. If the output space is being expanded a lot at some input point, then this means that the neural network is a bit unstable at that region, since minor alterations in the input could cause huge distortions in the output. By contrast, if the determinant is small, then some small change to the input will hardly make a difference to the output.

This isn't quite true; the determinant being small is consistent with small changes in input making arbitrarily large changes in output, just so long as small changes in input in a different direction make sufficiently small changes in output.

The frobenius norm is nothing complicated, and is really just a way of describing that we square all of the elements in the matrix, take the sum, and then take the square root of this sum.

An alternative definition of the frobenius norm better highlights its connection to the motivation of regularizing the Jacobian frobenius in terms of limiting the extent to which small changes in input can cause large changes in output: The frobenius norm of a matrix J is the root-mean-square of |J(x)| over all unit vectors x.

**AlexMennen**on [deleted post] 2019-07-10T17:37:58.100Z

"Controlling which Everett branch you end up in" is the wrong way to think about decisions, even if many-worlds is true. Brains don't appear to rely much on quantum randomness, so if you make a certain decision, that probably means that the overwhelming majority of identical copies of you make the same decision. You aren't controlling which copy you are; you're controlling what all of the copies do. And even if quantum randomness does end of mattering in decisions, so that a non-trivial proportion of copies of you make different decisions from each other, then you would still presumably want a high proportion of them to make good decisions; you can do your part to bring that about by making good decisions yourself.

**AlexMennen**on [deleted post] 2019-07-10T17:24:44.198Z

Consider reading a real physicist's take on the issue

This seems phrased to suggest that her view is "the real physicist view" on the multiverse. You could also read what Max Tegmark or David Deutsch, for instance, have to say about multiverse hypotheses and get a "real physicist's" view from them.

Also, she doesn't actually say much in that blog post. She points out that when she says that multiverse hypotheses are unscientific, she doesn't mean that they're false, so this doesn't seem especially useful to someone who wants to know whether there actually is a multiverse, or is interested in the consequences thereof. She says "there is no reason to think we live in such multiverses to begin with", but proponents of multiverse hypotheses have given reasons to support their views, which she doesn't address.

**AlexMennen**on Von Neumann’s critique of automata theory and logic in computer science · 2019-05-29T01:17:38.188Z · LW · GW

#1 (at the end) sounds like complexity theory.

Some of what von Neumann says makes it sound like he's interested in a mathematical foundation for analog computing, which I think has been done by now.