# rohinmshah's Shortform

post by rohinmshah · 2020-01-18T23:21:02.302Z · score: 14 (3 votes) · 22 comments

I often have the experience of being in the middle of a discussion and wanting to reference some simple but important idea / point, but there doesn't exist any such thing. Often my reaction is "if only there was time to write an LW post that I can then link to in the future". So far I've just been letting these ideas be forgotten, because it would be Yet Another Thing To Keep Track Of. I'm now going to experiment with making subcomments here simply collecting the ideas; perhaps other people will write posts about them at some point, if they're even understandable.

Consider two methods of thinking:

1. Observe the world and form some gears-y model of underlying low-level factors, and then make predictions by "rolling out" that model.

2. Observe relatively stable high-level features of the world, predict that those will continue as is, and make inferences about low-level factors conditioned on those predictions.

I expect that most intellectual progress is accomplished by people with lots of detailed knowledge and expertise in an area doing option 1.

However, I expect that in the absence of detailed expertise, you will do much better at predicting the world by using option 2.

I think many people on LW tend to use option 1 almost always, and my "deference" to option 2 in the absence of expertise is what leads to disagreements like "How good is humanity at coordination?"

Conversely, I think many of the most prominent EAs who are skeptical of AI risk are using option 2 in a situation where I can use option 1 (and I think they can defer to people who can use option 1).

Options 1 & 2 sound to me a lot like inside view and outside view. Fair?

Yeah, I think so? I have a vague sense that there are slight differences but I certainly haven't explained them here.

EDIT: Also, I think a major point I would want to make if I wrote this post is that you will almost certainly be quite wrong if you use option 1 without expertise, in a way that other people without expertise won't be able to identify, because there are far more ways the world can be than you (or others) will have thought about when making your gears-y model.

Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?

(My own take is the cop-out-like, "it depends". I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you've put into it, etc.)

> Sounds like you probably disagree with the (exaggeratedly stated) point made here then, yeah?

Correct.

> My own take is the cop-out-like, "it depends". I think how much you ought to defer to experts varies a lot based on what the topic is, what the specific question is, details of your own personal characteristics, how much thought you've put into it, etc.

I didn't say you should defer to experts, just that if you try to build gears-y models you'll be wrong. It's totally possible that there's no way to get to reliably correct answers and you instead want decisions that are good regardless of what the answer is.

I recently interviewed someone who has a lot of experience predicting systems, and they had 4 steps similar to your two above.

- Observe the world and see if it's sufficiently similar to other systems to predict based on intuitive analogies.
- If there's not a good analogy, understand the first principles, then try to reason about the equilibria of that.
- If that doesn't work, assume the world will stay in a stable state, and try to reason from that.
- If that doesn't work, figure out the worst case scenario and plan from there.

I think 1 and 2 are what you do with expertise, and 3 and 4 are what you do without expertise.

Yeah, that sounds about right to me. I think in terms of this framework my claim is primarily "for reasonably complex systems, if you try to do 2 without expertise, you will fail, but you may not realize you have failed".

I'm also noticing I mean something slightly different by "expertise" than is typically meant. My intended meaning of "expertise" is more like "you have lots of data and observations about the system", e.g. I think LW self-help stuff is reasonably likely to work (for the LW audience) because people have lots of detailed knowledge and observations about themselves and their friends.

I like this experiment! Keep 'em coming.

In general, evaluate the credibility of experts on the decisions they make or recommend, not on the beliefs they espouse. The selection in our world is based much more on outcomes of decisions than on calibration of beliefs, so you should expect experts to be way better on the former than on the latter.

By "selection", I mean both selection pressures generated by humans, e.g. which doctors gain the most reputation, and selection pressures generated by nature, e.g. most people know how to catch a ball even though most people would get conceptual physics questions wrong.

Similarly, trust decisions / recommendations given by experts more than the beliefs and justifications for those recommendations.

Intellectual progress requires points with nuance. However, on online discussion forums (including LW, AIAF, EA Forum), people seem to frequently lose sight of the nuanced point being made -- rather than thinking of a comment thread as "this is trying to ascertain whether X is true", they seem to instead read the comments, perform some sort of inference over what the author must believe *if that comment were written in isolation*, and then respond to that model of beliefs. This makes it hard to have nuance without adding a ton of clarification and qualifiers everywhere.

I find that similar dynamics happen in group conversations, and to some extent even in one-on-one conversations (though much less so).

Let's say we're talking about something complicated. Assume that any proposition about the complicated thing can be reformulated as a series of conjunctions.

Suppose Alice thinks P with 90% confidence (and therefore not-P with 10% confidence). Here's a fully general counterargument that Alice is wrong:

Decompose P into a series of conjunctions Q1, Q2, ... Qn, with n > 10. (You can first decompose P into R1 and R2, then decompose R1 further, and decompose R2 further, etc.)

Ask Alice to estimate P(Qk | Q1, Q2, ... Q{k-1}) for all k.

At least one of these must be over 99% (if we have n = 11 and they were all 99%, then probability of P would be (0.99 ^ 11) = 89.5% which contradicts the original 90%).

Argue that Alice can't possibly have enough knowledge to place under 1% on the negation of the statement.
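The arithmetic behind the third step can be checked directly. This is a minimal sketch; the helper name `uniform_conditional_needed` is mine, and it only covers the illustrative case where all the conditional probabilities are equal.

```python
def uniform_conditional_needed(p_total: float, n: int) -> float:
    """Value q such that n equal conditionals of size q multiply to p_total."""
    return p_total ** (1.0 / n)

n = 11
all_99 = 0.99 ** n                               # P if every conditional is exactly 99%
needed = uniform_conditional_needed(0.90, n)     # equal conditionals needed to reach 90%

print(f"0.99^{n} = {all_99:.4f}")                # about 0.8953, i.e. below 90%
print(f"needed per conjunct: {needed:.4f}")      # slightly above 0.99
```

Since the conditionals multiply to P, if all of them were at most 99% the product could not reach 90%, so at least one must exceed 99%.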

----

What's the upshot? When two people disagree on a complicated claim, decomposing the question is only a good move *when both people think that is the right way to carve up the question*. Most of the disagreement is likely in how to carve up the claim in the first place.

The simple response to the unilateralist curse under the standard setting is to aggregate opinions amongst the people in the reference class, and then take a majority vote.

A particular flawed response is to look for N opinions that say "intervening is net negative" and intervene iff you cannot find that many opinions. This sacrifices value and induces a new unilateralist curse on people who think the intervention is negative. (Example.)

However, the hardest thing about the unilateralist curse is figuring out how to define the reference class in the first place.

I didn't get it... is the problem with the "look for N opinions" response that you aren't computing the denominator (|"intervening is positive"| + |"intervening is negative"|)?

Yes, that's the problem. In this situation, if N << population / 2, you are likely to not intervene even when the intervention is net positive; if N >> population / 2, you are likely to intervene even when the intervention is net negative.

(This is under the standard model of a one-shot decision where each participant gets a noisy observation of the true value with the noise being iid Gaussians with mean zero.)
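A small simulation can illustrate the two failure modes of the "look for N negative opinions" rule next to majority vote. This is only a sketch of the standard model described above; the population size, noise level, and thresholds are arbitrary choices of mine, and `simulate` is a hypothetical helper.

```python
import random

def simulate(true_value, population, n_required, noise_sd=1.0, trials=5000, seed=0):
    """Return (P(intervene | majority vote), P(intervene | 'fewer than N negatives' rule))."""
    rng = random.Random(seed)
    majority_hits = 0
    threshold_hits = 0
    for _ in range(trials):
        # Each participant sees the true value plus iid zero-mean Gaussian noise
        # and counts as a "negative opinion" if their observation is below zero.
        negatives = sum(
            1 for _ in range(population)
            if true_value + rng.gauss(0.0, noise_sd) < 0
        )
        if negatives < population / 2:
            majority_hits += 1
        if negatives < n_required:
            threshold_hits += 1
    return majority_hits / trials, threshold_hits / trials

# N much smaller than population / 2, intervention actually net positive.
maj_pos, thr_pos = simulate(true_value=0.5, population=21, n_required=3)
# N much larger than population / 2, intervention actually net negative.
maj_neg, thr_neg = simulate(true_value=-0.5, population=21, n_required=19)

print(maj_pos, thr_pos)
print(maj_neg, thr_neg)
```

In the first case majority vote usually intervenes while the threshold rule rarely does; in the second case the pattern flips and the threshold rule usually intervenes on a net negative action.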

Under the standard setting, the optimizer's curse only changes your naive estimate of the EV of the action you choose. It does not change the naive decision you make. So, it is not valid to use the optimizer's curse as a critique of people who use EV calculations to make decisions, but it is valid to use it as a critique of people who make claims about the EV calculations of their most preferred outcome (if they don't already account for it).
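Here is a sketch of that claim under one reading of the "standard setting": every action's true value is drawn from the same Gaussian prior and every estimate has the same Gaussian noise. Those distributional choices, and the function name, are my assumptions. The Bayesian correction is then a uniform shrinkage of every estimate, which cannot change which action ranks first.

```python
import random

def optimizers_curse(n_actions=10, prior_sd=1.0, noise_sd=1.0, trials=5000, seed=0):
    rng = random.Random(seed)
    # Posterior mean under a zero-mean Gaussian prior is the estimate times this factor.
    shrink = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    same_choice = 0
    naive_gap = 0.0
    shrunk_gap = 0.0
    for _ in range(trials):
        true_values = [rng.gauss(0.0, prior_sd) for _ in range(n_actions)]
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        posterior = [shrink * e for e in estimates]
        naive_pick = max(range(n_actions), key=lambda i: estimates[i])
        bayes_pick = max(range(n_actions), key=lambda i: posterior[i])
        same_choice += naive_pick == bayes_pick
        # How much the claimed EV of the chosen action overstates its true value.
        naive_gap += estimates[naive_pick] - true_values[naive_pick]
        shrunk_gap += posterior[bayes_pick] - true_values[bayes_pick]
    return same_choice / trials, naive_gap / trials, shrunk_gap / trials

same, naive_bias, corrected_bias = optimizers_curse()
print(same, naive_bias, corrected_bias)
```

The decision agrees in every trial, the naive EV of the chosen action is biased upward, and the shrunk estimate is roughly unbiased. With unequal noise across actions the shrinkage is no longer uniform, and the decision itself can change.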

Consider the latest AUP equation [? · GW], where for simplicity I will assume a deterministic environment and that the primary reward depends only on state. Since there is no auxiliary reward any more, I will drop the subscripts to on and .

Consider some starting state , some starting action , and consider the optimal trajectory under that starts with that, which we'll denote as . Define to be the one-step inaction states. Assume that . Since all other actions are optimal for , we have , so the max in the equation above goes away, and the total obtained is:

Since we're considering the optimal trajectory, we have

Substituting this back in, we get that the total for the optimal trajectory is

which... uh... diverges to negative infinity, as long as . (Technically I've assumed that is nonzero, which is an assumption that there is always an action that is better than .)

So, you must prefer the always- trajectory to this trajectory. This means that *no matter what the task is* (well, as long as it has a state-based reward and doesn't fall into a trap where is optimal), the agent can never switch to the optimal policy for the rest of time. This seems a bit weird -- surely it should depend on whether the optimal policy is gaining power or not? This seems to me to be much more in the style of satisficing or quantilization than impact measurement.

----

Okay, but this happened primarily because of the weird scaling in the denominator, which we know is mostly a hack based on intuition. What if we instead just had a constant scaling?

Let's consider another setting. We still have a deterministic environment with a state-based primary reward, and now we also impose the condition that is guaranteed to be a noop: for any state , we have .

Now, for any trajectory with defined as before, we have , so

As a check, in the case where is optimal, we have

Plugging this into the original equation recovers the divergence to negative infinity that we saw before.

But let's assume that we just do a constant scaling to avoid this divergence:

Then for an *arbitrary* trajectory (assuming that the chosen actions are no worse than ), we get

The total reward across the trajectory is then

The and are constants and so don't matter for selecting policies, so I'm going to throw them out:

So in deterministic environments with state-based rewards where is a true noop (even the environment doesn't evolve), AUP with constant scaling is equivalent to adding a penalty for some constant ; that is, we're effectively penalizing the agent from reaching good states, in direct proportion to how good they are (according to ). Again, this seems much more like satisficing or quantilization than impact / power measurement.

The LESS is More paper (summarized in AN #96) makes the claim that using the Boltzmann model in sparse regions of demonstration-space will lead to the Boltzmann model over-learning. I found this plausible but not obvious, so I wanted to check it myself. (Partly I got nerd-sniped, partly I do want to keep practicing my ability to tell when things are formalizable theorems.) This benefited from discussion with Andreea (one of the primary authors).

Let's consider a model where there are *clusters* $\{c_i\}$, where each cluster contains trajectories whose features are identical (which also implies rewards are identical). Let $c(\tau)$ denote the cluster that $\tau$ belongs to. The Boltzmann model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{\tau'} \exp(R_\theta(\tau'))}$. The LESS model says $p(\tau \mid \theta) = \frac{\exp(R_\theta(c(\tau)))}{\sum_{c'} \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$, that is, the human chooses a cluster noisily based on the reward, and then uniformly at random chooses a trajectory from within that cluster.

(Note that the paper does something more suited to realistic situations where we have a similarity metric instead of these "clusters"; I'm introducing them as a simpler situation where we can understand what's going on formally.)

In this model, a "sparse region of demonstration-space" is a cluster $c$ with small cardinality $|c|$, whereas a dense one has large $|c|$.

Let's first do some preprocessing. We can rewrite the Boltzmann model as follows:

$$p(\tau \mid \theta) = \frac{\exp(R_\theta(\tau))}{\sum_{c'} |c'| \exp(R_\theta(c'))} = \frac{|c(\tau)| \exp(R_\theta(c(\tau)))}{\sum_{c'} |c'| \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

This allows us to write both models as first selecting a cluster, and then choosing randomly within the cluster:

$$p(\tau \mid \theta) = \frac{p(c(\tau)) \exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))} \cdot \frac{1}{|c(\tau)|}$$

Where for LESS $p(c)$ is uniform, i.e. $p(c) \propto 1$, whereas for Boltzmann $p(c) \propto |c|$, i.e. a denser cluster is more likely to be sampled.

So now let us return to the original claim that the Boltzmann model overlearns in sparse areas. We'll assume that LESS is the "correct" way to update (which is what the paper is claiming); in this case the claim reduces to saying that the Boltzmann model updates the posterior over rewards in the right direction but with too high a magnitude.

The intuitive argument for this is that the Boltzmann model assigns a lower likelihood to sparse clusters, since its "prior" over sparse clusters is much smaller, and so when it actually observes this low-likelihood event, it must update more strongly. However, this argument doesn't work -- it only claims that $p_{\text{Boltzmann}}(\tau \mid \theta) < p_{\text{LESS}}(\tau \mid \theta)$, but in order to do a Bayesian update you need to consider likelihood *ratios*. To see this more formally, let's look at the reward learning update:

$$p(\theta \mid \tau) = \frac{p(\tau \mid \theta)\, p(\theta)}{\sum_{\theta'} p(\tau \mid \theta')\, p(\theta')} = \frac{\dfrac{\exp(R_\theta(c(\tau)))}{\sum_{c'} p(c') \exp(R_\theta(c'))}\, p(\theta)}{\sum_{\theta'} \dfrac{\exp(R_{\theta'}(c(\tau)))}{\sum_{c'} p(c') \exp(R_{\theta'}(c'))}\, p(\theta')}.$$

In the last step, any linear terms in $p(\tau \mid \theta)$ that didn't depend on $\theta$ cancelled out. In particular, the prior over the selected class canceled out (though the prior did remain in the normalizer / denominator, where it can still affect things). But the simple argument of "the prior is lower, therefore it updates more strongly" doesn't seem to be reflected here.

Also, as you might expect, once we make the shift to thinking of selecting a cluster and then selecting a trajectory randomly, it no longer matters which trajectory you choose -- the only relevant information is the cluster chosen (you can see this in the update above, where the only thing you do with the trajectory is to see which cluster it is in). So from now on I'll just talk about selecting clusters, and updating on them. I'll also write $f_\theta(c) = \exp(R_\theta(c))$ for conciseness.

$$p(\theta \mid c) = \frac{\dfrac{f_\theta(c)}{\sum_{c'} p(c') f_\theta(c')}\, p(\theta)}{\sum_{\theta'} \dfrac{f_{\theta'}(c)}{\sum_{c'} p(c') f_{\theta'}(c')}\, p(\theta')}.$$

This is a horrifying mess of an equation. Let's switch to odds:

$$\frac{p(\theta_1 \mid c)}{p(\theta_2 \mid c)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{f_{\theta_1}(c)}{f_{\theta_2}(c)} \cdot \frac{\sum_{c'} p(c') f_{\theta_2}(c')}{\sum_{c'} p(c') f_{\theta_1}(c')}$$

The first two terms are the same across Boltzmann and LESS, since those only differ in their choice of $p(c)$. So let's consider just that last term. Denoting the vector of priors on all classes as $\vec{p}$, and similarly the vector of exponentiated rewards as $\vec{f}_\theta$, the last term becomes $\frac{\vec{p} \cdot \vec{f}_{\theta_2}}{\vec{p} \cdot \vec{f}_{\theta_1}} = \frac{|\vec{f}_{\theta_2}|}{|\vec{f}_{\theta_1}|} \cdot \frac{\cos \alpha_2}{\cos \alpha_1}$, where $\alpha_i$ is the angle between $\vec{p}$ and $\vec{f}_{\theta_i}$. Again, the first term doesn't differ between Boltzmann and LESS, so the only thing that differs between the two is the ratio $\frac{\cos \alpha_2}{\cos \alpha_1}$.

What happens when the chosen class $c$ is sparse? Without loss of generality, let's say that $f_{\theta_1}(c) > f_{\theta_2}(c)$; that is, $\theta_1$ is a better fit for the demonstration, and so we will update towards it. Since $c$ is sparse, $p(c)$ is smaller for Boltzmann than for LESS -- which *probably* means that it is better aligned with $\vec{f}_{\theta_2}$, which also has a low value of $f_{\theta_2}(c)$ by assumption. (However, this is by no means guaranteed.) In this case, the ratio above would be higher for Boltzmann than for LESS, and so it would more strongly update towards $\theta_1$, supporting the claim that Boltzmann would overlearn rather than underlearn when getting a demo from the sparse region.

(Note it does make sense to analyze the effect on the $\theta$ that we update towards, because in reward learning we care primarily about the $\theta$ that we end up having higher probability on.)
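As a sanity check, here is one toy instance of the cluster model, worked numerically. The setup is entirely made up for illustration: two clusters of sizes 1 and 100, two reward hypotheses with equal prior, and a demonstration from the sparse cluster. The "cluster prior" is uniform for LESS and proportional to cluster size for Boltzmann. It shows the overlearning direction in this case and is not a proof of the general claim.

```python
import math

def posterior_odds(cluster_prior, rewards_1, rewards_2, observed):
    """Posterior odds of hypothesis 1 vs hypothesis 2 (equal prior) after a demo from `observed`."""
    def likelihood(rewards):
        # Probability of picking each cluster: cluster prior times exp(reward), normalized.
        weights = [p * math.exp(r) for p, r in zip(cluster_prior, rewards)]
        return weights[observed] / sum(weights)
    return likelihood(rewards_1) / likelihood(rewards_2)

sizes = [1, 100]                 # cluster 0 is sparse, cluster 1 is dense
rewards_theta1 = [1.0, 0.0]      # hypothesis 1 prefers the sparse cluster
rewards_theta2 = [0.0, 1.0]      # hypothesis 2 prefers the dense cluster

less_odds = posterior_odds([1, 1], rewards_theta1, rewards_theta2, observed=0)
boltzmann_odds = posterior_odds(sizes, rewards_theta1, rewards_theta2, observed=0)

print(less_odds, boltzmann_odds)
```

Both models update towards the hypothesis that prefers the sparse cluster, but the Boltzmann odds are larger. Changing `sizes` and the reward vectors is a quick way to look for cases where the ordering reverses.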

I was reading Avoiding Side Effects By Considering Future Tasks, and it seemed like it was doing something very similar to relative reachability. This is an exploration of that; it assumes you have already read the paper and the relative reachability paper. It benefitted from discussion with Vika.

Define the reachability , where is the optimal policy for getting from to , and is the length of the trajectory. This is the notion of reachability both in the original paper and the new one.

Then, for the new paper when using a baseline, the future task value is:

where is the baseline state and is the future goal.

In a deterministic environment, this can be rewritten as:

Here, is relative reachability, and the last line depends on the fact that the goal is equally likely to be any state.

Note that the first term only depends on the number of timesteps, since it only depends on the baseline state s'. So for a fixed time step, the first term is a constant.

The optimal value function in the new paper is (page 3, and using my notation of instead of their ):

.

This is the regular Bellman equation, but with the following augmented reward (here is the baseline state at time t):

Terminal states:

Non-terminal states:

For comparison, the original relative reachability reward is:

The first and third terms in are very similar to the two terms in . The second term in only depends on the baseline.

All of these rewards so far are for *finite-horizon* MDPs (at least, that's what it sounds like from the paper, and if not, they could be anyway). Let's convert them to infinite-horizon MDPs (which will make things simpler, though that's not obvious yet). To convert a finite-horizon MDP to an infinite-horizon MDP, you take all the terminal states, add a self-loop, and multiply the rewards in terminal states by a factor of $(1 - \gamma)$ (to account for the fact that the agent gets that reward infinitely often, rather than just once as in the original MDP). Also define for convenience. Then, we have:

Non-terminal states:

What used to be terminal states that are now self-loop states:

Note that all of the transformations I've done have preserved the optimal policy, so any conclusions about these reward functions apply to the original methods. We're ready for analysis. There are exactly two differences between relative reachability and future state rewards:

**First,** the future state rewards have an extra term, .

This term depends only on the baseline . For the starting state and inaction baselines, the policy cannot affect this term at all. As a result, this term does not affect the optimal policy and doesn't matter.

For the stepwise inaction baseline, this term certainly does influence the policy, but in a bad way: the agent is incentivized to interfere with the environment to preserve reachability. For example, in the human-eating-sushi environment, the agent is incentivized to take the sushi off of the belt, so that in future baseline states, it is possible to reach goals that involve sushi.

**Second,** in non-terminal states, relative reachability weights the penalty by instead of . Really since and thus is an arbitrary hyperparameter, the actual big deal is that in relative reachability, the weight on the penalty switches from in non-terminal states to the smaller in terminal / self-loop states. This effectively means that relative reachability provides an incentive to finish the task faster, so that the penalty weight goes down faster. (This is also clear from the original paper: since it's a finite-horizon MDP, the faster you end the episode, the less penalty you accrue over time.)

**Summary:** The new paper's framing has two actual effects: (1) it removes the "extra" incentive to finish the task quickly that relative reachability provided, and (2) it adds an extra reward term that does nothing for starting state and inaction baselines but provides an interference incentive for the stepwise inaction baseline.

(That said, it starts from a very different place than the original RR paper, so it's interesting that they somewhat converge here.)
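The finite-to-infinite-horizon conversion used above can be checked numerically. This sketch assumes the scaling factor on terminal rewards is $(1 - \gamma)$, which is the factor that makes the geometric series over the self-loop sum back to the one-off terminal reward; the reward numbers are arbitrary and the infinite sum is truncated at a long horizon.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a list of per-step rewards."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

gamma = 0.9
step_rewards = [0.0, 1.0, 0.5]   # rewards before reaching the terminal state
terminal_reward = 4.0

# Finite-horizon MDP: the terminal reward is collected exactly once.
finite = discounted_return(step_rewards + [terminal_reward], gamma)

# Infinite-horizon MDP: the terminal state self-loops and pays
# (1 - gamma) * terminal_reward on every step from then on.
horizon = 2000
looped = step_rewards + [(1 - gamma) * terminal_reward] * horizon
infinite = discounted_return(looped, gamma)

print(finite, infinite)
```

The two returns agree up to floating-point and truncation error, which is why the conversion leaves the ranking of trajectories, and hence the optimal policy, unchanged.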

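The LCA paper (to be summarized in AN #98) presents a method for understanding the contribution of specific updates to specific parameters to the overall loss. The basic idea is to decompose the overall change in training loss across training iterations:

$$L(\theta_T) - L(\theta_0) = \sum_{t=0}^{T-1} \left[ L(\theta_{t+1}) - L(\theta_t) \right]$$

And then to decompose training loss across specific parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \int_{\vec{\theta}_t}^{\vec{\theta}_{t+1}} \vec{\nabla}_{\vec{\theta}} L(\vec{\theta}) \cdot d\vec{\theta}$$

I've added vector arrows to emphasize that $\theta$ is a vector and that we are taking a dot product. This is a path integral, but since gradients form a conservative field, we can choose any arbitrary path. We'll be choosing the linear path throughout. We can rewrite the integral as the dot product of the change in parameters and the average gradient:

$$L(\theta_{t+1}) - L(\theta_t) = (\theta_{t+1} - \theta_t) \cdot \text{Average}_t^{t+1}(\nabla L(\theta)).$$

(This is pretty standard, but I've included a derivation at the end.)

Since this is a dot product, it decomposes into a sum over the individual parameters:

$$L(\theta_{t+1}) - L(\theta_t) = \sum_i \left(\theta^{(i)}_{t+1} - \theta^{(i)}_t\right) \text{Average}_t^{t+1}(\nabla L(\theta))^{(i)}$$

So, for an individual parameter, and an individual training step, we can define the contribution to the change in loss as

$$A^{(i)}_t = \left(\theta^{(i)}_{t+1} - \theta^{(i)}_t\right) \text{Average}_t^{t+1}(\nabla L(\theta))^{(i)}$$

So based on this, I'm going to define my own version of LCA, called $\text{LCA}_{\text{Naive}}$. Suppose the gradient computed at training iteration $t$ is $G_t$ (which is a vector). $\text{LCA}_{\text{Naive}}$ uses the approximation $\text{Average}_t^{t+1}(\nabla L(\theta)) \approx G_t$, giving $A^{(i)}_{t,\text{Naive}} = \left(\theta^{(i)}_{t+1} - \theta^{(i)}_t\right) G^{(i)}_t$. But the SGD update is given by $\theta^{(i)}_{t+1} = \theta^{(i)}_t - \alpha G^{(i)}_t$ (where $\alpha$ is the learning rate), which implies that $A^{(i)}_{t,\text{Naive}} = -\alpha \left(G^{(i)}_t\right)^2$, which is always negative (or zero when the gradient is zero), i.e. it predicts that every parameter always learns in every iteration. This isn't surprising -- we decomposed the improvement in training into the movement of parameters along the gradient direction, but moving along the gradient direction is exactly what we do to train!

Yet, the experiments in the paper sometimes show positive LCAs. What's up with that? There are a few differences between $\text{LCA}_{\text{Naive}}$ and the actual method used in the paper:

1. The training method is sometimes Adam or Momentum-SGD, instead of regular SGD.

2. $\text{LCA}_{\text{Naive}}$ approximates the average gradient with the training gradient, which is only calculated on a *minibatch* of data. LCA uses the loss on the *full training dataset*.

3. $\text{LCA}_{\text{Naive}}$ uses a point estimate of the gradient and assumes it is the average, which is like a first-order / linear Taylor approximation (which gets worse the larger your learning rate / step size is). LCA proper uses multiple estimates between $\theta_t$ and $\theta_{t+1}$ to reduce the approximation error.

I *think* those are the only differences (though it's always hard to tell if there's some unmentioned detail that creates another difference), which means that whenever the paper says "these parameters had positive LCA", that effect can be attributed to some combination of the above 3 factors.

----

Derivation of turning the path integral into a dot product with an average:

$$\int_{\vec{\theta}_t}^{\vec{\theta}_{t+1}} \vec{\nabla} L(\vec{\theta}) \cdot d\vec{\theta} = \int_0^1 \vec{\nabla} L(\vec{\theta}(u)) \cdot (\vec{\theta}_{t+1} - \vec{\theta}_t)\, du$$

where $\vec{\theta}(u) = \vec{\theta}_t + u\,(\vec{\theta}_{t+1} - \vec{\theta}_t)$

$$= (\vec{\theta}_{t+1} - \vec{\theta}_t) \cdot \int_0^1 \vec{\nabla} L(\vec{\theta}(u))\, du = (\vec{\theta}_{t+1} - \vec{\theta}_t) \cdot \text{Average}_t^{t+1}(\nabla L(\theta))$$, where the average is defined as $\int_0^1 \vec{\nabla} L(\vec{\theta}(u))\, du$.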
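To make the third difference concrete, here is a toy sketch on a two-parameter quadratic loss with full-batch gradient descent and a deliberately large learning rate. The loss, learning rate, and function names are my own choices, not the paper's setup. The naive per-parameter contribution (parameter change times the gradient at the start of the step) is never positive, while the path-averaged contribution can be positive for a parameter that overshoots.

```python
def loss(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 10.0 * theta[1]]

def lca_step(theta_old, theta_new, n_points=100):
    """Per-parameter contribution: parameter change times the average gradient on the linear path."""
    avg = [0.0] * len(theta_old)
    for k in range(n_points):
        u = (k + 0.5) / n_points  # midpoint rule on [0, 1]
        point = [a + u * (b - a) for a, b in zip(theta_old, theta_new)]
        avg = [s + gi / n_points for s, gi in zip(avg, grad(point))]
    return [(b - a) * m for a, b, m in zip(theta_old, theta_new, avg)]

lr = 0.25                      # too large for the steep second coordinate
theta = [1.0, 1.0]
g = grad(theta)
theta_next = [t - lr * gi for t, gi in zip(theta, g)]

naive = [(b - a) * gi for a, b, gi in zip(theta, theta_next, g)]
exact = lca_step(theta, theta_next)

print("naive:", naive)
print("path-averaged:", exact)
print("loss change:", loss(theta_next) - loss(theta))
```

The path-averaged contributions sum to the actual change in loss, and the second parameter's contribution is positive because the step overshoots the minimum along that coordinate.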

In my double descent newsletter, I said:

> This fits into the broader story being told in other papers that what's happening is that the data has noise and/or misspecification, and at the interpolation threshold it fits the noise in a way that doesn't generalize, and after the interpolation threshold it fits the noise in a way that does generalize. [...]
>
> This explanation seems like it could explain double descent on model size and double descent on dataset size, but I don't see how it would explain double descent on training time. This would imply that gradient descent on neural nets first has to memorize noise in one particular way, and then further training "fixes" the weights to memorize noise in a different way that generalizes better. While I can't rule it out, this seems rather implausible to me. (Note that regularization is not such an explanation, because regularization applies throughout training, and doesn't "come into effect" after the interpolation threshold.)

One response you could have is to think that this could apply even at training time, because typical loss functions like cross-entropy loss and squared error loss very strongly penalize confident mistakes, and so initially the optimization is concerned with getting everything right, and only later can it be concerned with regularization.

I don't buy this argument either. I definitely agree that cross-entropy loss penalizes confident mistakes very highly, and has a very high derivative, and so initially in training most of the gradient will be reducing confident mistakes. However, you can get out of this regime simply by predicting the frequencies of each class (e.g. uniform for MNIST). If there are N classes, the worst case loss is when the classes are all equally likely, in which case the average loss per data point is $-\ln(1/N) = \ln N \approx 2.3$ when $N = 10$ (as for CIFAR-10, which is what their experiments were done on), which is not a good loss value but it does seem like regularization should already start having an effect. This is a really stupid and simple classifier to learn, and we'd expect that the neural net does at least this well very early in training, well before it reaches the interpolation threshold / critical regime, which is where it gets ~perfect training accuracy.
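The loss value for the "predict class frequencies" classifier is easy to check. In this sketch the confident-mistake probability of `1e-6` is an arbitrary illustrative number, included only to show the gap between the two regimes.

```python
import math

def cross_entropy(prob_of_true_class: float) -> float:
    """Per-example cross-entropy loss given the probability assigned to the true class."""
    return -math.log(prob_of_true_class)

n_classes = 10
uniform_loss = cross_entropy(1.0 / n_classes)   # predict uniform over 10 balanced classes
confident_mistake_loss = cross_entropy(1e-6)    # true class given probability one in a million

print(f"uniform prediction: {uniform_loss:.3f}")
print(f"confident mistake:  {confident_mistake_loss:.3f}")
```

The uniform predictor already sits at ln 10, far below the loss of a confidently wrong prediction, so the "confident mistakes dominate the gradient" regime can be left without fitting any individual example.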

There is a much stronger argument in the case of L2 regularization on MLPs and CNNs with relu activations. Presumably, if the problem is that the cross-entropy "overwhelms" the regularization initially, then we should also see double descent if we first train only on cross-entropy, and then train with L2 regularization. However, this can't be true. When training on just L2 regularization, the gradient descent update is:

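$$w \leftarrow w - \eta\,\nabla_w\!\left(\tfrac{\lambda}{2}\lVert w \rVert^2\right) = (1 - \eta\lambda)\,w = c\,w$$

for some constant $c$ (writing $\eta$ for the learning rate and $\lambda$ for the regularization strength).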

For MLPs and CNNs with relu activations, if you multiply all the weights by a constant, the logits also get multiplied by a constant, no matter what the input is. This means that the train/test error cannot be affected by L2 regularization alone, and so you can't see a double descent on test error in this setting. (This doesn't eliminate the possibility of double descent on test loss, since a change in the magnitude of the logits does affect the cross-entropy, but the OpenAI paper shows double descent on test error as well, and that provably can't happen in the "first train to zero error with cross-entropy and then regularize" setting.)
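The scaling property can be seen on a tiny example. This sketch assumes a bias-free ReLU MLP and a positive scaling constant (both are my assumptions; with biases, or with a negative constant, the argument needs more care). The layer sizes and random weights are arbitrary.

```python
import random

def relu_mlp(weights, x):
    """Bias-free MLP: ReLU on hidden layers, linear output logits."""
    h = x
    for layer_idx, w in enumerate(weights):
        h = [sum(wij * hj for wij, hj in zip(row, h)) for row in w]
        if layer_idx < len(weights) - 1:
            h = [max(0.0, v) for v in h]
    return h

rng = random.Random(0)
sizes = [4, 8, 8, 3]
weights = [
    [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    for n_in, n_out in zip(sizes, sizes[1:])
]
x = [rng.gauss(0, 1) for _ in range(sizes[0])]

c = 0.5  # e.g. one L2-only step that shrinks every weight by half
scaled = [[[c * wij for wij in row] for row in w] for w in weights]

logits = relu_mlp(weights, x)
scaled_logits = relu_mlp(scaled, x)
n_layers = len(weights)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

print(logits, scaled_logits)
print(argmax(logits) == argmax(scaled_logits))
```

Every logit is multiplied by the same positive factor (here `c ** n_layers`), so the predicted class is unchanged, while the softmax probabilities, and hence the cross-entropy, do change.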

The paper tests with CNNs, but doesn't mention what activation they use. Still, I'd find it very surprising if double descent only happened for a particular activation function.