## Posts

Improving capital gains taxes 2021-07-09T05:20:05.294Z
How much chess engine progress is about adapting to bigger computers? 2021-07-07T22:35:29.245Z
Experimentally evaluating whether honesty generalizes 2021-07-01T17:47:57.847Z
paulfchristiano's Shortform 2021-06-29T01:33:14.099Z
Avoiding the instrumental policy by hiding information about humans 2021-06-13T20:00:51.597Z
Answering questions honestly given world-model mismatches 2021-06-13T18:00:08.396Z
A naive alignment strategy and optimism about generalization 2021-06-10T00:10:02.184Z
Decoupling deliberation from competition 2021-05-25T18:50:03.879Z
Mundane solutions to exotic problems 2021-05-04T18:20:05.331Z
Low-stakes alignment 2021-04-30T00:10:06.163Z
AMA: Paul Christiano, alignment researcher 2021-04-28T18:55:39.707Z
Announcing the Alignment Research Center 2021-04-26T23:30:02.685Z
Another (outer) alignment failure story 2021-04-07T20:12:32.043Z
My research methodology 2021-03-22T21:20:07.046Z
Demand offsetting 2021-03-21T18:20:05.090Z
It’s not economically inefficient for a UBI to reduce recipient’s employment 2020-11-22T16:40:05.531Z
Hiring engineers and researchers to help align GPT-3 2020-10-01T18:54:23.551Z
“Unsupervised” translation as an (intent) alignment problem 2020-09-30T00:50:06.077Z
Distributed public goods provision 2020-09-26T21:20:05.352Z
Better priors as a safety problem 2020-07-05T21:20:02.851Z
Learning the prior 2020-07-05T21:00:01.192Z
Inaccessible information 2020-06-03T05:10:02.844Z
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z
Hedonic asymmetries 2020-01-26T02:10:01.323Z
Moral public goods 2020-01-26T00:10:01.803Z
Of arguments and wagers 2020-01-10T22:20:02.213Z
Prediction markets for internet points? 2019-10-27T19:30:00.898Z
AI alignment landscape 2019-10-13T02:10:01.135Z
Taxing investment income is complicated 2019-09-22T01:30:01.242Z
The strategy-stealing assumption 2019-09-16T15:23:25.339Z
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z
What failure looks like 2019-03-17T20:18:59.800Z
Security amplification 2019-02-06T17:28:19.995Z
Reliability amplification 2019-01-31T21:12:18.591Z
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z
Thoughts on reward engineering 2019-01-24T20:15:05.251Z
Learning with catastrophes 2019-01-23T03:01:26.397Z
Capability amplification 2019-01-20T07:03:27.879Z
The reward engineering problem 2019-01-16T18:47:24.075Z
Towards formalizing universality 2019-01-13T20:39:21.726Z
Directions and desiderata for AI alignment 2019-01-13T07:47:13.581Z
Ambitious vs. narrow value learning 2019-01-12T06:18:21.747Z
AlphaGo Zero and capability amplification 2019-01-09T00:40:13.391Z
Supervising strong learners by amplifying weak experts 2019-01-06T07:00:58.680Z
Benign model-free RL 2018-12-02T04:10:45.205Z
Corrigibility 2018-11-27T21:50:10.517Z
Humans Consulting HCH 2018-11-25T23:18:55.247Z

Comment by paulfchristiano on Another (outer) alignment failure story · 2021-08-01T17:56:15.040Z · LW · GW

I think the AI systems in this story have a clear understanding of the difference between the measurement and the thing itself.

Are humans similarly like drug addicts, because we'd prefer to experience play and love and friendship and so on even though we understand those things are mediocre approximations to "how many descendants we have"?

Comment by paulfchristiano on Answering questions honestly given world-model mismatches · 2021-07-31T02:01:54.172Z · LW · GW

Note that HumanAnswer and IntendedAnswer do different things. HumanAnswer spreads out its probability mass more, by first making an observation and then taking the whole distribution over worlds that were consistent with it.

Abstracting out Answer, let's just imagine that our AI outputs a distribution P over the space of trajectories T in the human ontology, and somehow we define a reward function r(P, y) evaluated by the human in hindsight after getting the observation y. The idea is that this is calculated by having the AI answer some questions about what it believes etc. but we'll abstract that all out.

Then the conclusion in this post holds under some convexity assumption on r, since then spreading out your mass can't really hurt you (since the human has no way to prefer your pointy estimate). But e.g. if you just penalized P for being uncertain, then IntendedAnswer could easily outperform HumanAnswer. Similarly, if we require that P satisfy various conditional independence properties then we may rule out HumanAnswer.

The more precise bad behavior InstrumentalAnswer is to output the distribution P that maximizes the reward r(P, y). Of course nothing else is going to get a higher reward. This is about as simple as HumanAnswer. It could end up being slightly more computationally complex. I think everything I've said about this case still applies for InstrumentalAnswer, but it's relevant when I start talking about stuff like conditional independence requirements between the model's answers.

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-31T01:11:26.621Z · LW · GW

Actually if A --> B --> C and I observe some function of (A, B, C) it's just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but means I want to be more careful about the definition of the case to ensure that it's actually difficult before concluding that this kind of conditional independence structure is potentially useful.
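The A+C example can be checked exactly on a tiny chain. The sketch below is only an illustration: the binary variables, the 1/4 flip probability and the helper names are assumptions, not part of the original setup. It enumerates the joint distribution of A --> B --> C and tests whether A and C are independent given B, first with no extra observation and then after learning that A+C = 1.

```python
from fractions import Fraction
from itertools import product

FLIP = Fraction(1, 4)  # assumed noise level on each causal link

def link(parent, child):
    """P(child | parent): the child copies the parent, flipped with probability FLIP."""
    return 1 - FLIP if parent == child else FLIP

# Exact joint distribution of the chain A --> B --> C over binary values.
joint = {
    (a, b, c): Fraction(1, 2) * link(a, b) * link(b, c)
    for a, b, c in product((0, 1), repeat=3)
}

def independent_given_b(keep):
    """Is A independent of C given B, among the worlds accepted by `keep`?"""
    for b in (0, 1):
        worlds = {w: p for w, p in joint.items() if w[1] == b and keep(w)}
        total = sum(worlds.values())
        if total == 0:
            continue
        for a, c in product((0, 1), repeat=2):
            p_ac = worlds.get((a, b, c), 0) / total
            p_a = sum(p for w, p in worlds.items() if w[0] == a) / total
            p_c = sum(p for w, p in worlds.items() if w[2] == c) / total
            if p_ac != p_a * p_c:
                return False
    return True

no_observation = independent_given_b(lambda w: True)
after_seeing_sum = independent_given_b(lambda w: w[0] + w[2] == 1)
print(no_observation, after_seeing_sum)
```

Without the observation the chain structure gives exact conditional independence; once A+C is known, A determines C, so the independence fails even though the causal graph is unchanged.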

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-30T22:53:07.336Z · LW · GW

This is also a way to think about the proposals in this post and the reply:

• The human believes that A' and B' are related in a certain way for simple+fundamental reasons.
• On the training distribution, all of the functions we are considering reproduce the expected relationship. However, the reason that they reproduce the expected relationship is quite different.
• For the intended function, you can verify this relationship by looking at the link (A --> B) and the coarse-graining applied to A and B, and verify that the probabilities work out. (That is, I can replace all of the rest of the computational graph with nonsense, or independent samples, and get the same relationship.)
• For the bad function, you have to look at basically the whole graph. That is, it's not the case that the human's beliefs about A' and B' have the right relationship for arbitrary Ys, they only have the right relationship for a very particular distribution of Ys. So to see that A' and B' have the right relationship, we need to simulate the actual underlying dynamics where A --> B, since that creates the correlations in Y that actually lead to the expected correlations between A' and B'.
• It seems like we believe not only that A' and B' are related in a certain way, but that the relationship should be for simple reasons, and so there's a real sense in which it's a bad sign if we need to do a ton of extra compute to verify that relationship. I still don't have a great handle on that kind of argument. I suspect it won't ultimately come down to "faster is better," though as a heuristic that seems to work surprisingly well. I think that this feels a bit more plausible to me as a story for why faster would be better (but only a bit).
• It's not always going to be quite this cut and dried---depending on the structure of the human beliefs we may automatically get the desired relationship between A' and B'. But if that's the case then one of the other relationships will be a contingent fact about Y---we can't reproduce all of the expected relationships for arbitrary Y, since our model presumably makes some substantive predictions about Y and if those predictions are violated we will break some of our inferences.

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-30T22:38:47.492Z · LW · GW

So are there some facts about conditional independencies that would privilege the intended mapping? Here is one option.

We believe that A' and C' should be independent conditioned on B'. One problem is that this isn't even true, because B' is a coarse-graining and so there are in fact correlations between A' and C' that the human doesn't understand. That said, I think that the bad map introduces further conditional correlations, even assuming B=B'. For example, if you imagine Y preserving some facts about A' and C', and if the human is sometimes mistaken about B'=B, then we will introduce extra correlations between the human's beliefs about A' and C'.

I think it's pretty plausible that there are necessarily some "new" correlations in any case where the human's inference is imperfect, but I'd like to understand that better.

So I think the biggest problem is that none of the human's believed conditional independencies actually hold---they are both imprecise, and (more problematically) they may themselves only hold "on distribution" in some appropriate sense.

This problem seems pretty approachable though and so I'm excited to spend some time thinking about it.

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-30T22:21:07.553Z · LW · GW

Causal structure is an intuitively appealing way to pick out the "intended" translation between an AI's model of the world and a human's model. For example, intuitively "There is a dog" causes "There is a barking sound." If we ask our neural net questions like "Is there a dog?" and it computes its answer by checking "Does a human labeler think there is a dog?" then its answers won't match the expected causal structure---so maybe we can avoid these kinds of answers.

What does that mean if we apply typical definitions of causality to ML training?

• If we define causality in terms of interventions, then this helps iff we have interventions in which the labeler is mistaken. In general, it seems we could just include examples with such interventions in the training set.
• Similarly, if we use some kind of closest-possible-world semantics, then we need to be able to train models to answer questions consistently about nearby worlds in which the labeler is mistaken. It's not clear how to train a system to do that. Probably the easiest is to have a human labeler in world X talking about what would happen in some other world Y, where the labeling process is potentially mistaken. (As in "decoupled rl" approaches.) However, in this case it seems liable to learn the "instrumental policy" that asks "What does a human in possible world X think about what would happen in world Y?" which seems only slightly harder than the original.
• We could talk about conditional independencies that we expect to remain robust on new distributions (e.g. in cases where humans are mistaken). I'll discuss this a bit in a reply.

Here's an abstract example to think about these proposals, just a special case of the example from this post.

• Suppose that reality M is described as a causal graph X --> A --> B --> C, and then the observation Y is a function of (A, B, C).
• The human's model M' of the situation is X --> A' --> B' --> C'. Each of them is a coarse-graining of the corresponding part of the real world model, and the observation Y is still a function of (A', B', C'), it's just more uncertain now.
• The coarse-grained dynamics are simpler than the actual coarse-graining f: (A, B, C) --> (A', B', C').
• We prepare a dataset by actually sampling (X, A, B, C, Y) from M, having humans look at it, make inferences about (A', B', C'), and get a dataset of (X, A', B', C', Y) tuples to train a model.
• The intended question-answering function is to use M to sample (A, B, C, Y) then apply the coarse-graining f to get (A', B', C'). But there is also a bad function that produces good answers on the training dataset: use M to sample (A, B, C, Y), then use the human's model to infer (A', B', C'), and output those.
• We'd like to rule out this bad function by making some kind of assumption about causal structure.
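
As a rough illustration of the intended function versus the bad function, here is a toy sketch. Everything concrete in it is an assumption made for the example: the fine-grained state is a temperature plus a sensor flag, the coarse-grained fact is "is it hot?", and the human's model ignores sensor failures. Both functions agree on a training distribution where the human's inference is always right, and come apart once the sensor can fail.

```python
import random

def sample_world(rng, sensor_can_fail):
    """Sample a fine-grained state A and the observation Y, which is a function of A."""
    temperature = rng.randint(0, 99)
    sensor_ok = not (sensor_can_fail and rng.random() < 0.5)
    a = (temperature, sensor_ok)
    y = temperature if sensor_ok else 0  # a broken sensor always reads 0
    return a, y

def coarse_grain(a):
    """Intended map f: read the coarse fact A' ("is it hot?") off the true state A."""
    temperature, _sensor_ok = a
    return temperature >= 50

def human_inference(y):
    """The human's model has no notion of sensor failure and trusts Y."""
    return y >= 50

def intended_answer(a, y):
    return coarse_grain(a)

def bad_answer(a, y):
    return human_inference(y)

def disagreement_rate(sensor_can_fail, n=2000, seed=0):
    """Fraction of sampled worlds where the two answering functions differ."""
    rng = random.Random(seed)
    worlds = [sample_world(rng, sensor_can_fail) for _ in range(n)]
    return sum(intended_answer(a, y) != bad_answer(a, y) for a, y in worlds) / n

train_disagreement = disagreement_rate(sensor_can_fail=False)
shifted_disagreement = disagreement_rate(sensor_can_fail=True)
print(train_disagreement, shifted_disagreement)
```

On the training distribution the two functions are indistinguishable from their answers alone, which is why some further assumption (such as causal structure) is needed to prefer the intended one.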

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-26T20:46:01.705Z · LW · GW

This is interesting to me for two reasons:

• [Mainly] Several proposals for avoiding the instrumental policy work by penalizing computation. But I have a really shaky philosophical grip on why that's a reasonable thing to do, and so all of those solutions end up feeling weird to me. I can still evaluate them based on what works on concrete examples, but things are slippery enough that plan A is getting a handle on why this is a good idea.
• In the long run I expect to have to handle learned optimizers by having the outer optimizer instead directly learn whatever the inner optimizer would have learned. This is an interesting setting to look at how that works out. (For example, in this case the outer optimizer just needs to be able to represent the hypothesis "There is a program that has property P and runs in time T' " and then do its own search over that space of faster programs.)

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-26T20:39:17.100Z · LW · GW

The speed prior still delegates to better search algorithms though. For example, suppose that someone is able to fill in a 1000 bit program using only 2^500 steps of local search. Then the local search algorithm has speed prior complexity 500 bits, so will beat the object-level program. And the prior we'd end up using is basically "2x longer = 2 more bits" instead of "2x longer = 1 more bit," i.e. we end up caring more about speed because we delegated.

The actual limit on how much you care about speed is given by whatever search algorithms work best. I think it's likely possible to "expose" what is going on to the outer optimizer (so that it finds a hypothesis like "This local search algorithm is good" and then uses it to find an object-level program, rather than directly finding a program that bundles both of them together). But I'd guess intuitively that it's just not even meaningful to talk about the "simplest" programs or any prior that cares less about speed than the optimal search algorithm.

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-26T20:21:12.391Z · LW · GW

In traditional settings, we are searching for a program M that is simpler than the property P. For example, the number of parameters in our model should be smaller than the size of the dataset we are trying to fit if we want the model to generalize. (This isn't true for modern DL because of subtleties with SGD optimizing imperfectly and implicit regularization and so on, but spiritually I think it's still fine.)

But this breaks down if we start doing something like imposing consistency checks and hoping that those change the result of learning. Intuitively it's also often not true for scientific explanations---even simple properties can be surprising and require explanation, and can be used to support theories that are much more complex than the observation itself.

Some thoughts:

1. It's quite plausible that in these cases we want to be doing something other than searching over programs. This is pretty clear in the "scientific explanation" case, and maybe it's the way to go for the kinds of alignment problems I've been thinking about recently.

A basic challenge with searching over programs is that we have to interpret the other data. For example, if "correspondence between two models of physics" is some kind of different object like a description in natural language, then some amplified human is going to have to be thinking about that correspondence to see if it explains the facts. If we search over correspondences, some of them will be "attacks" on the human that basically convince them to run a general computation in order to explain the data. So we have two options: (i) perfectly harden the evaluation process against such attacks, (ii) try to ensure that there is always some way to just directly do whatever the attacker convinced the human to do. But (i) seems quite hard, and (ii) basically requires us to put all of the generic programs in our search space.

2. It's also quite plausible that we'll just give up on things like consistency conditions. But those come up frequently enough in intuitive alignment schemes that I at least want to give them a fair shake.

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-26T20:13:59.972Z · LW · GW

The speed prior is calibrated such that this never happens if the learned optimizer is just using brute force---if it needs to search over 1 extra bit then it will take 2x longer, offsetting the gains.

That means that in the regime where P is simple, the speed prior is the "least you can reasonably care about speed"---if you care even less, you will just end up pushing the optimization into an inner process that is more concerned with speed and is therefore able to try a bunch of options.

(However, this is very mild, since the speed prior cares only a tiny bit about speed. Adding 100 bits to your program is the same as letting it run 2^100 times longer, so you are basically just optimizing for simplicity.)

To make this concrete, suppose that I instead used the kind-of-speed prior, where taking 4x longer is equivalent to using 1 extra bit of description complexity. And suppose that P is very simple relative to the complexities of the other objects involved. Suppose that the "object-level" program M has 1000 bits and runs in 2^2000 time, so has kind-of-speed complexity 2000 bits. A search that uses the speed prior will be able to find this algorithm in 2^3000 time, and so will have a kind-of-speed complexity of 1500 bits. So the kind-of-speed prior will just end up delegating to the speed prior.
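
The arithmetic in this example can be reproduced with a few lines. This is a minimal sketch; the function name and the assumption that the search procedure itself has negligible description length (because P is very simple) are mine.

```python
import math

def weighted_complexity(description_bits, log2_runtime, slowdown_per_bit):
    """Description length plus runtime cost, where running `slowdown_per_bit`
    times longer is charged the same as one extra bit of description."""
    return description_bits + log2_runtime / math.log2(slowdown_per_bit)

# Object-level program M: 1000 bits, runs in 2^2000 time.
m_kind_of_speed = weighted_complexity(1000, 2000, slowdown_per_bit=4)
m_speed = weighted_complexity(1000, 2000, slowdown_per_bit=2)

# A brute-force search under the speed prior finds M in about 2^(speed complexity)
# time. Assumption: the search procedure costs ~0 bits to describe.
search_kind_of_speed = weighted_complexity(0, m_speed, slowdown_per_bit=4)

print(m_kind_of_speed, m_speed, search_kind_of_speed)
```

Under these assumptions the search has lower kind-of-speed complexity than M itself, which is the sense in which the kind-of-speed prior delegates to the speed prior.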

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-26T17:09:56.339Z · LW · GW

Suppose I am interested in finding a program M whose input-output behavior has some property P that I can probabilistically check relatively quickly (e.g. I want to check whether M implements a sparse cut of some large implicit graph). I believe there is some simple and fast program M that does the trick. But even this relatively simple M is much more complex than the specification of the property P.

Now suppose I search for the simplest program running in time T that has property P. If T is sufficiently large, then I will end up getting the program "Search for the simplest program running in time T' that has property P, then run that." (Or something even simpler, but the point is that it will make no reference to the intended program M since encoding P is cheaper.)

I may be happy enough with this outcome, but there's some intuitive sense in which something weird and undesirable has happened here (and I may get in a distinctive kind of trouble if P is an approximate evaluation). I think this is likely to be a useful maximally-simplified example to think about.

Comment by paulfchristiano on A closer look at chess scalings (into the past) · 2021-07-23T19:22:23.739Z · LW · GW

The results look quite different for Houdini 3 vs SF8---is this just a matter of Stockfish being much better optimized for small amounts of hardware?

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-23T00:52:37.057Z · LW · GW

We might be able to get similar advantages with a more general proposal like:

Fit a function f to a (Q, A) dataset with lots of questions about latent structure. Minimize the sum of some typical QA objective and the computational cost of verifying that f is consistent.

Then the idea is that matching the conditional probabilities from the human's model (or at least being consistent with what the human believes strongly about those conditional probabilities) essentially falls out of a consistency condition.

It's not clear how to actually formulate that consistency condition, but it seems like an improvement over the prior situation (which was just baking in the obviously-untenable requirement of exactly matching). It's also not clear what happens if this consistency condition is soft.

It's not clear what "verify that the consistency conditions are met" means. You can always do the same proposal as in the parent, though it's not really clear if that's a convincing verification. But I think that's a fundamental philosophical problem that both of these proposals need to confront.

It's not clear how to balance computational cost and the QA objective. But you are able to avoid most of the bad properties just by being on the Pareto frontier, and I don't think this is worse than the prior proposal.

Overall this approach seems like it could avoid making such strong structural assumptions about the underlying model. It also helps a lot with the overlapping explanations + uniformity problem. And it generally seems to be inching towards feeling plausible.

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-21T02:17:14.173Z · LW · GW

Here's another approach to "shortest circuit" that is designed to avoid this problem:

• Learn a circuit C that outputs an entire set of beliefs. (Or maybe some different architecture, but with ~0 weight sharing so that computational complexity = description complexity.)
• Impose a consistency requirement on those beliefs, even in cases where a human can't tell the right answer.
• Require C's beliefs about Y to match F(X), the output of the underlying predictive model F. We hope that this makes C an explication of "F's beliefs."
• Optimize some combination of (complexity) vs (usefulness), or chart the whole Pareto frontier, or whatever. I'm a bit confused about how this step would work but there are similar difficulties for the other posts in this genre so it's exciting if this proposal gets to that final step.

The "intended" circuit C just follows along with the computation done by F and then translates its internal state into natural language.

What about the problem case where F computes some reasonable beliefs (e.g. using the instrumental policy, where the simplicity prior makes us skeptical about their generalization) that C could just read off? I'll imagine those being written down somewhere on a slip of paper inside of F's model of the world.

• Suppose that the slip of paper is not relevant to predicting Y, i.e. it's a spandrel from the weight sharing. Then the simplest circuit C just wants to cut it out. Whatever computation was done to write things down on the slip of paper can be done directly by C, so it seems like we're in business.
• So suppose that the slip of paper is relevant for predicting Y, e.g. because someone looks at the slip of paper and then takes an action that affects Y. If (the correct) Y is itself depicted on the slip of paper, then we can again cut out the slip of paper itself and just run the same computation (that was done by whoever wrote something on the slip of paper). Otherwise, the answers produced by C still have to contain both the items on the slip of paper as well as some facts that are causally downstream of the slip of paper (as well as hopefully some about the slip of paper itself). At that point it seems like we have a pretty good chance of getting a consistency violation out of C.

Probably nothing like this can work, but I now feel like there are two live proposals for capturing the optimistic minimal circuits intuition---the one in this current comment, and in this other comment. I still feel like the aggressive speed penalization is doing something, and I feel like probably we can either find a working proposal in that space or else come up with some clearer counterexample.

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-20T16:43:38.635Z · LW · GW

I was proposing exempting the short-term risk-free rate, and I was imagining using the 30-day treasury yield as the metric. (The post originally said that but it got simplified in the interest of clarity---of course "savings account" is vague since they pay different amounts with different risk, but it seems to communicate basically the same stuff.) That's also roughly the rate at which you'd borrow if using leverage to offset your tax burden (e.g. it's roughly the rate embedded in futures or at which investors can borrow on margin).

Comment by paulfchristiano on Benchmarking an old chess engine on new hardware · 2021-07-16T15:24:26.899Z · LW · GW

Very interesting, thanks!

• Could you confirm how much you have to scale down SF13 in order to match SF3? (This seems similar to what you did last time, but a more direct comparison.)
• The graph from last time makes it look like SF13 would match Rebel at about 20k nodes/move. Could you also confirm that?
• Looking forward to seeing the scaled-up Rebel results.

Comment by paulfchristiano on A closer look at chess scalings (into the past) · 2021-07-15T20:13:30.750Z · LW · GW

In another comment you wrote "In between is the region with ~70 ELO; that's where engines usually operate on present hardware with minutes of think time" which made sense to me, I'm just trying to square that with this graph.

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-15T20:07:04.587Z · LW · GW

Recently I've been thinking about ML systems that generalize poorly (copying human errors) because of either re-using predictive models of humans or using human inference procedures to map between world models.

My initial focus was on preventing re-using predictive models of humans. But I'm feeling increasingly like there is going to be a single solution to the two problems, and that the world-model mismatch problem is a good domain to develop the kind of algorithm we need. I want to say a bit about why.

I'm currently thinking about dealing with world model mismatches by learning a correspondence between models using something other than a simplicity prior / training a neural network to answer questions. Intuitively we want to do something more like "lining up" the two models and seeing what parts correspond to which others. We have a lot of conditions/criteria for such alignments, so we don't necessarily have to just stick with simplicity. This comment fleshes out one possible approach a little bit.

If this approach succeeds, then it is also directly applicable to avoiding re-using human models---we want to be lining up the internal computation of our model with concepts like "There is a cat in the room" rather than just asking the model to predict whether there is a cat however it wants (which it may do by copying a human labeler). And on the flip side, I think that the "re-using human models" problem is a good constraint to have in mind when thinking about ways to do this correspondence. (Roughly speaking, because something like computational speed or "locality" seems like a really central constraint for matching up world models, and doing that approach naively can greatly exacerbate the problems with copying the training process.)

So for now I think it makes sense for me to focus on whether learning this correspondence is actually plausible. If that succeeds then I can step back and see how that changes my overall view of the landscape (I think it might be quite a significant change), and if it fails then I hope to at least know a bit more about the world model mismatch problem.

I think the best analogy in existing practice is probably doing interpretability work---mapping up the AI's model to my model is kind of like looking at neurons and trying to make sense of what they are computing (or looking for neurons that compute something). And giving up on a "simplicity prior" is very natural when doing interpretability, instead using other considerations to determine whether a correspondence is good. It still seems kind of plausible that in retrospect my current work will look like it was trying to get a solid theoretical picture on what interpretability should do (including in the regime where the correspondence is quite complex, and when the goal is a much more complete level of understanding). I swing back and forth on how strong the analogy to interpretability seems / whether or not this is how it will look in retrospect. (But at any rate, my research methodology feels like a very different approach to similar questions.)

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-15T18:45:15.156Z · LW · GW

Here's a slightly more formal algorithm along these lines:

• Assume that both the human's model W_H and the AI's model W_AI are Bayesian networks where you compute the probability distribution over a node v's value based on the values of its parents pa(v). I'll write Values(v) for the set of values that a node v can take on (in either model), and Values(S) for the joint values of a set of nodes S.
• A correspondence tells you how to compute the value of each node v in the human's model. This consists of (i) a neighborhood N(v) in the AI's model which suffices to determine v, (ii) a function f_v: Values(N(v)) --> Values(v).
• Both the AI's model and the human model contain some distinguished observation nodes. f must be the identity on these nodes.
• An "explanation" of a correspondence consists of a set of nodes E(v) in the AI's model for each node v in the human's model. The intuition is that we can run a simulation involving only these nodes in order to reproduce the probability distribution of v given its parents' values.
• In particular, N(v) ⊆ E(v), and N(u) ⊆ E(v) for all u in pa(v). In order to check whether E(v) reproduces the right distribution, we first sample values at random for all the nodes some of whose parents aren't in E(v). Then we sample values for the remaining nodes. We can use f to compute the corresponding values for v and all of its parents. And then we can compute the conditional distributions for v given each set of values for its parents.
• We require that the explanations exactly reproduce the conditional probability over Values(v) given Values(pa(v)).
• The "cost" of the explanation of v is the sum of the compute required to sample all the nodes in E(v). The "cost" of the correspondence f_v is the compute required to evaluate it.
• We search for the set of correspondences and explanations for which the total cost is minimized.
• (Maybe we also have some requirement where the correspondence f_v agrees with some training data about v. I'm not really sure about that.)

Reviewing how this behaves in each of the bad cases from the parent:

• It's very bad to define the correspondence by computing the observation and then using the human's inference algorithm. The entire motivation for defining it this way was to save on description complexity, but the correspondence is only being penalized for computational complexity. (This also forces every single explanation to include the entire process that generates the observations, which seems like it should be an even bigger deal. But this feels less like the "right" reason and I think it might change for a future version of the scheme.)
• If there is a human in the simulation who knows the value of a node, it's extremely bad to define that node's neighborhood to be that human. This is because the node's explanation will need to include the entire process by which the human computes that value. This is guaranteed to be more expensive than just performing that compute in the correspondence itself.
• We could define the correspondence to compute the value of each node from scratch using the initial values of the simulation, using some more-efficient simulation. But we can't do this for the observation node, since it is required to map up with the observation node in the AI's model. So in order to explain the observation node we will need to have its explanation include the AI's entire model, or at least the parts leading up to the observation. This means we are basically paying for two full copies of the computation, and so we're not really benefiting from the fact that the second copy is more efficient. (We are also potentially overlapping a lot of computation between the different correspondences, but again I'm a bit less sure how robust that is and whether it will hold up in different formalizations.)
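
The cost accounting behind these comparisons can be sketched on a tiny example. All node names and compute costs below are hypothetical, chosen only to illustrate the bookkeeping: each human-model node pays for every AI-model node in its explanation plus the cost of evaluating its correspondence, and overlapping explanations are paid for separately.

```python
# Hypothetical per-node sampling costs in the AI's model (arbitrary compute units).
node_cost = {"x": 1, "a": 5, "b": 5, "c": 5, "obs": 2}

# Hypothetical compute needed to evaluate each node's correspondence function.
correspondence_cost = {"a_coarse": 1, "b_coarse": 1, "c_coarse": 1}

# Local explanations: each human-model node is explained by a small
# neighborhood of the AI's model.
local_explanations = {
    "a_coarse": {"x", "a"},
    "b_coarse": {"a", "b"},
    "c_coarse": {"b", "c"},
}

# Explanations that route through the observation: every human-model node
# needs the entire process that generates the observation.
via_observation = {
    node: {"x", "a", "b", "c", "obs"} for node in local_explanations
}

def total_cost(explanations):
    """Sum, over human-model nodes, of explanation cost plus correspondence cost.
    Nodes shared between explanations are paid for once per explanation."""
    return sum(
        sum(node_cost[n] for n in nodes) + correspondence_cost[v]
        for v, nodes in explanations.items()
    )

local_total = total_cost(local_explanations)
observation_total = total_cost(via_observation)
print(local_total, observation_total)
```

In this toy accounting the local explanations come out cheaper because each one touches only a small part of the AI's model, whereas routing through the observation pays for the whole generating process once per explained node.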

There are a lot of problems and missing details in this proposal:

• This requires exactly reproducing the conditional probabilities in the human's model. But that's totally unrealistic for anything short of a full simulation---and even for the full simulation it wouldn't reproduce the probabilities since the human model isn't accurate. So we need some way to choose a good enough explanation, i.e. a way of balancing the computational complexity of the explanation against the quality of the conditional probabilities that come out.
• We're sampling the inputs to  uniformly at random. This seems unlikely to work in general. We could easily sample each node from its marginal, but most of the action is in the correlation. Allowing arbitrary correlations causes problems (since you could just specify the "human is accurate" correlation and then read off the correct answers from there). So I think probably some more flexible system is needed here; there are a lot of options but it's tricky.
• There is something problematic about the overlapping explanations . If they overlap you need to pay for all of them, but for the intended mapping there will often be quite significant overlap. This isn't inherently a problem, but I'm scared that it's going to introduce a lot of pressure towards some different correspondence that is able to avoid that problem. We need to penalize overlap because of the case where the training data is embedded in the model: the main problem with that model is that you need to separately explain every way in which the human is correct, using highly overlapping explanations. If you didn't penalize overlap then you might just end up with the embedded explanations (for which  is extremely cheap).
• There is something tricky about uniformity in the model and in the implementations of .
• I'm still scared about the "recompute everything from scratch" failure mode. The model does need to have a single explanation  that needs to include the whole model. But (i) it doesn't have to reproduce work, (ii) it can cut out all the stuff not on the path to the observation. So the obvious reason that this one loses is by the duplicated work in . Hopefully that's actually robust.
• We are making really strong structural assumptions on the models and the correspondence between them. We get some things for free (because humans actually do have extra structure in our beliefs about the world that is properly part of the problem statement, and the AI's model is constrained by its architecture) but not nearly this much.

Overall I'm becoming significantly more optimistic that something like this will work (though still less likely than not). Trying to step back and see the big picture, it seems like there are three key active ingredients:

• Using "speed" instead of "simplicity" as the ~only requirement for these correspondences.
• Having separate correspondences for separate properties and not allowing them to share tons of computation with each other (to prevent re-running the whole simulation).
• Forcing the model to explain correlations, so that using an "embedded" copy of the answers (like a simulation of the data-generating process) forces you to reproduce the computation that produced that answer.

My next step would probably be looking at cases where these high-level ingredients aren't sufficient (e.g. are there cases where "generate obs then do inference in the human model" is actually cheaper?). If they look pretty good, then I'll spend some more time trying to fill in the details in a more plausible way.

Comment by paulfchristiano on Answering questions honestly instead of predicting human answers: lots of problems and some solutions · 2021-07-15T17:30:03.094Z · LW · GW
• I don't think you actually want to use supervised training for training , you want to use feedback of the form "Is this answer much wronger than that answer?" and then train the model to not produce definitely-wrong answers.
• Likewise the  constraint would really want to be something softer (e.g. forcing  to give plausible-looking answers to questions as evaluated by ).
• I think that most questions about what is useful / tacitly assumed / etc. can be easily handled on top of the "raw" ability to elicit the model's knowledge (if you like you could imagine having a debate about which answer is better all things considered, using  to assess the model's beliefs about closed questions).
• I do think there are a lot of problems along these lines that you'd want to think about a bunch in theory, and then later need to do a bunch of empirical work on. But unfortunately I also think there are a lot of "bigger fish to fry" that are very likely to sink this entire family of approaches. So the first order of business is understanding those and wandering our way to a general category of solution that might actually work.
Comment by paulfchristiano on A closer look at chess scalings (into the past) · 2021-07-15T17:24:25.111Z · LW · GW

I'm quite surprised by how far out on the Elo vs compute curve we already are by a million nodes/move. Is this the main "target platform" for stockfish, or are people mostly trying to optimize the performance for significantly smaller node counts?

(I'm wondering whether such strong diminishing returns are fundamental to the domain, or whether people are putting the most work into optimizing performance down at more like 100kNodes/sec.)

Comment by paulfchristiano on A closer look at chess scalings (into the past) · 2021-07-15T16:49:54.960Z · LW · GW

I'm confused about the top end of the graph. Shouldn't SF8 with the reference compute basically match the final datapoints? But it looks like you'd have to scale it up extremely far to get to such a high Elo.

Comment by paulfchristiano on Experimentally evaluating whether honesty generalizes · 2021-07-14T00:43:54.845Z · LW · GW

I do expect "explanations of what's going on in this sentence" to be a lot weaker than translations.

For that task, I expect that the model trained on coherence + similar tasks will outperform a 10x larger pre-trained model. If the larger pre-trained model gets context stuffing on similar tasks, but no coherence training, then it's less clear to me.

But I guess the point is that the differences between various degrees of successful-generalization will be relatively small compared to model size effects. It doesn't matter so much how good the transfer model is relative to the pre-trained baseline, it matters how large the differences between the possible worlds that we are hoping to distinguish are.

I guess my main hope there is to try to understand whether there is some setting where transfer works quite well, either getting very close to the model fine-tuned on distribution, or at least converging as the pre-trained model grows. Hopefully that will make it easier to notice the effects we are looking for, and it's OK if those effects are small relative to model doublings.

(Also worth noting that "as good as increasing model size by 10%" is potentially quite economically relevant. So I'm mostly just thinking about the extent to which it can make effects hard to measure.)

Comment by paulfchristiano on Experimentally evaluating whether honesty generalizes · 2021-07-14T00:37:12.797Z · LW · GW

The issue, then, is that the "fine-tuning for correctness" and "fine-tuning for coherence" processes are not really equivalent--fine-tuning for correctness is in fact giving GPT-3 additional information about tone, which improves its capabilities. In addition, GPT-3 might not "know" exactly what humans mean by the word tone, and so fine-tuning for correctness also helps GPT-3 to better understand the question.

Part of my hope is that "coherence" can do quite a lot of the "telling you what humans mean about tone." For example, you can basically force the model to talk (in English) about what things contribute to tone, and why it thinks the tone is like such and such (or even what the tone of English sentences is)---anything that a human who doesn't know French can evaluate. And taken together those things seem like enough to mostly pin down what we are talking about.

Given these considerations, my modal expectation is that fine-tuning for correctness will provide moderately better results than just doing coherence, but it won't be clear how to interpret the difference--maybe in both cases GPT-3 provides incoherent outputs 10% of the time, and then additionally coherent but wrong outputs 10% of the time when fine-tuned for correctness, but 17% of the time when fine-tuned only for coherence. What would you conclude from a result like that?

I'd tentatively interpret that as a negative result, but I agree with your comments below that ultimately a lot of what we care about here is the scaling behavior and putting together a more holistic picture of what's going on, in particular:

• As we introduce stronger coherence checks, what happens to the accuracy? Is it approaching the quality of correctness, or is it going to asymptote much lower?
• Is the gap shrinking as model quality improves, or growing? Do we think that very large models would converge to a small gap or is it a constant?

I'm also quite interested in the qualitative behavior. Probably most interesting are the cases where the initial model is incoherent, the coherence-tuned model is coherent-but-wrong, and the correctness-tuned model is correct. (Of course every example is also fuzzy because of noise from sampling and training, but the degree of fuzziness is smaller as we remove randomness.) In these cases, what is happening with the coherence-tuned model? Are we able to see cases where it cleanly feels like the "wrong" generalization, or is it a plausible ambiguity about what we were looking for? And so on.

I'm interested in the related engineering question: in this setting, what can we do to improve the kind of generalization we get? Can we get some handle on the performance gap and possible approaches to closing it?

And finally I'm interested in understanding how the phenomenon depends on the task: is it basically similar in different domains / for different kinds of question or quite different? How does it depend on the number / type / degree of similarity of the categories?

So perhaps my main feedback would be to think about how likely you think such an outcome is, how much you mind that

I generally agree that my post simplified the empirical situation and actually getting convincing results would require careful empirical work. I do think that initial results (like the 17% vs 10% error rate) would help provide some basic orientation; even if it's a small datapoint with unclear conclusions it still gives us some sense of what is basically going on, what kind of numbers we are talking about, and what would actually be interesting to measure.

(My guess is that we are roughly on the same page here.)

if there are alternative tasks that avoid this issue without being significantly more complicated.

I do think it's worth spending time trying to think of better tasks, though I'm not optimistic about finding something a lot better (e.g. avoiding the need for doing a bunch of experiments to understand how the results vary and trying to do some extrapolation to big models).

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-13T00:45:14.713Z · LW · GW

Let's say the risk-free rate is 0 and the tax rate is 50%. Then 2x leverage doubles your profit and doubles your losses---that's the sense in which it increases risk. But then taxes cut your profit and losses by the same 50%.

So consider an investment that doubles your money with probability 60%. Without taxes you wanted to invest $X, and have a 60% chance of making $X and a 40% chance of losing $X. But with a 50% tax rate, you want to invest $2X. Then you have a 60% chance of making $2X, paying half in taxes, and ending up with $X in profit; and a 40% chance of losing $2X, getting half back as a tax rebate, and ending up with $X in losses. So the outcome is identical to the pre-tax world.

Getting back the money immediately, without FUD about whether you'll ever be able to use the tax rebate, is pretty important to meaningfully reducing your risk. (You also are going to need that rebate in order to pay off the margin loan, and someone is willing to lend it to you precisely because they know that you can use your tax rebate to make them whole if you get wiped out. One reason this may not work in practice is that the person making the margin loan may be concerned about seniority of their debt if they can't directly claim your tax rebate in the same way a margin lender would traditionally liquidate your assets.)

If the risk-free rate is not zero then the exact same analysis applies---2x leverage multiplies (your return - the risk free rate) by 2, and then a 50% tax rate reduces (your return - the risk free rate) by 2.
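
The arithmetic above can be checked with a minimal sketch. It assumes, as the argument in this thread does, that borrowing costs exactly the risk-free rate, that only the return in excess of the risk-free rate is taxed, and that losses are rebated immediately at the same rate. The function name and parameters are hypothetical, chosen only for illustration.

```python
def after_tax_excess(stake, asset_return, risk_free=0.0, leverage=1.0, tax_rate=0.0):
    """Profit or loss relative to simply holding `stake` at the risk-free rate.

    Assumptions (for illustration only): borrowing costs the risk-free rate,
    the excess return is taxed at `tax_rate`, and losses are rebated in full.
    """
    excess = stake * leverage * (asset_return - risk_free)
    return excess * (1.0 - tax_rate)


# Risk-free rate of 0: the investment either doubles or is wiped out.
for asset_return in (1.0, -1.0):
    untaxed = after_tax_excess(100.0, asset_return)
    taxed_2x = after_tax_excess(100.0, asset_return, leverage=2.0, tax_rate=0.5)
    print(asset_return, untaxed, taxed_2x)

# Nonzero risk-free rate: the same equivalence holds for the excess return.
untaxed = after_tax_excess(100.0, 0.10, risk_free=0.02)
taxed_2x = after_tax_excess(100.0, 0.10, risk_free=0.02, leverage=2.0, tax_rate=0.5)
print(untaxed, taxed_2x)
```

In each case the untaxed, unlevered outcome and the taxed, 2x-levered outcome coincide, which is the sense in which the tax plus leverage leaves the investor's position unchanged.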

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-12T15:56:17.317Z · LW · GW

Sure, sorry for the shorthand.

Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-11T18:59:01.336Z · LW · GW

Stockfish 12 and newer have neural network (NNUE)-based evaluation enabled by default so I wouldn't say that Stockfish is similar to other non-NN modern engines.

I was imagining using Stockfish from before the introduction of NNUE (I think that's August 2020?). Seems worth being careful about.

https://nextchessmove.com/dev-builds is based on playing various versions of Stockfish against each other. However, it is known that this overestimates the ELO gain. I believe +70 ELO for doubling compute is also on the high side, even on single-core computers.

I am very interested in the extent to which "play against copies of yourself" overstates the Elo gains.

I am hoping to get some mileage / robustness out of the direct comparison---how much do we have to scale up/down the old/new engine for them to be well-matched with each other? Hopefully that will look similar to the numbers from looking directly at Elo.

(But point taken about the claimed degree of algorithmic progress above.)
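
As a rough aid for relating Elo gaps to compute scaling, here is a small sketch. It uses the standard Elo expected-score formula and then converts an Elo gap into a compute multiplier under the strong simplifying assumption that each doubling of compute is worth a constant number of Elo points. That assumption is only approximate (returns diminish at high node counts), the +70 figure is the quoted estimate that is itself described above as being on the high side, and the function names are hypothetical.

```python
def expected_score(elo_diff):
    """Standard Elo expected score for a player rated `elo_diff` points higher."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def compute_multiplier(elo_gap, elo_per_doubling):
    """Compute scale-up needed to close `elo_gap`.

    Assumes a constant Elo gain per doubling of compute, which is only a
    rough approximation over a limited range of node counts.
    """
    return 2.0 ** (elo_gap / elo_per_doubling)


# Illustrative use with the quoted +70 Elo per doubling.
for gap in (70.0, 140.0, 280.0):
    print(gap, expected_score(gap), compute_multiplier(gap, 70.0))
```

The direct comparison described above avoids relying on a fixed Elo-per-doubling figure: it asks for the multiplier at which the two engines score about 50% against each other.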

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T21:16:53.685Z · LW · GW

I don't think the current system really makes any sense whatsoever. I think it would make sense to (i) allow donors to value assets at either basis or market price (regardless of long-term or short-term holdings, none of this absurd nonsense in the status quo), (ii) charge capital gains taxes on any assets being valued at market price when donated, (iii) allow deductions for up to 100% of income instead of the crazy complicated set of limits currently imposed.

(You could also just value everything at cost and then have people sell assets if they want to value their donation at market value. Either way seems fine.)

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T17:56:21.830Z · LW · GW

I do agree it's somewhat annoying accounting, but it's like 1% of the annoyance of the current tax code.

I think the case for progressive taxes is reasonably good:

• From behind the veil of ignorance I'd prefer more redistribution than I can get from an efficient flat tax.
• Also see the minimizing-distortion argument in this post.

That said, I'd also be reasonably happy with something like a flat 50% tax + $20k UBI which I guess is the kind of thing you have in mind?

A progressive consumption tax doesn't seem like too much trouble:

• Income goes into your savings account. You make investments in that account.
• You can't consume anything from your savings account. You have to first move the money to a personal checking account.
• The only number you report on your tax return is "How much money did I move into my personal checking account?"

To me this seems like quite plausibly less accounting burden than everyone who makes anything charging a VAT.

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T02:24:23.339Z · LW · GW

This does seem like a kind of basic problem and maybe hard to resolve without making the proposal more complicated. In particular, the year you have losses you are presumably in a very low bracket indeed. And using a rate from your past would be even more complicated than carrying deductions forward.

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T02:23:03.987Z · LW · GW

If you can immediately offset all your losses such tax basically feels like the government de-levereging you (taking a percentage of all gains and losses). I.e. you can get the same outcome as without the tax by investing times more.

Yeah, that's the hope.

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T02:22:41.266Z · LW · GW

I'm imagining everyone gets the money back. In order to lose money you have to have made it, right?

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T00:15:52.913Z · LW · GW

people buy lottery tickets. So, we should expect the government to lose money (if possible):

• from people taking on stupid risks

The government loses money if the average investor is losing money. But that's not true, the average investor makes a lot of money. (And once you weight by tax %, it's even more stark.)

I guess you could think there are some investors who go bust (losing too large a fraction of their ordinary income to ever collect on the rebate) while others won't change their behavior at all under this policy. But I do think we need to get into some kind of quantitative model of that (and it doesn't look super likely to me).

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T00:13:40.459Z · LW · GW

Is there enough money in the world for all investments to lever up 100%?

Yes :) But in some sense this is just equivalent to the worry that interest rates will go up.

There's certainly not enough that the borrowing costs would be trivial, if debt demand were suddenly so high.

It's possible that short-term interest rates would go up. This is basically the government financing large investments to match private returns, and the extra borrowing can drive up the interest rate.

Also, 100% leverage doubles the risk for the same return (by hypothesis) which probably needs some more support before it's clear that that is socially better compared to status quo.

This proposal introduces some additional variance in tax revenue (it doesn't increase variance for the taxpayer, since they are just giving a slice of the profits or losses to the government while keeping the same $---and the same variance---for themselves). I agree it's complicated whether you are actually coming out ahead though I think you probably are by a fairly large margin, similar to a sovereign wealth fund. Certainly if people think that the excess returns to capital are an undesirable source of inequality, they should definitely want to do something like this.

A better model is that investment capital seeks the best risk adjusted return. Right now there's a balance between opportunities in debt, equities, real assets, etc. If you increase taxes and therefore decrease return on equities, enough capital will move out to other asset classes until the risk adjusted returns are roughly equal.

Maybe that new equilibrium is better or maybe it's worse, but denying that it will change I think makes your analysis hard to accept.

I think the thing that needs analysis is what this does to the government's balance sheet, and especially the impact of the extra variance in tax revenues.

(The returns to equities will tend to fall because the government is effectively running a giant sovereign wealth fund, matching all investment from the private sector. They won't fall because of taxes reducing the returns though, at least not if investors are profit-maximizing.)

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T00:08:43.968Z · LW · GW

I think this policy is essentially equivalent to progressive consumption taxes, and I'm suggesting it because it appears to require relatively technical changes (allowing rebates + a small adjustment to cost basis) whereas progressive consumption taxes are a bit complex.

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T00:07:45.255Z · LW · GW

I think the case for efficiency improvements is fairly strong, but you can evaluate it as you will. This post is unusual for my blog in that it gives fairness arguments instead of efficiency arguments, but I've discussed the efficiency arguments before (including in the linked FB comment, and implicitly here).

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-10T00:05:22.811Z · LW · GW

The other two however sound that they add a lot of complexity and proposal to add complexity to the tax code because of wage thoughts about fairness are what brought us the mess that we have right now. Any additional increases in complexity should come with a justification of why the increase in complexity is necessary.

Why do they sound that way?

1. Carrying deductions forward is quite complicated in my experience, and there are a ton of rules about what can be deducted from what. Paying people back when they have a negative tax bill is not complicated (and already done for many Americans who overpay throughout the year).
2. When you report taxes you already say what your basis was and when you bought the thing. Adjusting your basis by the risk-free rate amounts to looking up a single number and then multiplying it by the basis. But this change makes the investor neutral about when taxes get paid, so you could just remove the comically complicated list of rules that currently exist to control when gains are realized.
3. (I'm not sure if you've ever actually dealt with changes, but the cognitive savings would be huge under this regime.)
4. I think both of those are net simplifications. But the additional complexity from each of them is also completely trivial compared to the classification of types of income.
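
The basis adjustment in point 2 can be sketched as follows. This is a minimal illustration rather than a statement of how any tax authority would implement it: the function names are hypothetical, and `risk_free_factor` stands for the single looked-up number, assumed here to be the cumulative growth of $1 at the risk-free rate over the holding period. A negative result is a rebate, as in point 1.

```python
def taxable_gain(sale_price, basis, risk_free_factor):
    """Gain in excess of what the basis would have earned at the risk-free rate.

    `risk_free_factor` is an assumed published number: the growth of $1 at the
    risk-free rate between the purchase date and the sale date.
    """
    return sale_price - basis * risk_free_factor


def tax_due(sale_price, basis, risk_free_factor, tax_rate):
    """Tax owed on the sale; a negative value is paid back as a rebate."""
    return tax_rate * taxable_gain(sale_price, basis, risk_free_factor)


# Illustrative numbers only: $1000 basis, risk-free growth factor of 1.05.
print(tax_due(1200.0, 1000.0, 1.05, 0.5))  # sold above the adjusted basis
print(tax_due(900.0, 1000.0, 1.05, 0.5))   # sold below it, so a rebate
```

Because the basis grows at the risk-free rate, deferring a sale neither helps nor hurts the investor in expectation, which is why the rules governing when gains are realized would no longer be needed.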

It's possible that you can't realize any of those simplifications (e.g. that you'd still need to classify all your kinds of income even if they are taxed at the same rate, and that we'd still maintain wash sale rules and so on) because the code is sufficiently ossified that pointless and irrelevant complexities will remain in there forever. But even in the very worst case, I think you are probably misunderstanding where the complexity and compliance costs of the tax code come from; I don't think this is a noticeable increase (and in $ value it would be totally swamped by savings for tax planning even if you weren't able to simplify the code in all the natural ways).

Comment by paulfchristiano on paulfchristiano's Shortform · 2021-07-09T23:43:54.874Z · LW · GW

We could try to exploit some further structural facts about the parts of that are used by . For example, it feels like the intended model is going to be leveraging facts that are further "upstream." For example, suppose an attacker observes that there is a cat in the room, and so writes out "There is a cat in the room" as part of a natural-language description of what's going on that it hopes that will eventually learn to copy. If predicts the adversary's output, it must first predict that there is actually a cat in the room, which then ultimately flows downstream into predictions of the adversary's behavior.

And so we might hope to prefer the "intended" by having it preferentially read from the earlier activations (with shorter computational histories).

The natural way to implement this is to penalize not for the computation it does, but for all the computation needed to compute its output (including within .). The basic problem with this approach is that it incentivizes to do all of the computation of from scratch in a way optimized for speed rather than complexity.

I'd set this approach aside for a while because of this difficulty and the unnaturalness mentioned in the sibling (where we've given up on what seems to be an important form of parameter-sharing).

Today I was thinking about some apparently-totally-different angles of attack for the ontology identification problem, and this idea seems to have emerged again, with a potential strategy for fixing the "recompute problem". (In the context of ontology identification, the parameter-sharing objection no longer applies.)

Here's the idea:

• Train as before.
• Start with a bunch of facts and probabilistic relationships that the human knows, expressed in their own ontology. These might be facts like "Ice melts at 100 degrees" or "Dogs bark at cats" or whatever.
• We are going to try to jointly learn (i) a correspondence between and the human's ontology, (ii) a set of "justifications" showing that satisfies all of the relationships the human expects. I'm imagining justifications like simulating ice at 100 degrees and observing that it indeed melts, or sampling situations with dogs and cats and verifying that the dogs bark.
• The correspondence is constrained to map "the observations" (a concept in the human's ontology) to the output of , but other than that there is no simplicity prior, it can be anything.
• Our goal is to make the justifications as computationally simple as possible. (Which indirectly incentivizes us to make as computationally simple as possible.)

This still feels a bit weird, but you could imagine it handling a bunch of cases in a promising way:

• has no reason to "reuse the human" or to use the human's inference procedure in order to save bits, because doing either of those things makes the justifications more computationally complex.
• If there is a human embedded inside who performs the correspondence and writes it down in an easy-to-read way, has no motivation to read it: if is defined in that way, then justifying facts will require simulating the human (even if that work was already done inside ). If simply cut out the middle man and applied the correspondence itself, then it could save compute in the typical case (except when talking about facts about that human). This is subtle in a few ways but tentatively looks plausible to me.
• has no reason to ignore and implement a new more-efficient-but-more-complex simulation , because (i) it ultimately needs to relate observations back to the output of , and many of its concepts are related to observations (e.g. what cats look like), (ii) that forces and to have the same behavior, (iii) the justification would then need to show that the "observations" in are the same as the observations in , which is computationally costly.

But right now it's a pretty vague proposal, because it's unclear what the nature of these facts or justifications are. If you set that up in a naive way, then the justification effectively just needs to simulate all of . That's a problem because it reintroduces the failure mode where you need to simulate the human, and therefore there's no extra cost to just simulating and then listening to whatever they say.

Overall I think that probably nothing like this works, but I'm still feeling a lot more optimistic than I was last week and want to explore it further. (This is partially for reasons not discussed in this comment, that several other approaches/motivations seem to converge on something similar.)

Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-09T20:06:40.746Z · LW · GW

I like using Fritz.

It sounds like we are on basically the same page about what experiments would be interesting.
Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-09T18:33:55.036Z · LW · GW

i) I'm interested in any good+scalable old engine. I think it's reasonable to focus on something easy, the most important constraint is that it is really state of the art and scales up pretty gracefully. I'd prefer 2000 or earlier.

ii) It would be great if there was at least a complete description (stuff like: these numbers were looked up from this source with links, the population was made of the following engines with implementations from this link, here's the big table of game results and the Elo calculation, here was the code that was run to estimate nodes/sec).

iii) For the "old" experiment I'd like to use memory from the reference machine from the old period. I'd prefer to basically remove endgame tables and opening book.

My ideal would be to pick a particular "old" year as the focus. Ideally that would be a year for which we (a) have an implementation of the engine, (b) have representative hardware from the period that we can use to compute nodes/sec for each of our engines. Then I'm interested in:

• Compute nodes/sec for the old and new engine on both the old and new hardware. This gives us 4 numbers.
• Evaluate Elos of both of those engines, running on both "old memory" and "new memory," as a function of nodes/turn. This gives us 4 graphs. (I assume that memory affects performance slightly independently of nodes/turn, at least for the new engine? If nodes/turn is the wrong measure, whatever other measure of computational cost makes sense, the important thing is that the cost is linear in the measurement.)

Comment by paulfchristiano on Measuring hardware overhang · 2021-07-09T18:15:03.151Z · LW · GW

How are those MIPS numbers produced? My impression was that the raw numbers were nodes/sec, and then some calibration was done to relate this to MIPS?
Comment by paulfchristiano on Improving capital gains taxes · 2021-07-09T15:21:47.000Z · LW · GW

I don't think that's the case for this proposal. Suppose that I would have invested $X in the absence of taxes, but I am now subject to a 50% tax rate. For simplicity assume the risk-free rate is 0 (it doesn't change the calculation).

Instead of investing $X, I will now invest $2X (potentially taking out a 0-interest loan to do it).

If $X would have earned $Y of investment income/loss, then my $2X investment will earn $2Y of income/loss. I pay 50% of this as taxes. So my take-home income is $Y---exactly the same as if there had been no taxes.

This does discourage me from spending time finding good investments if they aren't scalable---but only in exactly the same way as it discourages all other labor, so in this case it's fixing a bug in the current tax code. In fact, I claim that any non-scalable investment was actually totally reasonable to tax, and we're discouraging exactly the right set of stuff.

Comment by paulfchristiano on Improving capital gains taxes · 2021-07-09T15:15:32.834Z · LW · GW

That's not my guess but it seems plausible. Do you have some explanation/argument/calculation/intuition?

ETA: I actually thought that part 2 would increase tax revenues all on its own, though that might be making unrealistic assumptions about investor rationality. Not sure if you are referring to part 2, or to the whole package.

Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-08T22:24:53.099Z · LW · GW

To clarify my stance on prizes:

• I will probably offer Gwern a $100 small prize for the link.
• I will probably offer hippke a $1000 prize for the prior work.
• I would probably have offered hippke something like a $3000 prize if the experiment hadn't already been done.
• The main thing to make the prize bigger would have been (i) doing the other half, of evaluating old engines on new hardware, (ii) more clarity about the numbers including publishing the raw data and ideally sufficiently detailed instructions for reproducing, (iii) more careful controls for memory, endgame tables, (iv) I would post a call for critiques to highlight reservations with the numbers before awarding the rest of the prize.
• Someone could still earn a $10,000 prize for closing all of those gaps (and hippke could earn some large fraction of this).
Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-08T21:52:36.123Z · LW · GW

Is your prediction that e.g. the behavior of chess will be unrelated to the behavior of SAT solving, or to factoring? Or that "those kinds of things" can be related to each other but not to image classification? Or is your prediction that the "new regime" for chess (now that ML is involved) will look qualitatively different than the old regime?

There are problems where one paper reduces the compute requirements by 20 orders of magnitude. Or gets us from couldn't do X at all, to able to do X easily.

I'm aware of very few examples of that occurring for problems that anyone cared about (i.e. in all such cases we found the breakthroughs before they mattered, not after). Are you aware of any?

a prime factoring algorithm is maths

Factoring algorithms, or primality checking, seem like fine domains to study to me. I'm also interested in those and would be happy to offer similar bounties for similar analyses.

You have a spectrum of possible reference classes for transformative AI that range from the almost purely software driven progress, to the almost totally hardware driven progress.

I think it's pretty easy to talk about what distinguishes chess, SAT, classification, or factoring from multiplication. And I'm very comfortable predicting that the kind of AI that helps with R&D is more like the first four than like the last (though these things are surely on a spectrum).

You may have different intuitions, which I think is fine; in that case this explains part of why this data is more interesting to me than to you.

Progress on chess AI's contained no breakthroughs, no fundamental insights, only a slow accumulation of little tricks.

Can you point to a domain where increasing R&D led to big insights that improved performance?

Perhaps more importantly, machine learning is also "a slow accumulation of little tricks," so the analogy seems fine to me. (You might think that future AI is totally different, which is fine and not something I want to argue about here.)

To gain more info about transformative AI, someone would have to make either a good case for why it should be at a particular position on the scale, or a good case for why its position on the scale should be similar to the position of some previous piece of past research. In the latter case, we can gain from examining the position of that research topic. If hypothetically that topic was chess, then the research you propose would be useful. If the reason you chose chess was purely that you thought it was easier to measure, then the results are likely useless.

If Alice says this and so never learns about anything, and Bob instead learns a bunch of facts about a bunch of domains, I'm pretty comfortable betting on Bob being more accurate about most topics.

I think the general point is: different domains differ from one another. You want to learn about a bunch of them and see what's going on, in order to reason about a new domain.

The consistency of chess performance looks like more selection bias. You aren't choosing a problem domain where there was one huge breakthrough that. You are choosing a problem domain that has had slow consistent progress.

I agree with the basic point that board games are selected to be domains where there is an obvious simple thing to do, and so progress started early. I think in that way they are similar to SAT solving and factoring, and (slightly) different from image classification, for which it's arguably hard to measure progress before the 90s.

I think that the more important difference between chess and image classification (in terms of making the chess data clean) is that there is a homogeneous measure of performance for chess, whereas image classification has moved on to harder and harder tasks. I think this does slightly change the nature of the task, but mostly it just makes the data clean.

I think that the main difference between chess and SAT solving is mostly that chess is more naturally interesting to people so they've been working on it longer, and that is a real factor that makes the data cleaner (without making it less useful as an analogy). SAT solving also has some of the image classification problem, of depending a lot on the distribution of instances.

(With respect to "aren't choosing a domain where there was one huge breakthrough," I'm definitely interested in domains with such breakthroughs.)

Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-08T15:52:50.351Z · LW · GW

Old chess engines will have been optimized to use much less memory, whereas modern chess engines use a very large hash table.

I'm pretty interested in understanding the size of this effect by scaling down the memory use as well as compute to historical levels. (This is one of my concerns about hippke's experiment, though it seems like they think it's not a big factor.)

Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-08T15:49:26.609Z · LW · GW

Conclusion: Most of what you want to measure comes down to neural network training. The training framework is not directly comparable or backwards-compatible with old techniques, so the experiment formulation has to address this.

This seems right if "the dynamics of ML R&D are unrelated to other software R&D---you can't learn about neural net efficiency improvements by looking at efficiency improvements in other domains." But I'm not so sure about that (and haven't seen any evidence for it).

ETA: to clarify, I'm mostly interested in how much future AI will get improved as we massively scale up R&D investment (by applying AI to AI development). This includes e.g. "Tweaking neural net architectures" or "Better optimization algorithms for neural networks" or "better ways to integrate neural networks with search" or whatever. Those improvements are indeed different from "better forms of tree search" or "better position evaluations" and so on. But I still think they are related---if I learn that for a few different domains "doubling R&D doubles performance," then that gives me evidence that neural net performance will be similar, and if I learn that this kind of return is very rare then I'll be more skeptical about that kind of extrapolation holding up even if I observe it for the first few orders of magnitude for neural networks.

Comment by paulfchristiano on Measuring hardware overhang · 2021-07-08T01:09:38.766Z · LW · GW

From the graph it looks like stockfish is able to match the results of engines from ~2000 using ~1.5 orders of magnitude less compute.

• Is that the right way to read this graph?
• Do you have the numbers for SF8 evaluations so that I can use those directly rather than eyeballing from this graph? (I'm generally interested in whatever raw data you have.)

Comment by paulfchristiano on Measuring hardware overhang · 2021-07-08T00:53:02.517Z · LW · GW

Pulled from the Wayback Machine

Comment by paulfchristiano on How much chess engine progress is about adapting to bigger computers? · 2021-07-08T00:50:33.175Z · LW · GW

Thanks for the link (and thanks to hippke for doing the experiments), that's great.