Posts

Distributed public goods provision 2020-09-26T21:20:05.352Z · score: 26 (8 votes)
Better priors as a safety problem 2020-07-05T21:20:02.851Z · score: 64 (18 votes)
Learning the prior 2020-07-05T21:00:01.192Z · score: 75 (18 votes)
Inaccessible information 2020-06-03T05:10:02.844Z · score: 85 (27 votes)
Writeup: Progress on AI Safety via Debate 2020-02-05T21:04:05.303Z · score: 95 (27 votes)
Hedonic asymmetries 2020-01-26T02:10:01.323Z · score: 84 (30 votes)
Moral public goods 2020-01-26T00:10:01.803Z · score: 125 (43 votes)
Of arguments and wagers 2020-01-10T22:20:02.213Z · score: 58 (19 votes)
Prediction markets for internet points? 2019-10-27T19:30:00.898Z · score: 40 (18 votes)
AI alignment landscape 2019-10-13T02:10:01.135Z · score: 43 (16 votes)
Taxing investment income is complicated 2019-09-22T01:30:01.242Z · score: 34 (13 votes)
The strategy-stealing assumption 2019-09-16T15:23:25.339Z · score: 68 (18 votes)
Reframing the evolutionary benefit of sex 2019-09-14T17:00:01.184Z · score: 67 (23 votes)
Ought: why it matters and ways to help 2019-07-25T18:00:27.918Z · score: 88 (36 votes)
Aligning a toy model of optimization 2019-06-28T20:23:51.337Z · score: 52 (17 votes)
What failure looks like 2019-03-17T20:18:59.800Z · score: 230 (100 votes)
Security amplification 2019-02-06T17:28:19.995Z · score: 20 (4 votes)
Reliability amplification 2019-01-31T21:12:18.591Z · score: 24 (7 votes)
Techniques for optimizing worst-case performance 2019-01-28T21:29:53.164Z · score: 24 (7 votes)
Thoughts on reward engineering 2019-01-24T20:15:05.251Z · score: 31 (9 votes)
Learning with catastrophes 2019-01-23T03:01:26.397Z · score: 28 (9 votes)
Capability amplification 2019-01-20T07:03:27.879Z · score: 24 (7 votes)
The reward engineering problem 2019-01-16T18:47:24.075Z · score: 24 (5 votes)
Towards formalizing universality 2019-01-13T20:39:21.726Z · score: 29 (6 votes)
Directions and desiderata for AI alignment 2019-01-13T07:47:13.581Z · score: 30 (7 votes)
Ambitious vs. narrow value learning 2019-01-12T06:18:21.747Z · score: 21 (7 votes)
AlphaGo Zero and capability amplification 2019-01-09T00:40:13.391Z · score: 30 (13 votes)
Supervising strong learners by amplifying weak experts 2019-01-06T07:00:58.680Z · score: 29 (8 votes)
Benign model-free RL 2018-12-02T04:10:45.205Z · score: 14 (5 votes)
Corrigibility 2018-11-27T21:50:10.517Z · score: 42 (11 votes)
Humans Consulting HCH 2018-11-25T23:18:55.247Z · score: 22 (5 votes)
Approval-directed bootstrapping 2018-11-25T23:18:47.542Z · score: 19 (4 votes)
Approval-directed agents 2018-11-22T21:15:28.956Z · score: 29 (5 votes)
Prosaic AI alignment 2018-11-20T13:56:39.773Z · score: 41 (13 votes)
An unaligned benchmark 2018-11-17T15:51:03.448Z · score: 28 (7 votes)
Clarifying "AI Alignment" 2018-11-15T14:41:57.599Z · score: 62 (19 votes)
The Steering Problem 2018-11-13T17:14:56.557Z · score: 40 (12 votes)
Preface to the sequence on iterated amplification 2018-11-10T13:24:13.200Z · score: 42 (16 votes)
The easy goal inference problem is still hard 2018-11-03T14:41:55.464Z · score: 42 (13 votes)
Could we send a message to the distant future? 2018-06-09T04:27:00.544Z · score: 40 (14 votes)
When is unaligned AI morally valuable? 2018-05-25T01:57:55.579Z · score: 102 (32 votes)
Open question: are minimal circuits daemon-free? 2018-05-05T22:40:20.509Z · score: 128 (39 votes)
Weird question: could we see distant aliens? 2018-04-20T06:40:18.022Z · score: 85 (25 votes)
Implicit extortion 2018-04-13T16:33:21.503Z · score: 74 (22 votes)
Prize for probable problems 2018-03-08T16:58:11.536Z · score: 144 (39 votes)
Argument, intuition, and recursion 2018-03-05T01:37:36.120Z · score: 103 (31 votes)
Funding for AI alignment research 2018-03-03T21:52:50.715Z · score: 108 (29 votes)
Funding for independent AI alignment research 2018-03-03T21:44:44.000Z · score: 5 (1 votes)
The abruptness of nuclear weapons 2018-02-25T17:40:35.656Z · score: 105 (37 votes)
Arguments about fast takeoff 2018-02-25T04:53:36.083Z · score: 117 (39 votes)

Comments

Comment by paulfchristiano on Puzzle Games · 2020-09-28T16:11:10.406Z · score: 2 (1 votes) · LW · GW

Question about Monster's Expedition:

The reset mechanic seems necessary to make the game playable in practice, but it seems very unsatisfying. It is unclear how you'd make it work in a principled way; the actual implementation seems extremely confusing, seems to depend on invisible information about the environment, and has some weird behaviors that I think are probably bugs. Unfortunately, it currently seems possible that probing the weirdest behaviors of resetting (e.g. breaking conservation-of-trees) could be the only way to access some places. It's also possible that mundane applications of resetting are essential but you aren't intended to explore weird edge cases, which would be the least satisfying outcome of all.

So two questions:

1. Is it possible to beat the game without resetting? Can I safely ignore it as a mechanic? This is my current default assumption and it's working fine so far.

2. Is the reset mechanic actually lawful/reasonable and I just need to think harder?

(Given the quality of the game I'm hoping that at least one of those is "yes." If "no answer" seems like the best way to enjoy the game I'm open to that as well.)

Comment by paulfchristiano on Puzzle Games · 2020-09-27T23:19:31.335Z · score: 6 (3 votes) · LW · GW

This list is almost the same as mine. I would include Hanano Puzzle 2 at tier 2 and Cosmic Express at tier 3. I haven't played Twisty Little Passages or Kine though I'll try them on this recommendation.

We're putting together a self-contained campaign for engine-game.com which is aiming to be Tier-2-according-to-Paul. We'll see if other folks agree when it's done. It has a very different flavor from the other games on the list.

Comment by paulfchristiano on Distributed public goods provision · 2020-09-27T16:24:08.616Z · score: 6 (3 votes) · LW · GW

I think you inevitably need to answer "What is the marginal impact of funding?" if you are deciding how much to fund something.

(I will probably write about approaches to randomizing to be maximally efficient with research time at some point in the future. My current plan is something like: write out public goods I know of that I benefit from, then sample one of them to research and fund by 1/p(chosen) more than I normally would.)
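(For concreteness, here is a minimal sketch in Python of the sampling scheme just described; the specific goods, weights, and donation amount are placeholders of mine, not anything from the plan above.)

```python
import random

# Hypothetical list of public goods I benefit from, with rough weights for
# how likely I am to pick each one (placeholders, not real recommendations).
goods = {"open-source tooling": 0.5, "local journalism": 0.3, "community wiki": 0.2}

def pick_good_to_fund(goods, usual_donation):
    """Sample one good and scale its funding by 1/p(chosen), so each good
    gets its usual amount in expectation while I only research one of them."""
    names = list(goods)
    weights = [goods[n] for n in names]
    chosen = random.choices(names, weights=weights, k=1)[0]
    p = goods[chosen] / sum(weights)
    return chosen, usual_donation / p

chosen, amount = pick_good_to_fund(goods, usual_donation=100.0)
print(f"Research and fund: {chosen}, ${amount:.2f}")
```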

This isn't really meant to be a quick rule of thumb, it's meant to be a way to answer the question at all.

Comment by paulfchristiano on Search versus design · 2020-08-17T14:46:52.045Z · score: 21 (11 votes) · LW · GW

I liked this post.

I'm not sure that design will end up being as simple as this picture makes it look, no matter how well we understand it---it seems like factorization is one kind of activity in design, but it feels like overall "design" is being used as a kind of catch-all that is probably very complicated.

An important distinction for me is: does the artifact work because of the story (as in "design"), or does the artifact work because of the evaluation (as in search)?

This isn't so clean, since:

  • Most artifacts work for a combination of the two reasons---I design a thing then test it and need a few iterations---there is some quantitative story where both factors almost always play a role for practical artifacts.
  • There seem to be many other reasons things work (e.g. "it's similar to other things that worked" seems to play a super important role in both design and search).
  • A story seems like it's the same kind of thing as an artifact, and we could also talk about where *it* comes from. A story that plays a role in a design itself comes from some combination of search and design.
  • During design it seems likely that humans rely very extensively on searching against mental models, which may not be introspectively available to us as a search but seems like it has similar properties.

Despite those and more complexities, it feels to me like if there is a clean abstraction it's somewhere in that general space, about the different reasons why a thing can work.

Post-hoc stories are clearly *not* the "reason why things work" (at least at this level of explanation). But also if you do jointly search for a model+helpful story about it, the story still isn't the reason why the model works, and from a safety perspective it might be similarly bad.

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-22T01:41:35.308Z · score: 12 (3 votes) · LW · GW
Yeah, I've heard (through the grapevine) that Paul and Geoffrey Irving think debate and factored cognition are tightly connected

For reference, this is the topic of section 7 of AI Safety via Debate.

In the limit they seem equivalent: (i) it's easy for HCH(with X minutes) to discover the equilibrium of a debate game where the judge has X minutes, (ii) a human with X minutes can judge a debate about what would be done by HCH(with X minutes).

The ML training strategies also seem extremely similar, in the sense that the difference between them is smaller than design choices within each of them, though that's a more detailed discussion.

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-22T01:34:35.946Z · score: 4 (2 votes) · LW · GW
I'm a bit confused why you would make the debate length known to the debaters. This seems to allow them to make indefensible statements at the very end of a debate, secure in the knowledge that they can't be critiqued. One step before the end, they can make statements which can't be convincingly critiqued in one step. And so on.
[...]
The most salient reason for me ATM is the concern that debaters needn't structure their arguments as DAGs which ground out in human-verifiable premises, but rather, can make large circular arguments (too large for the debate structure to catch) or unbounded argument chains (or simply very very high depth argument trees, which contain a flaw at a point far too deep for debate to find).

If I assert "X because Y & Z" and the depth limit is 0, you aren't intended to say "Yup, checks out," unless Y and Z and the implication are self-evident to you. Low-depth debates are supposed to ground out with the judge's priors / low-confidence in things that aren't easy to establish directly (because if I'm only updating on "Y looks plausible in a very low-depth debate" then I'm going to say "I don't know but I suspect X" is a better answer than "definitely X"). That seems like a consequence of the norms in my original answer.

In this context, a circular argument just isn't very appealing. At the bottom you are going to be very uncertain, and all that uncertainty is going to propagate all the way up.

Instead, it seems like you'd want the debate to end randomly, according to a memoryless distribution. This way, the expected future debate length is the same at all times, meaning that any statement made at any point is facing the same expected demand of defensibility.

If you do it this way the debate really doesn't seem to work, as you point out.

I currently think all my concerns can be addressed if we abandon the link to factored cognition and defend a less ambitious thesis about debate.

For my part I mostly care about the ambitious thesis.

If the two players choose simultaneously, then it's hard to see how to discourage them from selecting the same answer. This seems likely at late stages due to convergence, and also likely at early stages due to the fact that both players actually use the same NN. This again seriously reduces the training signal.
If player 2 chooses an answer after player 1 (getting access to player 1's answer in order to select a different one), then assuming competent play, player 1's answer will almost always be the better one. This prior taints the judge's decision in a way which seems to seriously reduce the training signal and threaten the desired equilibrium.

I disagree with both of these as objections to the basic strategy, but don't think they are very important.

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-19T21:09:26.655Z · score: 16 (5 votes) · LW · GW

Sorry for not understanding how much context was missing here.

The right starting point for your question is this writeup which describes the state of debate experiments at OpenAI as of end-of-2019 including the rules we were using at that time. Those rules are a work in progress but I think they are good enough for the purpose of this discussion.

In those rules: If we are running a depth-T+1 debate about X and we encounter a disagreement about Y, then we start a depth-T debate about Y and judge exclusively based on that. We totally ignore the disagreement about X.
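(A minimal sketch of that recursion rule from the writeup, with hypothetical stand-ins for the argue/disagreement/judge steps; this is my own illustration, not the actual implementation.)

```python
from typing import Callable, Optional

def run_debate(
    question: str,
    depth: int,
    argue: Callable[[str], str],                        # hypothetical: produce a transcript
    find_disagreement: Callable[[str], Optional[str]],  # hypothetical: extract a disputed sub-claim
    judge: Callable[[str, str, int], str],              # hypothetical: (question, transcript, depth) -> answer
) -> str:
    """If the debaters disagree about a sub-claim Y, recurse into a depth-1-smaller
    debate about Y and judge exclusively based on that, ignoring the original X."""
    transcript = argue(question)
    subclaim = find_disagreement(transcript)
    if subclaim is not None and depth > 0:
        return run_debate(subclaim, depth - 1, argue, find_disagreement, judge)
    return judge(question, transcript, depth)
```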

Our current rules---to hopefully be published sometime this quarter---handle recursion in a slightly more nuanced way. In the current rules, after debating Y we should return to the original debate. We allow the debaters to make a new set of arguments, and it may be that one debater now realizes they should concede, but it's important that a debater who had previously made an untenable claim about X will eventually pay a penalty for doing so (in addition to whatever payoff they receive in the debate about Y). I don't expect this paragraph to be clear and don't think it's worth getting into until we publish an update, but wanted to flag it.

Do the debaters know how long the debate is going to be?

Yes.

To what extent are you trying to claim some relationship between the judge strategy you're describing and the honest one? EG, that it's eventually close to honest judging? (I'm asking whether this seems like an important question for the discussion vs one which should be set aside.)

If debate works, then at equilibrium the judge will always be favoring the better answer. If furthermore the judge believes that debate works, then this will also be their honest belief. So if judges believe in debate then it looks to me like the judging strategy must eventually approximate honest judging. But this is downstream of debate working; it doesn't play an important role in the argument that debate works or anything like that.

Comment by paulfchristiano on Challenges to Christiano’s capability amplification proposal · 2020-07-18T05:58:41.419Z · score: 6 (3 votes) · LW · GW

Providing context for readers: here is a post someone wrote a few years ago about issues (ii)+(iii) which I assume is the kind of thing Czynski has in mind. The most relevant thing I've written on issues (ii)+(iii) are Universality and consequentialism within HCH, and prior to that Security amplification and Reliability amplification.

Comment by paulfchristiano on Challenges to Christiano’s capability amplification proposal · 2020-07-18T03:45:56.591Z · score: 11 (6 votes) · LW · GW

I think not.

For the kinds of questions discussed in this post, which I think are easier than "Design Hessian-Free Optimization" but face basically the same problems, I think we are making reasonable progress. I'm overall happy with the progress but readily admit that it is much slower than I had hoped. I've certainly made updates (mostly about people, institutions, and getting things done, but naturally you should update differently).

Note that I don't think "Design Hessian-Free Optimization" is amongst the harder cases, and these physics problems are a further step easier than that. I think that sufficient progress on these physics tasks would satisfy the spirit of my remark 2y ago.

I appreciate the reminder at the 2y mark. You are welcome to check back in 1y later and if things don't look much better (at least on this kind of "easy" case), treat it as a further independent update.

Comment by paulfchristiano on Challenges to Christiano’s capability amplification proposal · 2020-07-18T03:31:11.356Z · score: 7 (5 votes) · LW · GW
To claim that you have removed optimization pressure to be unaligned

The goal is to remove the optimization pressure to be misaligned, and that's the reason you might hope for the system to be aligned. Where did I make the stronger claim you're attributing to me?

I'm happy to edit the offending text, I often write sloppily. But Rohin is summarizing the part of this post where I wrote "The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn't a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren't internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process." So in this case it seems clear that I was stating a goal.

Even among normal humans there are principal-agent problems.

In the scenario of a human principal delegating to a human agent there is a huge amount of optimization pressure to be misaligned. All of the agents' evolutionary history and cognition. So I don't think the word "even" belongs here.

There is optimization pressure to be unaligned; of course there is!

I agree that there are many possible malign optimization pressures, e.g.: (i) the optimization done deliberately by those humans as part of being competitive, which they may not be able to align, (ii) "memetic" selection amongst patterns propagating through the humans, (iii) malign consequentialism that arises sometimes in the human policy (either randomly or in some situations). I've written about these and it should be obvious they are something I think a lot about, am struggling with, and believe there are plausible approaches to dealing with.

(I think it would be defensible for you to say something like "I don't believe that Paul's writings give any real reason for optimism on these points and the fact that he finds them reassuring seems to indicate wishful thinking," and if that's a fair description of your position then we can leave it at that.)

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-17T15:11:35.499Z · score: 2 (1 votes) · LW · GW
Do you mean that every debater could have defended each of their statements s in a debate which lasted an additional N steps after s was made? What happens if some statements are challenged? And what exactly does it mean to defend statements from a challenge?

Yes. N is the remaining length of the debate. As discussed in the paper, when one player thinks that the other is making an indefensible claim, we zoom in on the subclaim and use the remaining time to resolve it.

I get the feeling you're suggesting something similar to the high school debate rule (which I rejected but didn't analyze very much), where unrefuted statements are assumed to be established (unless patently false), refutations are assumed decisive unless they themselves are refuted, etc.

There is a time/depth limit. A discussion between two people can end up with one answer that is unchallenged, or two proposals that everyone agrees can't be resolved in the remaining time. If there are conflicting answers that debaters don't expect to be able to resolve in the remaining time, the strength of inference will depend on how much time is remaining, and will mean nothing if there is no remaining time.

At the end of training, isn't the idea that the first player is winning a lot, since the first player can choose the best answer?

I'm describing what you should infer about an issue that has come up where neither player wants to challenge the other's stance.

Are agents really incentivized to justify their assertions?

Under the norms I proposed in the grandparent, if one player justifies and the other doesn't (nor challenge the justification), the one who justifies will win. So it seems like they are incentivized to justify.

Are those justifications incentivized to be honest?

If they are dishonest then the other player has the opportunity to challenge them. So initially making a dishonest justification may be totally fine, but eventually the other player will learn to challenge and you will need to be honest in order to defend.

In the cases where the justifications aren't fully verifiable, does it really make sense for the humans to trust anything they say? In particular, given the likelihood that one of the agents is lying?

It's definitely an open question how much can be justified in a depth N debate.

I recognize that you're saying these are open questions, I'm just trying to highlight where I'm confused -- particularly as these questions are bound up with the question of what judge strategies should look like. It seems like a lot of pieces need to come together in just the right way, and I'm not currently seeing how judge strategies can simultaneously accomplish everything they need to.

It seems like the only ambiguity in the proposal in the grandparent is: "How much should you infer from the fact that a statement can be defended in a length T debate?" I agree that we need to answer this question to make the debate fully specified (of course we wanted to answer it anyway in order to use debate). My impression is that this isn't what you are confused about and that there's a more basic communication problem.

In practice this doesn't seem to be an important part of the difficulty in getting debates to work, for the reasons I sketched above---debaters are free to choose what justifications they give, so a good debater at depth T+1 will give statements that can be justified at depth T (in the sense that a conflicting opinion with a different upshot couldn't be defended at depth T), and the judge will basically ignore statements where conflicting positions can both be justified at depth T. It seems likely there is some way to revise the rules so that the judge instructions don't have to depend on "assume that answer can be defended at depth T" but it doesn't seem like a priority.

Comment by paulfchristiano on How should AI debate be judged? · 2020-07-16T04:56:31.443Z · score: 23 (9 votes) · LW · GW

Your debate comes with some time limit T.

If T=0, use your best guess after looking at what the debaters said.

If T=N+1 and no debater challenges any of their opponent's statements, then give your best answer assuming that every debater could have defended each of their statements from a challenge in a length-N debate.

Of course this assumption won't be valid at the beginning of training. And even at the end of training we really only know something weaker like: "Neither debater thinks they would win by a significant expected margin in a length N debate."
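(Here is how I read that judging rule, as a small sketch; `best_guess` is a hypothetical stand-in for the judge's own estimate.)

```python
from typing import Callable, List

def judge(statements: List[str], T: int,
          best_guess: Callable[[List[str], int], str]) -> str:
    """best_guess(statements, d): the judge's own answer, treating each
    unchallenged statement as defensible in a depth-d debate (d = -1 means
    no such assumption, just read what the debaters said)."""
    if T == 0:
        return best_guess(statements, -1)
    # T = N + 1 and nobody challenged anything:
    return best_guess(statements, T - 1)

# Early in training the defensibility assumption is simply false, and even at
# the end it only holds in the weaker "no debater expects to gain much by
# challenging" sense, so best_guess has to discount accordingly.
```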

What can you infer if you see answers A and B to a question and know that both of them are defensible (in expectation) in a depth-N debate? That's basically the open research question, with the hope being that you inductively make stronger and stronger inferences for larger N.

(This is very similar to asking when iterated amplification produces a good answer, up to the ambiguity about how you sample questions in amplification.)

(When we actually give judges instructions for now we just tell them to assume that both debater's answers are reasonable. If one debater gives arguments where the opposite claim would also be "reasonable," and the other debater gives arguments that are simple enough to be conclusively supported with the available depth, then the more helpful debater usually wins. Overall I don't think that precision about this is a bottleneck right now.)

Comment by paulfchristiano on Learning the prior · 2020-07-08T00:57:43.765Z · score: 4 (2 votes) · LW · GW

Yeah, that's my view.

Comment by paulfchristiano on Better priors as a safety problem · 2020-07-08T00:56:20.397Z · score: 2 (1 votes) · LW · GW
To the extent that we instinctively believe or disbelieve this, it's not for the right reasons - natural selection didn't have any evidence to go on. At most, that instinct is a useful workaround for the existential dread glitch.

To the extent that we believe this correctly, it's for the same reasons that we are able to do math and philosophy correctly (or at least more correctly than chance :) despite natural selection not caring about it much. It's the same reason that you can correctly make arguments like the one in your comment.

Comment by paulfchristiano on Better priors as a safety problem · 2020-07-08T00:54:03.253Z · score: 4 (2 votes) · LW · GW

I think that under the counting measure, the vast majority of people like us are in simulations (ignoring subtleties with infinities that make that statement meaningless).

I think that under a more realistic measure, it's unclear whether or not most people like us are in simulations.

Those statements are unrelated to what I was getting at in the post though, which is more like: the simulation argument rests on us being the kind of people who are likely to be simulated, and we don't think that everyone should believe they are in a simulation just because simulators are more likely to simulate realistic-looking worlds than reality is to produce them; that seems absurd.

The whole thing is kind of a complicated mess and I wanted to skip it by brushing aside the simulation argument. Maybe should have just not mentioned it at all given that the simulation argument makes such a mess of it. I don't expect to be able to get clarity in this thread either :)

I think the reason why the hypothesis that the world is a dream seems absurd has very little to do with likelihood ratios and everything to do with heuristics like "don't trust things that sound like what a crazy person, drug-addled person, or mystic would say."

It's not the hypothesis that's absurd, it's this particular argument.

Comment by paulfchristiano on Learning the prior · 2020-07-08T00:46:38.228Z · score: 6 (3 votes) · LW · GW

That's right---you still only get a bound on average quality, and you need to do something to cope with failures so rare they never appear in training (here's a post reviewing my best guesses).

But before, you weren't even in the game: it wouldn't matter how well adversarial training worked, because you didn't even have the knowledge to tell whether a given behavior is good or bad. You weren't even getting the right behavior on average.

(In the OP I think the claim "the generalization is now coming entirely from human beliefs" is fine; I meant generalization from one distribution to another. "Neural nets are fine" was sweeping these issues under the rug. Though note that in the real world the distribution will change from neural net training to deployment; that's just the normal robustness problem. The point of this post is just to get it down to only a robustness problem that you could solve with some kind of generalization of adversarial training, and the reason to set it up as in the OP was to make the issue more clear.)

Comment by paulfchristiano on Learning the prior · 2020-07-07T00:59:51.934Z · score: 2 (1 votes) · LW · GW
So even when you talk about amplifying f, you mean a certain way of extending human predictions to more complicated background information (e.g. via breaking down Z into chunks and then using copies of f that have been trained on smaller Z), not fine-tuning f to make better predictions.

That's right, f is either imitating a human, or it's trained by iterated amplification / debate---in any case the loss function is defined by the human. In no case is f optimized to make good predictions about the underlying data.

My impression is that your hope is that if Z and f start out human-like, then this is like specifying the "programming language" of a universal prior, so that search for highly-predictive Z, decoded through f, will give something that uses human concepts in predicting the world.

Z should always be a human-readable (or amplified-human-readable) latent; it will necessarily remain human-readable because it has no purpose other than to help a human make predictions. f is going to remain human-like because it's predicting what the human would say (or what the human-consulting-f would say etc.).

The amplified human is like the programming language of the universal prior, Z is like the program that is chosen (or slightly more precisely: Z is like a distribution over programs, described in a human-comprehensible way) and f is an efficient distillation of the intractable ideal.

Comment by paulfchristiano on Learning the prior · 2020-07-06T03:30:40.809Z · score: 4 (2 votes) · LW · GW
I'm not totally sure what actually distinguishes f and Z, especially once you start jointly optimizing them. If f incorporates background knowledge about the world, it can do better at prediction tasks. Normally we imagine f having many more parameters than Z, and so being more likely to squirrel away extra facts, but if Z is large then we might imagine it containing computationally interesting artifacts like patterns that are designed to train a trainable f on background knowledge in a way that doesn't look much like human-written text.

f is just predicting P(y|x, Z), it's not trying to model D. So you don't gain anything by putting facts about the data distribution in f---you have to put them in Z so that it changes P(y|x,Z).

Now, maybe you can try to ensure that Z is at least somewhat textlike via making sure it's not too easy for a discriminator to tell from human text, or requiring it to play some functional role in a pure text generator, or whatever.

The only thing Z does is get handed to the human for computing P(y|x,Z).
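(For readers keeping track of the moving pieces: as I understand the setup in the post, the objective being discussed can be written roughly as below; the notation is mine and only a paraphrase.)

$$Z^* = \arg\max_Z \left[\log \mathrm{Prior}_H(Z) + \sum_i \log P_H(y_i \mid x_i, Z)\right]$$

where Prior_H and P_H are the (amplified) human's judgments, and f(x, Z) is just a learned distillation of P_H(y | x, Z) that is never trained against the data directly.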

Comment by paulfchristiano on Learning the prior · 2020-07-06T03:27:22.524Z · score: 8 (4 votes) · LW · GW

The difference is that you can draw as many samples as you want from D* and they are all iid. Neural nets are fine in that regime.

Comment by paulfchristiano on AI Unsafety via Non-Zero-Sum Debate · 2020-07-06T00:10:04.628Z · score: 4 (2 votes) · LW · GW

It seems even worse than any of that. If your AI wanted anything at all it might debate well in order to survive. So if you are banking on it single-mindedly wanting to win the debate then you were already in deep trouble.

Comment by paulfchristiano on Second Wave Covid Deaths? · 2020-07-04T00:31:59.721Z · score: 12 (3 votes) · LW · GW
I don't understand how the second wave can't be explained by increase in testing. Before only people who were sick were allowed to be tested, who correlate more with hospital visits, which correlates more with deaths, so it more closely follows the death graph.

US positive test rate is up from 4.4% to 7.4%: https://coronavirus.jhu.edu/testing/individual-states

It used to be the case that 4.4% of people you tested had COVID-19.

Now you test more people, who look less risky on average, and find that 7.4% of people you test have COVID-19. The people you would have tested in the old days are the riskiest subgroup, so more than 7.4% of them have COVID-19.

So it sure seems like the infection rate went up by at least 7.4/4.4 - 1 ≈ +70%.
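(The same arithmetic, written out; the two rates are the ones quoted above.)

```python
old_positive_rate = 0.044  # share of tests positive under the old, narrower testing
new_positive_rate = 0.074  # share of tests positive now, despite testing lower-risk people

# The people who would have been tested under the old criteria are the riskiest
# subgroup, so their positivity rate is at least 7.4%; holding the old criteria
# fixed, infections rose by at least:
print(f"+{new_positive_rate / old_positive_rate - 1:.0%}")  # ~ +68%, i.e. roughly +70%
```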

Comment by paulfchristiano on High Stock Prices Make Sense Right Now · 2020-07-04T00:18:48.828Z · score: 4 (2 votes) · LW · GW

My impression is that most individual investors and pension funds put a significant part of their portfolio into bonds.

Comment by paulfchristiano on High Stock Prices Make Sense Right Now · 2020-07-04T00:09:02.642Z · score: 9 (5 votes) · LW · GW

I'd love to get evidence on that and it seems important.

Your position doesn't sound right to me. You don't need many people changing their allocations moderately to totally swamp a 1% change in inflows.

My guess would be that more than 10% of investors, weighted by total equity holdings, adjust their exposure deliberately, but I'd love to know the real numbers.

Comment by paulfchristiano on The "AI Debate" Debate · 2020-07-03T23:57:16.832Z · score: 2 (1 votes) · LW · GW
Do you think something like IDA is the only plausible approach to alignment? If so, I hadn't realized that, and I'd be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: "any agent (we make) that learns to act will be treacherous if treachery is possible." Are all learning agents fundamentally out to get you? I suppose that's a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn't be recognized

No, but what are the approaches to avoiding deceptive alignment that don't go through competitiveness?

I guess the obvious one is "don't use ML," and I agree that doesn't require competitiveness.

Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.

No, but now we are starting to play the game of throttling the overseee (to avoid it overpowering the overseer) and it's not clear how this is going to work and be stable. It currently seems like the only appealing approach to getting stability there is to ensure the overseer is competitive.

Comment by paulfchristiano on The "AI Debate" Debate · 2020-07-03T23:55:34.977Z · score: 2 (1 votes) · LW · GW
This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn't kill you (and helps you achieve your other goals). But it seems you're saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.

I'm saying that if you can't protect yourself from an AI in your lab, under conditions that you carefully control, you probably couldn't protect yourself from AI systems out there in the world.

The hope is that you can protect yourself from an AI in your lab.

Comment by paulfchristiano on The "AI Debate" Debate · 2020-07-03T23:53:10.540Z · score: 4 (2 votes) · LW · GW
So competitiveness still matters somewhat, but here's a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker.

Definitely a disagreement, I think that before anyone has an AGI that could beat humans in a fistfight, tons of people will have systems much much more valuable than a mechanical turk worker.

Comment by paulfchristiano on The "AI Debate" Debate · 2020-07-03T23:50:59.784Z · score: 2 (1 votes) · LW · GW
The way I map these concepts, this feels like an elision to me. I understand what you're saying, but I would like to have a term for "this AI isn't trying to kill me", and I think "safe" is a good one. That's the relevant sense of "safe" when I say "if it's safe, we can try it out and tinker". So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.

I mean that we don't have any process that looks like debate that could produce an agent that wasn't trying to kill you without being competitive, because debate relies on using aligned agents to guide the training process (and if they aren't competitive then the agent-being-trained will, at least in the limit, converge to an equilibrium where it kills you).

Comment by paulfchristiano on High Stock Prices Make Sense Right Now · 2020-07-03T22:14:32.302Z · score: 18 (6 votes) · LW · GW

The main reason I'm personally confused is that 2 months ago I thought there was real uncertainty about whether we'd be able to keep the pandemic under control. Over the last 2 months that uncertainty has gradually been resolved in the negative, without much positive news about people's willingness to throw in the towel rather than continuing to panic and do lockdowns, and yet over that period SPY has continued moving up.

I'm making no attempt at all to estimate prices based on fundamentals and I'm honestly not even sure how that exercise is supposed to work. Interest rates are very low and volatility isn't that high so it seems like you would have extremely high equity prices if e.g. most investors were rational with plausible utility functions. But equity prices are never nearly as high as that kind of analysis would suggest.

Comment by paulfchristiano on High Stock Prices Make Sense Right Now · 2020-07-03T22:05:25.056Z · score: 31 (15 votes) · LW · GW

I think people's annual income is on average <20% of their net worth ($100T vs $20T), maybe more like 15%.

So 2 months of +20% savings amounts to <1% increase in total savings, right?
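(Back-of-the-envelope, using the figures above and reading "+20% savings" as an extra 20% of income saved; the exact interpretation is mine.)

```python
income_to_net_worth = 0.20  # ~$20T annual income vs ~$100T net worth
extra_saving_rate = 0.20    # an extra 20% of income saved during the period
months = 2

extra_savings = extra_saving_rate * (months / 12) * income_to_net_worth
print(f"{extra_savings:.1%} of total holdings")  # ~0.7%, i.e. <1%
```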

If that's right, this doesn't seem very important relative to small changes in people's average allocation between equities/debt/currency, which fluctuate by 10%s during the normal business cycle.

Normally I expect higher savings rates to represent concern about having money in the future, which will be accompanied by a move to safer assets. And of course volatility is currently way up, so rational investors probably couldn't afford to invest nearly as much in stocks unless they were being compensated with significantly higher returns (which should involve prices only returning to normal levels as volatility falls).

Comment by paulfchristiano on The "AI Debate" Debate · 2020-07-03T00:24:25.726Z · score: 5 (3 votes) · LW · GW
So what if AI Debate survives this concern? That is, suppose we can reliably find a horizon-length for which running AI Debate is not existentially dangerous. One worry I've heard raised is that human judges will be unable to effectively judge arguments way above their level. My reaction is to this is that I don't know, but it's not an existential failure mode, so we could try it out and tinker with evaluation protocols until it works, or until we give up. If we can run AI Debate without incurring an existential risk, I don't see why it's important to resolve questions like this in advance.

There are two reasons to worry about this:

  • The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.
  • I think it is unlikely for a scheme like debate to be safe without being approximately competitive---the goal is to get honest answers which are competitive with a potential malicious agent, and then use those answers to ensure that malicious agent can't cause trouble and that the overall system can be stable to malicious perturbations. If your honest answers aren't competitive, then you can't do that and your situation isn't qualitatively different from a human trying to directly supervise a much smarter AI.

In practice I doubt the second consideration matters---if your AI could easily kill you in order to win a debate, probably someone else's AI has already killed you to take your money (and long before that your society totally fell apart). That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs.

Even if you were the only AI project on earth, I think competitiveness is the main thing responsible for internal regulation and stability. For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing). More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won't be at a competitive advantage.

Comment by paulfchristiano on Karma fluctuations? · 2020-07-02T16:03:28.423Z · score: 4 (2 votes) · LW · GW
(including and indeed especially content that I mostly agree with)

In retrospect this was too self-flattering. Plenty of the stuff I don't want to see expresses ideas that I agree with, but the majority expresses ideas I disagree with.

Comment by paulfchristiano on Second Wave Covid Deaths? · 2020-07-02T01:53:57.161Z · score: 27 (8 votes) · LW · GW

(Disclaimer: I don't know what I'm talking about, pointers to real literature would be more useful than this, every sentence deserves to be aggressively hedged/caveated, etc.)

Increasing test capacity: I've seen some people suggest that the second wave is just an artifact of increased testing in these states. If that were the case, then there would be no rise in covid cases to be explained. But then I would expect the fraction of tests that returned positive to be decreasing, and we aren't seeing that. This one seems like wishful thinking to me.

I don't think the increase in testing capacity fully explains the "second wave," but I think it does totally change the quantitative picture.

Intuitively I expect that (rate of change in positive test %) is better than (rate of change in confirmed cases) as a way of approximating (rate of change in actual cases). It also doesn't seem great, especially over multiple weeks, but I'll use it here until someone convinces me this is dumb.

Johns Hopkins aggregates testing numbers here. Picking CA as a second-wave state, it hit its minimum positive test rate of .04 on May 24. That rate rose by 20% by June 21, to 0.048 (and has kept going up).

If there was a 7 day lag, we'd expect to see a 20% increase in deaths from May 31 to June 28. Eyeballing the Google deaths data, things look basically flat. So I guess that means a drop of ~20% in fatality rate over that month.

Trying again, let's take Georgia. Minimum of .058 on June 10, up 50% to .091 by June 21. Google seems to have deaths roughly constant or maybe decreasing from June 17 to June 28, which is a ballpark ~30% drop in fatality rate to offset the ~50% increase in infections.
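(The implied fatality-rate drops in the last two paragraphs come out of arithmetic like this, using the same eyeballed numbers.)

```python
def implied_fatality_rate_change(infection_change, death_change):
    """If infections rose by infection_change and (lagged) deaths by death_change,
    the fatality rate must have shifted by roughly this factor."""
    return (1 + death_change) / (1 + infection_change) - 1

print(f"CA: {implied_fatality_rate_change(0.20, 0.0):.0%}")  # ~ -17%, call it a ~20% drop
print(f"GA: {implied_fatality_rate_change(0.50, 0.0):.0%}")  # ~ -33%, call it a ~30% drop
```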

One problem with these numbers is that I think the test numbers are for day the test occurred, but the death numbers are for the day they are reported. Would probably be better to use numbers for the day the death actually occurred, though I think that probably requires going at least a few days further back in time (which is going to make it harder to interpret cases like Georgia that hit the minimum only 3 weeks ago).

Delayed initial testing: When things were first taking off in first wave states, our testing capacity was way behind where it needed to be. Perhaps this heavily suppressed the initial "confirmed" numbers for the first wave, and so we should expect to see second wave deaths rise in the next few weeks?

It seems like the average time lag between showing symptoms and dying from COVID is something like 18 days (here, data from China but if anything I expect longer lags here). So if we were testing people earlier it seems like we could easily have more like a 2 week lag than a 1 week lag. That could mostly explain Georgia and California.

Overall I can't really tell what's going on; my sense is that your story in the post is basically right (and demographic changes sound likely) but that the mystery to be explained is *much* less than a 5x change in fatality rate. I feel like the constant death rate in the face of exploding cases is suspicious, but my best guess is that it's a coincidence: death rates will end up rising and IFR will end up modestly lower than in the initial wave.

I would love to see a version of the analysis in the OP controlling for big increases in testing, and getting a more careful handle on lags between testing and death. Hopefully someone has already done that and it's just a matter of someone here finding the cite.

Comment by paulfchristiano on Relating HCH and Logical Induction · 2020-06-17T02:02:06.797Z · score: 2 (1 votes) · LW · GW
Think of it this way. We want to use a BRO to answer questions. We know it's very powerful, but at first, we don't have a clue as to how to answer questions with it. So we implement a Bayesian mixture-of-experts, which we call the "market". Each "trader" is a question-answering strategy: a way to use the BRO to answer questions. We give each possible strategy for using the BRO some weight. However, our "market" is itself a BRO computation. So, each trader has access to the market itself (in addition to many other computations which the BRO can access for them).

But a BRO only has oracle access to machines using smaller BROs, right? So a trader can't access the market?

(I don't think very much directly about the tree-size-limited version of HCH, normally I think of bounded versions like HCH(P) = "Humans consulting P's predictions of HCH(P)".)

Comment by paulfchristiano on Relating HCH and Logical Induction · 2020-06-17T01:59:13.079Z · score: 4 (2 votes) · LW · GW

There are two salient ways to get better predictions: deliberation and trial+error. HCH is about deliberation, and logical inductors are about trial and error. The benefit of trial and error is that it works eventually. The cost is that it doesn't optimize what you want (unless what you want is the logical induction criterion) and that it will generally get taken over by consequentialists who can exercise malicious influence a constant number of times before the asymptotics assert themselves. The benefit of deliberation is that its preferences are potentially specified indirectly by the original deliberator (rather than externally by the criterion for trial and error) and that if the original deliberator is strong enough they may suppress internal selection pressures, while the cost is that who knows if it works.

Comment by paulfchristiano on Karma fluctuations? · 2020-06-11T00:47:47.181Z · score: 15 (8 votes) · LW · GW
Is downvoting really used here for posts that are not spam or trolling?

Yes.

But I guess I’m surprised if people actually behave that way?

What makes this surprising?

some posts are controversial enough to receive active downvotes vs passive ignoring.

The point is to downvote content that you want to see less of, not content that you disagree with. If by "controversial" you mean "that some people don't want to see it," then I can't speak for others but I can say that personally the whole internet is full of content that I don't want to see (including and indeed especially content that I mostly agree with).

Comment by paulfchristiano on Inaccessible information · 2020-06-06T16:19:38.106Z · score: 2 (1 votes) · LW · GW

I agree that if you had a handle on accessing average optimal value then you'd be making headway.

I don't think it covers everything, since e.g. safety / integrity of deliberation / etc. are also important, and because instrumental values aren't quite clean enough (e.g. even if AI safety was super easy these agents would only work on the version that was useful for optimizing values from the mixture used).

But my bigger Q is how to make headway on accessing average optimal value, and whether we're able to make the problem easier by focusing on average optimal value.

Comment by paulfchristiano on Reply to Paul Christiano on Inaccessible Information · 2020-06-05T18:22:05.960Z · score: 34 (13 votes) · LW · GW

I thought this was a great summary, thanks!

Yes it’s true that much of MIRI’s research is about finding a solution to the design problem for intelligent systems that does not rest on a blind search for policies that satisfy some evaluation procedure. But it seems strange to describe this approach as “hope you can find some other way to produce powerful AI”, as though we know of no other approach to engineering sophisticated systems other than search.

I agree that the success of design in other domains is a great sign and reason for hope. But for now such approaches are being badly outperformed by search (in AI).

Maybe it's unfair to say "find some other way to produce powerful AI" because we already know the way: just design it yourself. But I think "design" is basically just another word for "find some way to do it," and we don't yet have any history of competitive designs to imitate or extrapolate from.

Personally, the main reason I'm optimistic about design in the future is that the designers may themselves be AI systems. That may help close the current gap between design and search, since both could then benefit from large amounts of computing power. (And it's plausible that we are currently bottlenecked on a meta-design problem of figuring out how to build automated designers.) That said, it's completely unclear whether that will actually beat search.

I consider my job as preparing for the worst w.r.t. search, since that currently seems like a better place to invest resources (and I think it's reasonably likely that dangerous search will be involved even if our AI ecosystem mostly revolves around design). I do think that I'd fall back to pushing on design if this ended up looking hopeless enough. If that happens, I'm hoping that by that time we'll have some much harder evidence that search is a lost cause, so that we can get other people to also jump ship from search to design.

Comment by paulfchristiano on Inaccessible information · 2020-06-03T23:36:34.060Z · score: 8 (4 votes) · LW · GW
To help check my understanding, your previously described proposal to access this "inaccessible" information involves building corrigible AI via iterated amplification, then using that AI to capture "flexible influence over the future", right? Have you become more pessimistic about this proposal, or are you just explaining some existing doubts? Can you explain in more detail why you think it may fail?
(I'll try to guess.) Is it that corrigibility is about short-term preferences-on-reflection and short-term preferences-on-reflection may themselves be inaccessible information?

I think that's right. The difficulty is that short-term preferences-on-reflection depend on "how good is this situation actually?" and that judgment is inaccessible.

This post doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall. This post is part of the effort to pin down the hard cases for iterated amplification, which I suspect will also be hard cases for other alignment strategies (for the kinds of reasons discussed in this post).

This seems similar to what I wrote in an earlier thread: "What if the user fails to realize that a certain kind of resource is valuable?

Yeah, I think that's similar. I'm including this as part of the alignment problem---if unaligned AIs realize that a certain kind of resource is valuable but aligned AIs don't realize that, or can't integrate it with knowledge about what the users want (well enough to do strategy stealing) then we've failed to build competitive aligned AI.

(By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)"

Yes.

At the time I thought you proposed to solve this problem by using the user's "preferences-on-reflection", which presumably would correctly value all resources/costs. So again is it just that "preferences-on-reflection" may itself be inaccessible?

Yes.

Besides the above, can you give some more examples of (what you think may be) "inaccessible knowledge that is never produced by amplification"?

If we are using iterated amplification to try to train a system that answers the question "What action will put me in the best position to flourish over the long term?" then in some sense the only inaccessible information that matters is "To what extent will this action put me in a good position to flourish?" That information is potentially inaccessible because it depends on the kind of inaccessible information described in this post---what technologies are valuable? what's the political situation? am I being manipulated? is my physical environment being manipulated?---and so forth. That information in turn is potentially inaccessible because it may depend on internal features of models that are only validated by trial and error, for which we can't elicit the correct answer either by directly checking it nor by transfer from other accessible features of the model.

(I might be misunderstanding your question.)

(I guess an overall feedback is that in most of the post you discuss inaccessible information without talking about amplification, and then quickly talk about amplification in the last section, but it's not easy to see how the two ideas relate without more explanations and examples.)

By default I don't expect to give enough explanations or examples :) My next step in this direction will be thinking through possible approaches for eliciting inaccessible information, which I may write about but which I don't expect to be significantly more useful than this. I'm not that motivated to invest a ton of time in writing about these issues clearly because I think it's fairly likely that my understanding will change substantially with more thinking, and I think this isn't a natural kind of "checkpoint" to try to explain clearly. Like most posts on my blog, you should probably regard this primarily as a record of Paul's thinking. (Though it would be great if it could be useful as explanation as a side effect, and I'm willing to put in a some time to try to make it useful as explanation, just not the amount of time that I expect would be required.)

(My next steps on exposition will be trying to better explain more fundamental aspects of my view.)

Comment by paulfchristiano on Inaccessible information · 2020-06-03T23:25:23.242Z · score: 6 (3 votes) · LW · GW

I don't mean to say that "What's the weight of Neptune?" is accessible if a model transfers to saying "The weight of Neptune is 100kg." I mean that "What's the weight of Neptune?" is accessible if a model transfers to correctly reporting the weight of Neptune (or rather if it transfers in such a way that its answers give real evidence about the weight of Neptune, or rather that the evidence is accessible in that case, or... you can see why it's hard to be formal).

If we wanted to be more formal but less correct, we could talk about accessibility of functions from possible worlds. Then a function f* is accessible when you can check a claimed value of f*(real world) (using oracles for other accessible functions), or if you can find some encoding R of functions and some value r* such that the simplest function g mapping R(f) -> f(real world) for all accessible functions f also maps r* -> f*(real world).

Comment by paulfchristiano on Solar system colonisation might not be driven by economics · 2020-04-24T20:03:27.770Z · score: 5 (3 votes) · LW · GW

I think we have about 10 more doublings of energy consumption before we're using most incident solar energy. We're currently doubling energy use every few decades, so that could sustain a few centuries of growth at the current rate. (Like many folks on LW, I expect growth to accelerate enough that we start running up against those limits within this century though.)
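(Rough arithmetic behind that; the 30-year doubling time is a placeholder within "every few decades".)

```python
doublings_left = 10             # until we're using most incident solar energy
doubling_time_years = 30        # placeholder for "every few decades"

print(2 ** doublings_left)                   # ~1000x current energy use
print(doublings_left * doubling_time_years)  # ~300 years, i.e. a few centuries of growth
```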

Comment by paulfchristiano on Solar system colonisation might not be driven by economics · 2020-04-23T03:27:04.851Z · score: 4 (3 votes) · LW · GW

What timescale are you talking about? I guess it's asking "are we going to colonize the solar system before growing a lot here on earth?" I agree that seems pretty unlikely though I'm not sure this is the best argument.

My default expectation would be that humans would be motivated to move to space when we've expanded enough that doing things on earth is getting expensive---we are running out of space, sunlight, material, or whatever else. You don't have to extrapolate growth that far before you start having quite severe crunches, so if growth continues (even at the current rate) then it won't be that long before we are colonizing the solar system.

(Even if people did expand into space before we needed the resources, it wouldn't matter much since they'd be easily overtaken by later colonists.)

Comment by paulfchristiano on Seemingly Popular Covid-19 Model is Obvious Nonsense · 2020-04-13T01:37:49.922Z · score: 26 (8 votes) · LW · GW

From a quick skim of the paper it looks like they effectively assume that implementing any 3 of those social distancing measures at the same time that Wuhan implemented their lockdown would lead to the same number of total deaths (with some adjustments).

This is less aggressive than assuming no new deaths after lockdown, but does seem quite optimistic given that the lockdown in Wuhan seems (much) more severe than school closures + travel restrictions + non-essential business closures. And this part of the model seems to be assumed rather than fit to data.

Comment by paulfchristiano on Three Kinds of Competitiveness · 2020-04-01T01:19:07.370Z · score: 5 (3 votes) · LW · GW

I think our current best implementation of IDA would neither be competitive nor scalably aligned :)

Comment by paulfchristiano on Three Kinds of Competitiveness · 2020-03-31T16:49:23.928Z · score: 4 (2 votes) · LW · GW

In most cases you can continuously trade off performance and cost; for that reason I usually think of them as a single metric of "competitive with X% overhead." I agree there are cases where they come apart, but I think there are pretty few examples. (Even for nuclear weapons you could ask "how much more expensive is it to run a similarly destructive bombing campaign with conventional explosives?")

I think this works best if you consider a sequence of increments, each worth +10%, rather than, say, accumulating 70 of those increments, because "spend 1000x more" is normally not an available option and so we don't have a useful handle on what a technology looks like when scaled up 1000x (and that scaleup would usually involve a bunch of changes that are hard to anticipate).

That is, if we have a sequence of technologies A0, A1, A2, ..., AN, each of which is 10% cheaper than the one before, then we may say that AN is better than A0 by N 10% steps (rather than trying to directly evaluate how many orders of magnitude you'd have to spend on A0 to compete with AN, because the process "spend a thousand times more on A0 in a not-stupid way" is actually kind of hard to imagine).
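
For a rough sense of scale (my arithmetic, not from the comment), seventy-odd increments of +10% compound to roughly a factor of 1000:

```python
import math

step = 1.10                                  # one "+10%" increment
print(f"{step ** 70:.0f}x")                  # 70 steps compound to ~790x
print(f"{math.log(1000, step):.1f} steps")   # ~72.5 steps of +10% equal exactly 1000x
```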

Comment by paulfchristiano on Three Kinds of Competitiveness · 2020-03-31T16:43:32.877Z · score: 5 (3 votes) · LW · GW

IDA is really aiming to be cost-competitive and performance-competitive, say to within 10% overhead. That may or may not be possible, but it's the goal.

If the compute required to build and run your reward function is small relative to the compute required to train your model, then it seems like overhead is small. If you can do semi-supervised RL and only require a reward function evaluation on a minority of trajectories (e.g. because most of the work is learning about how to manipulate the environment), then you can be OK as long as the cost of running the reward function isn't too much higher.

Whether that's possible is a big open question. Whether it's date-competitive depends on how fast you figure out how to do it.
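
To illustrate the overhead reasoning above with made-up numbers (none of these figures come from the comment):

```python
# Toy overhead estimate for semi-supervised RL where the learned reward function
# is only evaluated on a fraction of trajectories. All numbers are illustrative.
model_cost_per_step = 1.0     # normalize the policy's compute per environment step
reward_cost_per_eval = 0.5    # compute for one reward-function evaluation, same units
labeled_fraction = 0.1        # reward function queried on 10% of trajectories

overhead = labeled_fraction * reward_cost_per_eval / model_cost_per_step
print(f"extra compute from the reward function: ~{overhead:.0%}")   # ~5%
```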

Comment by paulfchristiano on What are the most plausible "AI Safety warning shot" scenarios? · 2020-03-27T16:00:02.205Z · score: 20 (12 votes) · LW · GW

I think "makes 50% of currently-skeptical people change their minds" is a high bar for a warning shot. On that definition e.g. COVID-19 will probably not be a warning shot for existential risk from pandemics. I do think it is plausible that AI warning shots won't be much better than pandemic warning shots. (On your definition it seems likely that there won't ever again be a warning shot for any existential risk.)

For a more normal bar, I expect plenty of AI systems to fail at large scales in ways that seem like "malice," and then to cover up the fact that they've failed. AI employees will embezzle funds, AI assistants will threaten and manipulate their users, AI soldiers will desert. Events like this will make it clear to most people that there is a serious problem, and plenty of people will be working on it just in order to make AI useful. The base rate will remain low, but there will be periodic high-profile blow-ups.

I don't expect the kind of total unity of AI motivations you are imagining, where all of them want to take over the world (so that the only case where you see something frightening is a failed bid to take over the world). That seems pretty unlikely to me, though it's conceivable (maybe 10-20%?) and may be an important risk scenario. I think it's much more likely that we stamp out all of the other failures gradually and are left with only the patient+treacherous failures, and in that case whether it's a warning shot or not depends entirely on how much people are willing to generalize.

I do think the situation in the AI community will be radically different after observing these kinds of warning shots, even if we don't observe an AI literally taking over a country.

> There is a very narrow range of AI capability between "too stupid to do significant damage of the sort that would scare people" and "too smart to fail at takeover if it tried."

Why do you think this is true? Do you think it's true of humans? I think it's plausible if you require "take over a country" but not if you require e.g. "kill plenty of people" or "scare people who hear about it a lot."

(This is all focused on intent alignment warning shots. I expect there will also be other scary consequences of AI that get people's attention, but the argument in your post seemed to be just about intent alignment failures.)

Comment by paulfchristiano on March Coronavirus Open Thread · 2020-03-12T02:36:09.213Z · score: 22 (11 votes) · LW · GW

Disclaimer: I don't know if this is right; I'm reasoning entirely from first principles.

If there is dispersion in R0, then there would likely be some places where the virus survives even if you take draconian measures. If you later relax those draconian measures, it will begin spreading in the larger population again at the same rate as before.

In particular, if the number of cases is currently decreasing overall most places, then soon most of the cases will be in regions or communities where containment was less successful and so the number of cases will stop decreasing.
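
A toy version of this dynamic (my own illustration; the numbers and the two-community split are made up):

```python
# Two communities under the same measures: effective reproduction number below 1
# in one, above 1 in the other. Total cases fall at first, then the poorly
# contained community dominates and the total starts rising again.
cases = {"well_contained": 10_000.0, "poorly_contained": 100.0}
r_eff = {"well_contained": 0.7, "poorly_contained": 1.1}

for generation in range(20):
    total = sum(cases.values())
    share = cases["poorly_contained"] / total
    print(f"gen {generation:2d}: total ≈ {total:7.0f} "
          f"({share:.0%} in the poorly contained community)")
    for community in cases:
        cases[community] *= r_eff[community]
```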

If it's infeasible to literally stamp it out everywhere (which is what I've heard), then you basically want either to delay long enough to have a vaccine or to have people get sick at the largest rate that the health care system can handle.

Comment by paulfchristiano on Writeup: Progress on AI Safety via Debate · 2020-02-20T02:37:43.449Z · score: 6 (3 votes) · LW · GW

The intuitive idea is to share activations as well as weights, i.e. to have two heads (or more realistically one head consulted twice) on top of the same model. There is a fair amount of uncertainty about this kind of "detail" but I think for now it's smaller than the fundamental uncertainty about whether anything in this vague direction will work.
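
A minimal sketch of what "one head consulted twice on top of the same model" could look like (my own toy architecture, assuming a PyTorch-style setup; nothing here comes from the writeup):

```python
import torch
import torch.nn as nn

class SharedTrunkDebater(nn.Module):
    """Toy model: one trunk computes activations for the debate transcript, and a
    single head (shared weights) is consulted for whichever debater moves next."""
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.trunk = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)   # shared head, consulted twice

    def forward(self, transcript_tokens):
        activations, _ = self.trunk(self.embed(transcript_tokens))   # shared activations
        return self.head(activations[:, -1])                         # next-move logits

agent = SharedTrunkDebater()
transcript = torch.randint(0, 1000, (1, 16))
logits = agent(transcript)   # the same weights and activations serve both debaters
```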

Comment by paulfchristiano on On the falsifiability of hypercomputation, part 2: finite input streams · 2020-02-17T20:27:51.761Z · score: 6 (3 votes) · LW · GW

It's an interesting coincidence that arbitration is the strongest thing we can falsify, and also apparently the strongest thing that can consistently apply to itself (if we allow probabilistic arbitration). Maybe not a coincidence?

Comment by paulfchristiano on On the falsifiability of hypercomputation, part 2: finite input streams · 2020-02-17T20:27:35.185Z · score: 8 (4 votes) · LW · GW

It's not obvious to me that "consistent with PA" is the right standard for falsification, though. It seems like simplicity considerations might lead you to adopt a stronger theory, and that this might allow for some weaker probabilistic version of falsification for things beyond arbitration. After all, how did we get induction anyway?

(Do we need induction, or could we think of falsification as being relative to some weaker theory?)

(Maybe this is just advocating for epistemic norms other than falsification though. It seems like the above move would be analogous to saying: the hypothesis that X is a halting oracle is really simple and explains the data, so we'll go with it even though it's not falsifiable.)