Review of "Learning Normativity: A Research Agenda" 2021-06-06T13:33:28.371Z
Review of "Fun with +12 OOMs of Compute" 2021-03-28T14:55:36.984Z
A Critique of Non-Obstruction 2021-02-03T08:45:42.228Z
Optimal play in human-judged Debate usually won't answer your question 2021-01-27T07:34:18.958Z
Literature Review on Goal-Directedness 2021-01-18T11:15:36.710Z


Comment by Joe_Collman on Teaching ML to answer questions honestly instead of predicting human answers · 2021-06-01T22:03:21.113Z · LW · GW

Ok, that all makes sense, thanks.

...and is-correct basically just tests whether they are equal.

So here "equal" would presumably be "essentially equal in the judgement of complex process", rather than verbatim equality of labels (the latter seems silly to me; if it's not silly I must be missing something).

Comment by Joe_Collman on Teaching ML to answer questions honestly instead of predicting human answers · 2021-05-28T21:40:07.919Z · LW · GW

Very interesting, thanks.

Just to check I'm understanding correctly, in step 2, do you imagine the complex labelling process deferring to the simple process iff the simple process is correct (according to the complex process)? Assuming that we require precise agreement, something of that kind seems necessary to me.

I.e. the labelling process would be doing something like this:

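# Return a pair of (simple, complex) labels for a given input
def label_pair(input):
    simple_label = GenerateSimpleLabel(input)

    if is_correct(simple_label, input):
        # The complex process judges the simple label correct, so it defers to it
        return simple_label, simple_label
    else:
        # Otherwise the complex process supplies its own label
        return simple_label, GenerateComplexLabel(input)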

Does that make sense?

A couple of typos: 
"...we are only worried if the model [understands? knows?] the dynamics..."
"...don’t collect training data in situations without [where?] strong adversaries are trying..."

Comment by Joe_Collman on AMA: Paul Christiano, alignment researcher · 2021-05-03T05:14:43.741Z · LW · GW

Thanks, that's very helpful. It still feels to me like there's a significant issue here, but I need to think more. At present I'm too confused to get much beyond handwaving.

A few immediate thoughts (mainly for clarification; not sure anything here merits response):

  • I had been thinking too much lately of [isolated human] rather than [human process].
  • I agree the issue I want to point to isn't precisely OOD generalisation. Rather it's that the training data won't be representative of the thing you'd like the system to learn: you want to convey X, and you actually convey [output of human process aiming to convey X]. I'm worried not about bias in the communication of X, but about properties of the generating process that can be inferred from the patterns of that bias.
  • It does seem hard to ensure you don't end up OOD in a significant sense. E.g. if the content of a post-deployment question can sometimes be used to infer information about the questioner's resource levels or motives.
  • The opportunity costs I was thinking about were in altruistic terms: where H has huge computational resources, or the questioner has huge resources to act in the world, [the most beneficial information H can provide] would often be better for the world than [good direct answer to the question]. More [persuasion by ML] than [extortion by ML].
    • If (part of) H would ever ideally like to use resources to output [beneficial information], but gives direct answers in order not to get thrown off the project, then (part of) H is deceptively aligned. Learning from a (partially) deceptively aligned process seems unsafe.
  • W.r.t. H's making value calls, my worry isn't that they're asked to make value calls, but that every decision is an implicit value call (when you can respond with free text, at least).

I'm going to try writing up the core of my worry in more precise terms.
It's still very possible that any non-trivial substance evaporates under closer scrutiny.

Comment by Joe_Collman on AMA: Paul Christiano, alignment researcher · 2021-04-29T04:45:58.598Z · LW · GW

I'd be interested in your thoughts on human motivation in HCH and amplification schemes.
Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?

Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, output needn't relate to the question/task.
[I don't think you've addressed this at all recently - I've only come across specifying enlightened judgement precisely]

I'd appreciate if you could say if/where you disagree with the following kind of argument.
I'd like to know what I'm missing:

Motivation seems like an eventual issue for imitative amplification. Even for an H who always attempted to give good direct answers to questions in training, the best models at predicting H's output would account for differing levels of enthusiasm, focus, effort, frustration... based in part on H's attitude to the question and the opportunity cost in answering it directly.

The 'correct' (w.r.t. alignment preservation) generalisation must presumably be in all circumstances to give the output that H would give. In scenarios where H wouldn't directly answer the question (e.g. because H believed the value of answering the question were trivial relative to opportunity cost), this might include deception, power-seeking etc. Usually I'd expect high value true-and-useful information unrelated to the question; deception-for-our-own-good just can't be ruled out.

If a system doesn't always adapt to give the output H would, on what basis do we trust it to adapt in ways we would endorse? It's unclear to me how we avoid throwing the baby out with the bathwater here.

Or would you expect to find Hs for whom such scenarios wouldn't occur? This seems unlikely to me: opportunity cost would scale with capability, and I'd predict every H would have their price (generally I'm more confident of this for precisely the kinds of H I'd want amplified: rational, altruistic...).

If we can't find such Hs, doesn't this at least present a problem for detecting training issues?: if HCH may avoid direct answers or deceive you (for worthy-according-to-H reasons), then an IDA of that H eventually would too. At that point you'd need to distinguish [benign non-question-related information] and [benevolent deception] from [malign obfuscation/deception], which seems hard (though perhaps no harder than achieving existing oversight desiderata???).

Even assuming that succeeds, you wouldn't end up with a general-purpose question-answerer or task-solver: you'd get an agent that does whatever an amplified [model predicting H-diligently-answering-training-questions] thinks is best. This doesn't seem competitive across enough contexts.

...but hopefully I'm missing something.

Comment by Joe_Collman on Auctioning Off the Top Slot in Your Reading List · 2021-04-15T00:26:41.511Z · LW · GW

That's a good point, though I do still think you need the right motivation. Where you're convinced you're right, it's very easy to skim past passages that are 'obviously' incorrect, and fail to question assumptions.
(More generally, I do wonder what's a good heuristic for this - clearly it's not practical to constantly go back to first principles on everything; I'm not sure how to distinguish [this person is applying a poor heuristic] from [this person is applying a good heuristic to very different initial beliefs])

Perhaps the best would be a combination: a conversation which hopefully leaves you with the thought that you might be wrong, followed by the book to allow you to go into things on your own time without so much worry over losing face or winning.

Another point on the cause-for-optimism side is that being earnestly interested in knowing the truth is a big first step, and I think that description fits everyone mentioned so far.

Comment by Joe_Collman on Auctioning Off the Top Slot in Your Reading List · 2021-04-14T22:11:26.851Z · LW · GW

I'd guess that reciprocal exchanges might work better for friends:
I'll read any m books you pick, so long as you read the n books I pick.

Less likely to get financial ick-factor, and it's always possible that you'll gain from reading the books they recommend.

Perhaps this could scale to public intellectuals where there's either a feeling of trust or some verification mechanism (e.g. if the intellectual wants more people to read [some neglected X], and would willingly trade their time reading Y for a world where X were more widely appreciated).

Whether or not money is involved, I'm sceptical of the likely results for public intellectuals - or in general for people strongly attached to some viewpoint. The usual result seems to be a failure to engage with the relevant points. (perhaps not attacking head-on is the best approach: e.g. the asymmetrical weapons post might be a good place to start for Deutsch/Pinker)

Specifically, I'm thinking of David Deutsch speaking about AGI risk with Sam Harris: he just ends up telling a story where things go ok (or no worse than with humans), and the implicit argument is something like "I can imagine things going ok, and people have been incorrectly worried about things before, so this will probably be fine too". Certainly Sam's not the greatest technical advocate on the AGI risk side, but "I can imagine things going ok..." is a pretty general strategy.

The same goes for Steven Pinker, who spends nearly two hours with Stuart Russell on the FLI podcast, and seems to avoid actually thinking in favour of simply repeating the things he already believes. There's quite a bit of [I can imagine things going ok...], [People have been wrong about downsides in the past...], and [here's an argument against your trivial example], but no engagement with the more general points behind the trivial example.

Steven Pinker has more than enough intelligence to engage properly and re-think things, but he just pattern-matches any AI risk argument to [some scary argument that the future will be worse] and short-circuits to enlightenment-now cached thoughts. (to be fair to Steven, I imagine doing a book tour will tend to set related cached thoughts in stone, so this is a particularly hard case... but you'd hope someone who focuses on the way the brain works would realise this danger and adjust)

When you're up against this kind of pattern-matching, I don't think even the ideal book is likely to do much good. If two hours with Stuart Russell doesn't work, it's hard to see what would.

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-04-13T17:39:44.412Z · LW · GW

Unless I've confused myself badly (always possible!), I think either's fine here. The | version just takes out a factor that'll be common to all hypotheses: [p(e+) / p(e-)]. (since p(Tk & e+) ≡ p(Tk | e+) * p(e+))

Since we'll renormalise, common factors don't matter. Using the | version felt right to me at the time, but whatever allows clearer thinking is the way forward.
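As a quick numerical check of the common-factor point, here is a minimal sketch. The joint probabilities are made-up illustrative numbers, not estimates of anything; the only claim is that the & ratios and the | ratios differ by the same factor p(e+) / p(e-) for every k, so they agree after renormalising.

```python
import math

# Hypothetical joint probabilities (p(Tk & e+), p(Tk & e-)) for three values of k.
joint = {1: (0.02, 0.03), 8: (0.30, 0.10), 13: (0.05, 0.50)}

p_e_plus = sum(plus for plus, _ in joint.values())
p_e_minus = sum(minus for _, minus in joint.values())
common_factor = p_e_plus / p_e_minus

# "&" version: p(Tk & e+) : p(Tk & e-)
and_ratio = {k: plus / minus for k, (plus, minus) in joint.items()}
# "|" version: p(Tk | e+) : p(Tk | e-)
bar_ratio = {k: (plus / p_e_plus) / (minus / p_e_minus)
             for k, (plus, minus) in joint.items()}

def normalise(weights):
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

for k in joint:
    # The two versions differ only by a factor that is the same for all k...
    assert math.isclose(and_ratio[k], bar_ratio[k] * common_factor)
    # ...so they coincide once renormalised.
    assert math.isclose(normalise(and_ratio)[k], normalise(bar_ratio)[k])
```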

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-04-12T19:59:25.802Z · LW · GW

Taking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don't think something like "It's directly clear that 9 OOM will almost certainly be enough, by a similar argument".

Certainly if they do conclude anything like that, then it's going to massively drop their odds on 9-12 too. However, I'd still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range proportionately more than the 1-4 range.

On (1), I agree that the same goes for pretty-much any argument: that's why it's important. If you update without factoring in (some approximation of) your best judgement of the evidence's impact on all hypotheses, you're going to get the wrong answer. This will depend highly on your underlying model.

On the information content of the post, I'd say it's something like "12 OOMs is probably enough (without things needing to scale surprisingly well)". My credence for low OOM values is mostly based on worlds where things scale surprisingly well.

But this is a bit weird; my post didn't talk about the <7 range at all, so why would it disproportionately rule out stuff in that range?

I don't think this is weird. What matters isn't what the post talks about directly - it's the impact of the evidence provided on the various hypotheses. There's nothing inherently weird about evidence increasing our credence in [TAI by +10OOM] and leaving our credence in [TAI by +3OOM] almost unaltered (quite plausibly because it's not too relevant to the +3OOM case).

Compare the 1-2-3 coins example: learning y tells you nothing about the value of x. It's only ruling out any part of the 1 outcome in the sense that it maintains [x_heads & something independent is heads], and rules out [x_heads & something independent is tails]. It doesn't need to talk about x to do this.

You can do the same thing with the TAI first at k OOM case - call that Tk. Let's say that your post is our evidence e and that e+ stands for [e gives a compelling argument against T13+].
Updating on e+ you get the following for each k:
Initial hypotheses: [Tk & e+], [Tk & e-]
Final hypothesis: [Tk & e+]

So what ends up mattering is the ratio p[Tk | e+] : p[Tk | e-]
I'm claiming that this ratio is likely to vary with k.

Specifically, I'd expect T1 to be almost precisely independent of e+, while I'd expect T8 to be correlated. My reason on the T1 is that I think something radically unexpected would need to occur for T1 to hold, and your post just doesn't seem to give any evidence for/against that.
I expect most people would change their T8 credence on seeing the post and accepting its arguments (if they've not thought similar things before). The direction would depend on whether they thought the post's arguments could apply equally well to ~8 OOM as 12.

Note that I am assuming the argument ruling out 13+ OOM is as in the post (or similar).
If it could take any form, then it could be a more or less direct argument for T1.

Overall, I'd expect most people who agree with the post's argument to update along the following lines (but smoothly):
T0 to Ta: low increase in credence
Ta to Tb: higher increase in credence
Tb+: reduced credence

with something like (0 < a < 6) and (4 < b < 13).
I'm pretty sure a is going to be non-zero for many people.
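A rough sketch of the shape of update I have in mind, using the standard form posterior ∝ prior × p(e+ | Tk). The buckets, prior and likelihood numbers are all hypothetical and chosen only for illustration. Note that "Tk is independent of e+" corresponds to p(e+ | Tk) being equal to the overall p(e+), which is why the low bucket's likelihood is set close to that value in the second case.

```python
def update(prior, likelihood):
    """Bayes: posterior[k] is proportional to prior[k] * p(e+ | Tk)."""
    unnormalised = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}

# Hypothetical coarse prior over "TAI first at k OOM" (illustrative only).
prior = {"1-4": 0.10, "5-8": 0.30, "9-12": 0.30, "13+": 0.30}

# Case 1: e+ is equally likely under every hypothesis up to 12 OOM.
# This is exactly "throw out most of 13+ and renormalise".
flat = {"1-4": 1.0, "5-8": 1.0, "9-12": 1.0, "13+": 0.1}

# Case 2: e+ is likelier in worlds where straightforward scaling works,
# and close to independent of the low bucket (0.7 is near the overall p(e+)).
tilted = {"1-4": 0.7, "5-8": 0.9, "9-12": 1.0, "13+": 0.1}

post_flat = update(prior, flat)
post_tilted = update(prior, tilted)

for k in prior:
    print(k, round(post_flat[k] / prior[k], 2), round(post_tilted[k] / prior[k], 2))
```

In the flat case every surviving bucket grows by the same factor. In the tilted case the low bucket barely moves, the middle buckets grow more, and 13+ shrinks, which is the pattern described above.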

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-04-04T21:36:43.909Z · LW · GW

[[ETA, I'm not claiming the >12 OOM mass must all go somewhere other than the <4 OOM case: this was a hypothetical example for the sake of simplicity. I was saying that if I had such a model (with zwomples or the like), then a perfectly good update could leave me with the same posterior credence on <4 OOM.
In fact my credence on <4 OOM was increased, but only very slightly]]

First I should clarify that the only point I'm really confident on here is the "In general, you can't just throw out the >12 OOM and re-normalise, without further assumptions" argument. 

I'm making a weak claim: we're not in a position of complete ignorance w.r.t. the new evidence's impact on alternate hypotheses.

My confidence in any specific approach is much weaker: I know little relevant data.

That said, I think the main adjustment I'd make to your description is to add the possibility for sublinear scaling of compute requirements with current techniques. E.g. if beyond some threshold meta-learning efficiency benefits are linear in compute, and non-meta-learned capabilities would otherwise scale linearly, then capabilities could scale with the square root of compute (feel free to replace with a less silly example of your own).

This doesn't require "We'll soon get more ideas" - just a version of "current methods scale" with unlucky (from the safety perspective) synergies.

So while the "current methods scale" hypothesis isn't confined to 7-12 OOMs, the distribution does depend on how things scale: a higher proportion of the 1-6 region is composed of "current methods scale (very) sublinearly".

My p(>12 OOM | sublinear scaling) was already low, so my p(1-6 OOM | sublinear scaling) doesn't get much of a post-update boost (not much mass to re-assign).
My p(>12 OOM | (super)linear scaling) was higher, but my p(1-6 OOM | (super)linear scaling) was low, so there's not too much of a boost there either (small proportion of mass assigned).

I do think it makes sense to end up with a post-update credence that's somewhat higher than before for the 1-6 range - just not proportionately higher. I'm confident the right answer for the lower range lies somewhere between [just renormalise] and [don't adjust at all], but I'm not at all sure where.

Perhaps there really is a strong argument that the post-update picture should look almost exactly like immediate renormalisation. My main point is that this does require an argument: I don't think it's a situation where we can claim complete ignorance over impact to other hypotheses (and so renormalise by default), and I don't think there's a good positive argument for [all hypotheses will be impacted evenly].

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-03-31T19:57:30.019Z · LW · GW

Yes, we're always renormalising at the end - it amounts to saying "...and the new evidence will impact all remaining hypotheses evenly". That's fine once it's true.

I think perhaps I wasn't clear with what I mean by saying "This doesn't say anything...".
I meant that it may say nothing in absolute terms - i.e. that I may put the same probability of [TAI at 4 OOM] after seeing the evidence as before.

This means that it does say something relative to other not-ruled-out hypotheses: if I'm saying the new evidence rules out >12 OOM, and I'm also saying that this evidence should leave p([TAI at 4 OOM]) fixed, I'm implicitly claiming that the >12 OOM mass must all go somewhere other than the 4 OOM case.

Again, this can be thought of as my claiming e.g.:
[TAI at 4 OOM] will happen if and only if zwomples work
There's a 20% chance zwomples work
The new 12 OOM evidence says nothing at all about zwomples

In terms of what I actually think, my sense is that the 12 OOM arguments are most significant where [there are no high-impact synergistic/amplifying/combinatorial effects I haven't thought of].
My credence for [TAI at < 4 OOM] is largely based on such effects. Perhaps it's 80% based on some such effect having transformative impact, and 20% on we-just-do-straightforward-stuff. [Caveat: this is all just off the top of my head; I have NOT thought for long about this, nor looked at much evidence; I think my reasoning is sound, but specific numbers may be way off]

Since the 12 OOM arguments are of the form we-just-do-straightforward-stuff, they cause me to update the 20% component, not the 80%. So the bulk of any mass transferred from >12 OOM goes to cases where p([we-just-did-straightforward-stuff and no strange high-impact synergies occurred]|[TAI first occurred at this level]) is high.

Comment by Joe_Collman on Conspicuous saving · 2021-03-31T14:43:34.081Z · LW · GW

It's not entirely clear to me either.
Here are a few quick related thoughts:

  1. We shouldn't assume it's clear that higher-long-term QoL is the primary motivator for most people who do save. For most of them, it's something their friends, family, co-workers... think is a good idea.
  2. Evolutionary fitness doesn't care (directly) about QoL.
  3. There may be unhelpful game theory at work. If, in a group where people tend to spend X, there's quite a bit to gain in spending [X + 1] and a significant loss in spending [X - 1], you'd expect group spending to increase.
  4. Even if we're talking about [what's effective] rather than [our evolutionary programming], we're still navigating other people's evolutionary programming. Being slightly above/below average in spending may send a disproportionate signal.
  5. The value of a faked signal is higher for people who don't have other channels to signal something similar.
  6. Other groups likely are sending similar signals in other ways. E.g. consider intellectuals sitting around having lengthy philosophical discussions that don't lead to action. They're often wasting time, simultaneously showing off skills that they could be using more productively, but aren't. (this is also a problem where it's a genuine waste - my point is only that very few people avoid doing this in some form)

Of course none of this makes it any less of a problem (to the extent it's bringing down collective QoL) - but possibly a difficult problem that we'd expect to exist.

Solutions-wise, my main thought is that you'd want to find a way to channel signalling-waste efficiently into public goods - so that personal 'waste' becomes a collective advantage (hopefully).

It is also worth noting that not all 'wasteful' spending is bad for society. E.g. consider early adopters of new and expensive technology: without people willing to 'waste' money on the Tesla Roadster, getting electric cars off the ground may have been a much harder problem.

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-03-30T13:09:40.509Z · LW · GW

We do gain evidence on at least some alternatives, but not on all the factors which determine the alternatives. If we know something about those factors, we can't usually just renormalise. Renormalising is a good default, but it amounts to an assumption of ignorance.

Here's a simple example:
We play a 'game' where you observe the outcome of two fair coin tosses x and y.
You score:
1 if x is heads
2 if x is tails and y is heads
3 if x is tails and y is tails

So your score predictions start out at:
1 : 50%
2 : 25%
3 : 25%

We look at y and see that it's heads. This rules out 3.
Renormalising would get us:
1 : 66.7%
2 : 33.3%
3 : 0%

This is clearly silly, since we ought to end up at 50:50 - i.e. all the mass from 3 should go to 2. This happens because the evidence that ruled out a score of 3 was irrelevant to the question "did you score 1 point?".
On the other hand, if we knew nothing about the existence of x or y, and only knew that we were starting from (1: 50%, 2: 25%, 3: 25%), and that 3 had been ruled out, it'd make sense to re-normalise.
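For anyone who wants to check the arithmetic, here is a minimal sketch that enumerates the four equally likely outcomes and compares conditioning on y with naively deleting the score of 3 and rescaling. The function and variable names are just illustrative.

```python
from fractions import Fraction
from itertools import product

def score(x, y):
    """Score rule from the game: 1 if x is heads, else 2 or 3 depending on y."""
    if x == "H":
        return 1
    return 2 if y == "H" else 3

# All four outcomes of two fair tosses, each with weight 1/4.
outcomes = [(x, y, Fraction(1, 4)) for x, y in product("HT", repeat=2)]

def score_distribution(worlds):
    """Normalised distribution over scores for a list of (x, y, weight) worlds."""
    total = sum(w for _, _, w in worlds)
    dist = {1: Fraction(0), 2: Fraction(0), 3: Fraction(0)}
    for x, y, w in worlds:
        dist[score(x, y)] += w / total
    return dist

prior = score_distribution(outcomes)

# Proper update: keep only the worlds consistent with seeing y == heads.
posterior = score_distribution([o for o in outcomes if o[1] == "H"])

# Naive update: drop score 3 from the prior and rescale what is left.
kept = {s: p for s, p in prior.items() if s != 3}
naive = {s: p / sum(kept.values()) for s, p in kept.items()}

print("prior    ", prior)
print("posterior", posterior)
print("naive    ", naive)
```

Conditioning sends all of the mass from 3 to 2, while the naive version inflates the probability of scoring 1 even though y carries no information about x.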

In the TAI case, we haven't only learned that 12 OOM is probably enough (if we agree on that). Rather we've seen specific evidence that leads us to think 12 OOM is probably enough. The specifics of that evidence can lead us to think things like "This doesn't say anything about TAI at +4 OOM, since my prediction for +4 is based on orthogonal variables", or perhaps "This makes me near-certain that TAI will happen by +10 OOM, since the +12 OOM argument didn't require more than that".

Comment by Joe_Collman on Review of "Fun with +12 OOMs of Compute" · 2021-03-29T18:26:56.634Z · LW · GW

If you have a bunch of hypotheses (e.g. "It'll take 1 more OOM," "It'll take 2 more OOMs," etc.) and you learn that some of them are false or unlikely (only 10% chance of it taking more than 12) then you should redistribute the mass over all your remaining hypotheses, preserving their relative strengths.

This depends on the mechanism by which you assigned the mass initially - in particular, whether it's absolute or relative. If you start out with specific absolute probability estimates as the strongest evidence for some hypotheses, then you can't just renormalise when you falsify others.

E.g. consider we start out with these beliefs:
If [approach X] is viable, TAI will take at most 5 OOM; 20% chance [approach X] is viable.
If [approach X] isn't viable, 0.1% chance TAI will take at most 5 OOM.
30% chance TAI will take at least 13 OOM.

We now get this new information:
There's a 95% chance [approach Y] is viable; if [approach Y] is viable TAI will take at most 12 OOM.

We now need to reassign most of the 30% mass we have on 13+ OOM, but we can't simply renormalise: we haven't (necessarily) gained any information on the viability of [approach X].
Our post-update [TAI <= 5OOM] credence should remain almost exactly 20%. Increasing it to ~26% would not make any sense.
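Here is a small worked version of this example. It rests on two assumptions that go beyond the text: that the update leaves 5% on 13+ OOM, and that the evidence about [approach Y] is independent of [approach X], so mass is only redistributed inside the "X isn't viable" branch. Under those assumptions naive renormalisation gives roughly 27% (the same ballpark as the ~26% above), while the structured update barely moves from 20%.

```python
# Prior beliefs from the example.
p_x = 0.20                 # [approach X] is viable
p_le5_given_x = 1.0        # X viable     -> TAI takes at most 5 OOM
p_le5_given_not_x = 0.001  # X not viable -> 0.1% chance of at most 5 OOM
p_ge13_prior = 0.30        # TAI takes at least 13 OOM

p_le5_prior = p_x * p_le5_given_x + (1 - p_x) * p_le5_given_not_x

# Assumption: after learning about [approach Y], 5% remains on 13+ OOM.
p_ge13_post = 0.05

# Naive renormalisation: scale every <=12 OOM hypothesis by the same factor.
naive_scale = (1 - p_ge13_post) / (1 - p_ge13_prior)
p_le5_naive = p_le5_prior * naive_scale

# Structured update: p_x is untouched. All 13+ mass sits in the not-X branch
# (X viable implies <=5 OOM), so renormalise only within that branch.
p_ge13_given_not_x_prior = p_ge13_prior / (1 - p_x)
p_ge13_given_not_x_post = p_ge13_post / (1 - p_x)
branch_scale = (1 - p_ge13_given_not_x_post) / (1 - p_ge13_given_not_x_prior)
p_le5_structured = p_x * p_le5_given_x + (1 - p_x) * p_le5_given_not_x * branch_scale

print(f"prior      {p_le5_prior:.4f}")
print(f"naive      {p_le5_naive:.4f}")
print(f"structured {p_le5_structured:.4f}")
```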

For AI timelines, we may well have some concrete, inside-view reasons to put absolute probabilities on contributing factors to short timelines (even without new breakthroughs we may put absolute numbers on statements of the form "[this kind of thing] scales/generalises"). These probabilities shouldn't necessarily be increased when we learn something giving evidence about other scenarios. (the probability of a short timeline should change, but in general not proportionately)

Perhaps if you're getting most of your initial distribution from a more outside-view perspective, then you're right.

Comment by Joe_Collman on Conspicuous saving · 2021-03-21T22:17:05.519Z · LW · GW

Oh I'm not claiming that non-wasted wealth signalling is useless. I'm saying that frivolous spending and saving send very different signals - and that saving doesn't send the kind of signal tEitB focuses on.

Whether a public saving-signal would actually help is an empirical question. My guess is that it wouldn't help in most contexts where unwise spending is currently the norm, since I'd expect it to signal lack of ability/confidence. Of course I may be wrong.

When considering status, I think wealth is largely valued as an indirect signal of ability (in a broad sense). E.g. compare getting $100m by founding a business vs winning a lottery. The lottery winner gets nothing like the status bump that the business founder gets.
This is another reason I think spending sends a stronger overall signal in many contexts: it (usually) says both [I had the ability to get this wealth] and [I have enough ability that I expect not to need it].

Comment by Joe_Collman on Conspicuous saving · 2021-03-21T14:40:28.967Z · LW · GW

This is interesting, but I think largely misses the point that elephant-in-the-brain-style signalling is often about sending the signal "I can afford to waste resources, because I've got great confidence in my ability to do well in the future without them".

Saving just doesn't achieve this - it achieves the opposite:
"Look at all my savings; I can't afford to waste any resources, since I have little confidence in my ability to do well in the future without them".

It makes evolutionary sense to signal ability rather than resources, since resources can't be passed on (until very recently, at least), and won't necessarily translate to all situations. By wasting resources, you're signalling your confidence you'll do well whatever the future throws at you.

If you want a signalling approach that improves the world, I think it has to be conspicuous donation, not conspicuous saving.

Comment by Joe_Collman on Voting-like mechanisms which address size of preferences? · 2021-03-19T01:09:19.858Z · LW · GW

Very interesting - I'll give some thought to answers, but for now a quick cached-thought comment:

A proposed solution: bills cannot be contradicted by bills which pass with less votes.

I don't think this is practical as a full solution to this problem, since a bill doesn't need to explicitly contradict a previous bill in order to make the previous one irrelevant.

You've made foobles legal? We'll require fooble licenses costing two years' training and a million dollars.
You've banned smarbling? We'll switch all resources from anti-smarbling enforcement to crack down on unlicensed foobles.

Of course you could craft the fooble/smarbling laws to avoid these pitfalls, but there's more than one way to smarble a fooble.

Comment by Joe_Collman on Strong Evidence is Common · 2021-03-17T23:09:26.650Z · LW · GW

Sure, but what I mean is that this is hard to do for hypothesis-location, since post-update you still have the hypothesis-locating information, and there's some chance that your "explaining away" was itself incorrect (or your memory is bad, you have bugs in your code...).

For an extreme case, take Donald's example, where the initial prior would be 8,000,000 bits against.
Locating the hypothesis there gives you ~8,000,000 bits of evidence. The amount you get in an "explaining away" process is bounded by your confidence in the new evidence. How sure are you that you correctly observed and interpreted the "explaining away" evidence? Maybe you're 20 bits sure; perhaps 40 bits sure. You're not 8,000,000 bits sure.

Then let's say you've updated down quite a few times, but not yet close to the initial prior value. For the next update, how sure are you that the stored value that you'll be using as your new prior is correct? If you're human, perhaps you misremembered; if a computer system, perhaps there's a bug...
Below a certain point, the new probability you arrive at will be dominated by contributions from weird bugs, misrememberings etc.
This remains true until/unless you lose the information describing the hypothesis itself.
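One way to make the bound concrete, as a toy model rather than anything rigorous: suppose that with probability eps = 2^-r your "explaining away" observation was misread (or misremembered, or produced by a bug) and so carries no information, and otherwise it carries the likelihood ratio you think it does. The function name and the mixture model are my own assumptions for illustration.

```python
import math

def effective_bits_against(claimed_bits, reliability_bits):
    """Bits of evidence against a hypothesis once fallibility is included.

    Assumed model: with probability eps = 2**-reliability_bits the
    observation is uninformative (likelihood ratio 1); otherwise it
    carries the claimed ratio 2**-claimed_bits.
    """
    eps = 2.0 ** -reliability_bits
    # Clamp the exponent so the claimed ratio stays representable as a float.
    claimed_ratio = 2.0 ** -min(claimed_bits, 1000)
    effective_ratio = (1 - eps) * claimed_ratio + eps
    return -math.log2(effective_ratio)

# Claiming 8,000,000 bits while being only "20 bits sure" yields about 20 bits.
strong = effective_bits_against(8_000_000, 20)
# A modest 5-bit claim is almost unaffected by the same fallibility.
modest = effective_bits_against(5, 20)

print(strong, modest)
```

However large the claimed evidence, the effective update in this model is capped at about reliability_bits.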

I'm not clear how much this is a practical problem - I agree you can update the odds of a hypothesis down to no-longer-worthy-of-consideration. In general, I don't think you can get back to the original prior without making invalid assumptions (e.g. zero probability of a bug/hack/hoax...), or losing the information that picks out the hypothesis.

Comment by Joe_Collman on Strong Evidence is Common · 2021-03-16T22:30:15.026Z · LW · GW

It's worth noting that most of the strong evidence here is in locating the hypothesis.
That doesn't apply to the juggling example - but that's not so much evidence. "I can juggle" might take you from 1:100 to 10:1. Still quite a bit, but 10 bits isn't 24.
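The arithmetic behind "10 bits", as a small sketch: bits of evidence are the base-2 log of the factor by which the odds change. The helper name is just illustrative.

```python
import math

def bits_of_evidence(prior_odds, posterior_odds):
    """Base-2 log of the factor by which the odds in favour were multiplied."""
    return math.log2(posterior_odds / prior_odds)

# "I can juggle": odds move from 1:100 to 10:1, a factor of 1000.
juggling_bits = bits_of_evidence(1 / 100, 10 / 1)

# For comparison, 24 bits corresponds to an odds factor of 2**24.
factor_for_24_bits = 2 ** 24

print(round(juggling_bits, 2), factor_for_24_bits)
```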

I think this relates to Donald's point on the asymmetry between getting from exponentially small to likely (commonplace) vs getting from likely to exponentially sure (rare). Locating a hypothesis can get you the first, but not the second.

It's even hard to get back to exponentially small chance of x once it seems plausible (this amounts to becoming exponentially sure of ¬x). E.g., if I say "My name is Mortimer Q. Snodgrass... Only kidding, it's actually Joe Collman", what are the odds that my name is Mortimer Q. Snodgrass? 1% perhaps, but it's nowhere near as low as the initial prior.
The only simple way to get all the way back is to lose/throw-away the hypothesis-locating information - which you can't do via a Bayesian update. I think that's what makes privileging the hypothesis such a costly error: in general you can't cleanly update your way back (if your evidence, memory and computation were perfectly reliable, you could - but they're not). The way to get back is to find the original error and throw it out.

How difficult is it to get into the top 1% of traders? To be 50% sure you're in the top 1%, you only need 200:1 evidence. This seemingly large odds ratio might be easy to get.

I don't think your examples say much about this. They're all of the form [trusted-in-context source] communicates [unlikely result]. They don't seem to show a reason to expect strong evidence may be easy to get when this pattern doesn't hold. (I suppose they do say that you should check for the pattern - and probably it is useful to occasionally be told "There may be low-hanging fruit. Look for it!")

Comment by Joe_Collman on Anna and Oliver discuss Children and X-Risk · 2021-03-15T21:47:01.849Z · LW · GW

I hope you find the time. I hadn't realised this was happening and would be interested in any thoughts and ideas. It's an issue that's high impact, broadly relevant, hard to get good data and easy to reason poorly about - so I'm glad to see some thoughtful discussion.

Comment by Joe_Collman on AstraZeneca COVID Vaccine and blood clots · 2021-03-15T15:19:03.631Z · LW · GW

Is there a source that shows there's even a correlation? Please link one if there is - perhaps I missed it. The reports I've seen don't suggest any - e.g. the BMJ report.

From what (little) I've seen, this seems to be evidence for the hypothesis "Anecdotes frequently cause officials with bad incentives to make harmful decisions".

Comment by Joe_Collman on Trapped Priors As A Basic Problem Of Rationality · 2021-03-14T00:46:32.860Z · LW · GW

I think it's important to distinguish between:

1) Rationality as truth-seeking.
2) Rationality as utility maximization.

For some of the examples these will go together. For others, moving closer to the truth may be a utility loss - e.g. for political zealots whose friends and colleagues tend to be political zealots.

It'd be interesting to see a comparison between such cases. At the least, you'd want to vary the following:

Having a very high prior on X's being true.
Having a strong desire to believe X is true.
Having a strong emotional response to X-style situations.
The expected loss/gain in incorrectly believing X to be true/false.

Cultists and zealots will often have a strong incentive to believe some X even if it's false, so it's not clear the high prior is doing most/much of the work there.

With trauma-based situations, it also seems particularly important to consider utilities: more to lose in incorrectly thinking things are safe, than in incorrectly thinking they're dangerous.
When you start out believing something's almost certainly very dangerous, you may be right. For a human, the utility-maximising move probably is to require more than the 'correct' amount of evidence to shift your belief (given that you're impulsive, foolish, impatient... and so can't necessarily be trusted to act in your own interests with an accurate assessment).

It's also worth noting that habituation can be irrational. If you're repeatedly in a situation where there's good reason to expect a 0.1% risk of death, but nothing bad happens the first 200 times, then you'll likely habituate to under-rate the risk - unless your awareness of the risk makes the experience of the situation appropriately horrifying each time.
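
A quick check of how little those 200 uneventful exposures tell you, as a sketch that assumes the exposures are independent and that the 0.1% figure is exactly right:

```python
risk_per_exposure = 0.001   # the 0.1% risk from the example
exposures = 200

# Chance of seeing nothing bad happen across all 200 exposures,
# assuming each exposure is independent of the others.
p_no_incident = (1.0 - risk_per_exposure) ** exposures
print(round(p_no_incident, 3))  # 0.819
```

So a clean record after 200 exposures is what you'd see about 82% of the time even when the risk really is 0.1%: it is only weak evidence that the risk is lower.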

On polar bears vs coyotes:

I don't think it's reasonable to label the ...I saw a polar bear... sensation as "evidence for bear". It's weak evidence for bear. It's stronger evidence for the beginning of a joke. For [polar bear] the [earnest report]:[joke] odds ratio is much lower than for [coyote].

I don't think you need to bring in any irrational bias to get this result. There's little shift in belief since it's very weak evidence.

If your friend never makes jokes, then the point may be reasonable. (in particular, for your friend to mistakenly earnestly believe she saw a polar bear, it's reasonable to assume that she already compensated for polar-bear unlikeliness; the same doesn't apply if she's telling a joke)

Comment by Joe_Collman on What I'd change about different philosophy fields · 2021-03-11T23:17:13.251Z · LW · GW

I don't mean that values converge.
I mean that if you take a truth-seeking approach to some fixed set of values, it won't matter whether you start out analysing them through the lens of utility/duty/virtue. In the limit you'll come to the same conclusions.

Comment by Joe_Collman on What I'd change about different philosophy fields · 2021-03-11T23:05:54.801Z · LW · GW

This is interesting. My initial instinct was to disagree, then to think you're pointing to something real... and now I'm unsure :)

First, I don't think your examples directly disagree with what I'm saying. Saying that our preferences can be represented by a UF over histories is not to say that these preferences only care about the physical history of our universe - they can care about non-physical predictions too (desirable anthropic measures and universal-prior-based manipulations included).

So then I assume we say something like:
"This makes our UF representation identical to that of a set of preferences which does only care about the physical history of our universe. Therefore we've lost that caring-about-other-worlds aspect of our values. The UF might fully determine actions in accordance with our values, but it doesn't fully express the values themselves."

Strictly, this seems true to me - but in practice I think we might be guilty of ignoring much of the content of our UF. For example, our UF contains preferences over histories containing philosophy discussions.

Now I claim that it's logically possible for a philosophy discussion to have no significant consequences outside the discussion (I realise this is hard to imagine, but please try).
Our UF will say something about such discussions. If such a UF is both fully consistent with having particular preferences over [anthropic measures, acausal trade, universal-prior-based influence...], and prefers philosophical statements that argue for precisely these preferences, we seem to have to be pretty obtuse to stick with "this is still perfectly consistent with caring only about [histories of the physical world]".

It's always possible to interpret such a UF as encoding only preferences directly about histories of the physical world. It's also possible to think that this post is in Russian, but contains many typos. I submit that это маловероятно.

If we say that the [preferences 'of' a UF] are the [distribution over preferences we'd ascribe to an agent acting according to that UF (over some large set of environments)], then I think we capture the "something substantive" with substantial probability mass in most cases.
(not always through this kind of arguing-for-itself mechanism; the more general point is that the UF contains huge amounts of information, and it'll be surprising if the expression of particular preferences doesn't show up in a priori unlikely patterns)

If we're still losing something, it feels like an epsilon's worth in most cases.
Perhaps there are important edge cases??

Note that I'm only claiming "almost precisely the information you're talking about is in there somewhere", not that the UF is necessarily a useful/efficient/clear way to present the information.
This is exactly the role I endorse for other perspectives: avoiding offensively impractical encodings of things we care about.

A second note: in practice, we're starting out with an uncertain world. Therefore, the inability of a UF over universe histories to express outside-the-universe-history preferences with certainty may not be of real-world relevance. Outside an idealised model, certainty won't happen for any approach.

Comment by Joe_Collman on What I'd change about different philosophy fields · 2021-03-11T19:58:21.932Z · LW · GW

A nit-pick:

Binding exceptionless commitments matter to understanding this complicated thing

I don't think "exceptionless commitments" is a useful category:
Any commitment is exceptionless within some domain.
No commitment is exceptionless over all domains (likely it's not even well-defined).

"exceptionless commitment" just seems like confusion/tautology to me:
So this commitment applies everywhere it applies? Ok.

Saying that it applies "without exception (over some set)" is no simpler than saying it applies "over some set". Either way, the set is almost certainly messy.

In practical terms, claiming a commitment is exceptionless usually means falling victim to an illusion of transparency: thinking that the domain is clear when it isn't (usually even to yourself).

E.g. "I commit never to X". Does this apply:
When I'm sleep deprived? Dreaming? Sleep-walking? Hallucinating? Drunk? How drunk? On drugs? Which drugs? Voluntarily? In pain? How much? Coerced? To what extent? Role-playing? When activity in areas of my brain is suppressed by electric fields? Which areas? To what degree? During/after brain surgery? What kind? After brain injury/disease? What kinds? Under hypnosis? When I've forgotten the commitment, believe I never made it, or honestly don't believe it applies? When my understanding of X changes? When possessed by a demon? When I believe I'm possessed by a demon? When replaced by a clone who didn't make the commitment? When I believe I'm such a clone?... (most of these may impact both whether I X, and whether I believe I am violating my commitment not to X)

For almost all human undertakings X, there are clear violations of "I commit to X", there are clear non-violations, and there's a complex boundary somewhere between the two. Adding "without exception" does not communicate the boundary.

Comment by Joe_Collman on What I'd change about different philosophy fields · 2021-03-11T18:00:51.354Z · LW · GW

Stop picking a 'side' and then losing all interest in the parts of human morality that aren't associated with your 'side': these are all just parts of the stew, and we need to work hard to understand them and reconcile them just right, not sort ourselves into Team Virtue vs. Team Utility vs. Team Duty.

Is anyone serious actually doing this? My sense is that people on a Team believe that all of human morality can be seen from the perspective they've chosen (and that this is correct). This may result in convoluted transformations to fit particular pieces into a given approach. I haven't seen it involve dismissal of anything substantive. (Or when you say "loss of interest", do you only mean focusing elsewhere? Is this a problem? Not everyone can focus on everything.)

E.g. Utility-functions-over-histories can capture any virtue or duty (UFs might look degenerate in some cases, but do exist). The one duty/virtue of "judging according to consequences over histories" captures utility...

For this reason, I don't think "...parts of the stew..." is a good metaphor, or that the biological analogy fits.

Closer might be "Architects have long-standing controversies, but they don't look like 'Which is the right perspective on objects: things that take up a particular space, things that look a particular way, or things with particular structural properties?'."

I don't see it as a problem to focus on using one specific lens - just so long as you don't delude yourself into thinking that a [good simple approximation through your lens] is necessarily a [good simple approximation].

Once the desire for artificial simplicity/elegance is abandoned, I don't think it much matters which lens you're using (they'll tend to converge in the limit). To me, Team Utility seems the most natural: "You need to consider all consequences, in the broadest sense" straightforwardly acknowledges that things are a mess. However, so too would "You need to consider all duties (/virtues), in the broadest sense".

Omit the "...all... in the broadest sense", and you're in trouble on any Team.

Comment by Joe_Collman on To first order, moral realism and moral anti-realism are the same thing · 2021-03-11T01:51:35.588Z · LW · GW

Then always choosing dust specks is worse, for everyone, than always choosing torture.

27602 may beg to differ.

Comment by Joe_Collman on Suggestions of posts on the AF to review · 2021-02-18T18:54:24.664Z · LW · GW

I don't see a good reason to exclude agenda-style posts, but I do think it'd be important to treat them differently from more here-is-a-specific-technical-result posts.

Broadly, we'd want to be improving the top-level collective AI alignment research 'algorithm'. With that in mind, I don't see an area where more feedback/clarification/critique of some kind wouldn't be helpful.
The questions seem to be:
What form should feedback/review... take in a given context?
Where is it most efficient to focus our efforts?

Productive feedback/clarification on high-level agendas seems potentially quite efficient. My worry would be to avoid excessive selection pressure towards paths that are clear and simply justified. However, where an agenda does use specific assumptions and arguments to motivate its direction, early 'review' seems useful.

Comment by Joe_Collman on The Catastrophic Convergence Conjecture · 2021-02-07T23:27:45.257Z · LW · GW

I understand what you mean with the CCC (and that this seems a bit of a nit-pick!), but I think the wording could usefully be clarified.

As you suggest here, the following is what you mean:

CCC says (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking"

However, that's not what the CCC currently says.
E.g. compare:
[Unaligned goals] tend to [have catastrophe-inducing optimal policies] because of [power-seeking incentives].
[People teleported to the moon] tend to [die] because of [lack of oxygen].

The latter doesn't lead to the conclusion: "If people teleported to the moon had oxygen, they wouldn't tend to die."

Your meaning will become clear to anyone who reads this sequence.
For anyone taking a more cursory look, I think it'd be clearer if your clarification were the official CCC:

CCC: (for non-evil goals) "if the optimal policy is catastrophic, then it's because of power-seeking"

Currently, I worry about people pulling an accidental motte-and-bailey on themselves, and thinking that [weak interpretation of CCC] implies [conclusions based on strong interpretation]. (or thinking that you're claiming this)

Comment by Joe_Collman on A Critique of Non-Obstruction · 2021-02-05T02:12:45.450Z · LW · GW

I think things are already fine for any spike outside S, e.g. paperclip maximiser, since non-obstruction doesn't say anything there.

I actually think saying "our goals aren't on a spike" amounts to a stronger version of my [assume humans know what the AI knows as the baseline]. I'm now thinking that neither of these will work, for much the same reason. (see below)

The way I'm imagining spikes within S is like this:
We define a pretty broad S, presumably implicitly, hoping to give ourselves a broad range of non-obstruction.

For all P in U we later conclude that our actual goals are in T ⊂ U ⊂ S.
We optimize for AU on T, overlooking some factors that are important for P in U \ T.
We do better on T than we would have by optimising more broadly over U (we can cut corners in U \ T).
We do worse on U \ T since we weren't directly optimising for that set (AU on U \ T varies quite a lot).
We then get an AU spike within U, peaking on T.

The reason I don't think telling the AI something like "our goals aren't on a spike" will help, is that this would not be a statement about our goals, but about our understanding and competence. It'd be to say that we never optimise for a goal set we mistakenly believe includes our true goals (and that we hit what we aim for similarly well for any target within S).

It amounts to saying something like "We don't have blind-spots", "We won't aim for the wrong target", or, in the terms above, "We will never mistake any T for any U".
In this context, this is stronger and more general than my suggestion of "assume for the baseline that we know everything you know". (lack of that knowledge is just one way to screw up the optimisation target)

In either case, this is equivalent to telling the AI to assume an unrealistically proficient/well-informed pol.
The issue is that, as far as non-obstruction is concerned, the AI can then take actions which have arbitrarily bad consequences for us if we don't perform as well as pol.
I.e. non-obstruction then doesn't provide any AU guarantee if our policy isn't actually that good.

My current intuition is that anything of the form "assume our goals aren't on a spike", "assume we know everything you know"... only avoids creating other serious problems if it's actually true - since then the AI's prediction of pol's performance isn't unrealistically high.

Even for "we know everything you know", that's a high bar if it has to apply when the AI is off.
For "our goals aren't on a spike", it's an even higher bar.

If we could actually make it true that our goals weren't on a spike in this sense, that'd be great.
I don't see any easy way to do that.
[Perhaps the ability to successfully optimise for S already puts such high demands on our understanding that distinguishing Ts from Us is comparatively easy.... That seems unlikely to me.]

Comment by Joe_Collman on A Critique of Non-Obstruction · 2021-02-04T02:49:20.290Z · LW · GW

Thinking of corrigibility, it's not clear to me that non-obstruction is quite what I want.
Perhaps a closer version would be something like:
A non-obstructive AI on S needs to do no worse for each P in S than pol(P | off & humans have all the AI's knowledge)

This feels a bit patchy, but in principle it'd fix the most common/obvious issue of the kind I'm raising: that the AI would often otherwise have an incentive to hide information from the users so as to avoid 'obstructing' them when they change their minds.

I think this is more in the spirit of non-obstruction, since it compares the AI's actions to a fully informed human baseline (I'm not claiming it's precise, but in the direction that makes sense to me). Perhaps the extra information does smooth out any undesirable spikes the AI might anticipate.

I do otherwise expect such issues to be common.
But perhaps it's usually about the AI knowing more than the humans.

I may well be wrong about any/all of this, but (unless I'm confused), it's not a quibble about edge cases.
If I'm wrong about default spikiness, then it's much more of an edge case.


(You're right about my P, -P example missing your main point; I just meant it as an example, not as a response to the point you were making with it; I should have realised that would make my overall point less clear, given that interpreting it as a direct response was natural; apologies if that seemed less-than-constructive: not my intent)

Comment by Joe_Collman on A Critique of Non-Obstruction · 2021-02-03T22:29:14.132Z · LW · GW

If pol(P) sucks, even if the AI is literally corrigible, we still won't reach good outcomes.

If pol(P) sucks by default, a general AI (corrigible or otherwise) may be able to give us information I which:
Makes Vp(pol(P)) much higher, by making pol(P) given I suck a whole lot less.
Makes Vq(pol(Q)) a little lower, by making pol(Q) given I make concessions to allow pol(P) to perform better.

A non-obstructive AI can't do that, since it's required to maintain the AU for pol(Q).

A simple example is where P and Q currently look the same to us - so our pol(P) and pol(Q) have the same outcome [ETA for a long time at least, with potentially permanent AU consequences], which happens to be great for Vq(pol(Q)), but not so great for Vp(pol(P)).

In this situation, we want an AI that can tell us:
"You may actually want either P or Q here. Here's an optimisation that works 99% as well for Q, and much better than your current approach for P. Since you don't currently know which you want, this is much better than your current optimisation for Q: that only does 40% as well for P."

A non-obstructive AI cannot give us that information if it predicts it would lower Vq(pol(Q)) in so doing - which it probably would.
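
A toy version of this example, to make the trade-off explicit. The 99% and 40% figures are the ones in the quote above; the 90% figure standing in for "much better ... for P" and the 50/50 credence over P and Q are placeholder assumptions of mine, and I'm treating our current plan as the baseline for the non-obstruction check. The function names are illustrative.

```python
# Attainable utility (AU) as a fraction of the best achievable, per goal.
current = {"Q": 1.00, "P": 0.40}      # our current optimisation for Q
alternative = {"Q": 0.99, "P": 0.90}  # P's 0.90 is an assumed placeholder

credence = {"Q": 0.5, "P": 0.5}       # assumed: we can't tell P from Q


def expected_au(plan: dict, credence: dict) -> float:
    """Expected AU of a plan under uncertainty about our true goal."""
    return sum(credence[goal] * plan[goal] for goal in plan)


def is_non_obstructive(new: dict, baseline: dict) -> bool:
    """True if the new plan does no worse than the baseline on every goal."""
    return all(new[goal] >= baseline[goal] for goal in baseline)


print(expected_au(current, credence))            # about 0.7
print(expected_au(alternative, credence))        # about 0.945
print(is_non_obstructive(alternative, current))  # False
```

Under these made-up numbers the alternative is clearly better in expectation, yet it fails the per-goal check because of the 1% loss on Q.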

Does non-obstruction rule out lowering Vq(pol(Q)) in this way?
If not, I've misunderstood you somewhere.
If so, that's a problem.

I'm not sure I understand the distinction you're making between a "conceptual frame", and a "constraint under which...".

[Non-obstruction with respect to a set S] must be a constraint of some kind.
I'm simply saying that there are cases where it seems to rule out desirable behaviour - e.g. giving us information that allows us to trade a small potential AU penalty for a large potential AU gain, when we're currently uncertain over which is our true payoff function.

Anyway, my brain is now dead. So I doubt I'll be saying much intelligible before tomorrow (if the preceding even qualifies :)).

Comment by Joe_Collman on A Critique of Non-Obstruction · 2021-02-03T20:10:01.484Z · LW · GW

Ok, I think I'm following you (though I am tired, so who knows :)).

For me the crux seems to be:
We can't assume in general that pol(P) isn't terrible at optimising for P. We can "do our best" and still screw up catastrophically.

If assuming "pol(P) is always a good optimiser for P" were actually realistic (and I assume you're not!), then we wouldn't have an alignment problem: we'd be assuming away any possibility of making a catastrophic error.

If we just assume "pol(P) is always a good optimiser for P" for the purpose of non-obstruction definitions/calculations, then our AI can adopt policies of the following form:

Take actions with the following consequences:

If (humans act according to a policy that optimises well for P) then humans are empowered on P
Otherwise, consequences can be arbitrarily bad

Once the AI's bar on the quality of pol is high, we have no AU guarantees at all if we fail to meet that bar.
This seems like an untenable approach to me, so I'm not assuming that pol is reliably/uniformly good at optimising.

So e.g. in my diagram, I'm assuming that for every P in S, humans screw up and accidentally create the 80s optimiser (let's say the 80s optimiser was released prematurely through an error).
That may be unlikely: the more reasonable proposition would be that this happens for some subset T of S larger than simply P = 80s utopia.
If for all P in T, pol(P) gets you 80s utopia, that will look like a spike on T peaking at P = 80s utopia.

The maximum of this spike may only be achievable by optimising early for 80s utopia (before some period of long reflection that allows us to optimise well across T).

However, once this spike is present for P = 80s utopia, our AI is required by non-obstruction to match that maximum for P = 80s utopia. If it's still possible that we do want 80s utopia when the premature optimisation would start under pol(P) for P in T, the AI is required to support that optimisation - even if the consequences across the rest of T are needlessly suboptimal (relative to what would be possible for the AI; clearly they still improve on pol, because pol wasn't good).

To assume that my claim (2) doesn't hold is to assume that there's no subset T of S where this kind of thing happens by default. That seems unlikely to me - unless we're in a world where the alignment problem gets solved very well without non-obstruction.
For instance, this can happen if you have a payoff function on T which accidentally misses out some component that's valuable-but-not-vital over almost all of T, but zero for one member. You may optimise hard for the zero member, sacrificing the component you missed out, and later realise that you actually wanted the non-zero version.

Personally, I'd guess that this kind of thing would happen over many such subsets, so you'd have a green line with a load of spikes, each negatively impacting a very small part of the line as a trade-off to achieve the high spike.

To take your P vs -P example, the "give money then shut off" only reliably works if we assume pol(P) and pol(-P) are sufficiently good optimisers. (though probably the bar isn't high here)

To take a trivial (but possible) example of its going wrong, imagine that pol(-P) involves using software with some hidden absolute value call that inadvertently converts -P optimisation into P optimisation.
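
A minimal sketch of the kind of bug I have in mind. Everything here is hypothetical (the payoff, the optimiser, the candidate set); it only illustrates how a stray absolute value can flip which goal gets optimised.

```python
def p_value(x: int) -> int:
    """Toy payoff P: higher x is better for P, worse for -P."""
    return x


def minus_p(x: int) -> int:
    """The goal that pol(-P) means to pursue."""
    return -p_value(x)


def correct_optimise(objective, candidates):
    """Pick the candidate with the highest objective value."""
    return max(candidates, key=objective)


def buggy_optimise(objective, candidates):
    """Same search, but a stray abs() discards the objective's sign
    (the hypothetical hidden bug)."""
    return max(candidates, key=lambda x: abs(objective(x)))


candidates = range(0, 11)

print(correct_optimise(minus_p, candidates))  # 0: best outcome for -P
print(buggy_optimise(minus_p, candidates))    # 10: best outcome for P
```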

Now giving the money doesn't work, since it makes things worse for V(-p)(pol(-P)).
The AI can shut off without doing anything, but it can't necessarily do the helpful thing: saying "Hang on a bit and delay optimisation: you need to fix this absolute value bug", unless that delay doesn't cost anything for P optimisation.
This case is probably ok either with a generous epsilon, or the assumption that the AI has the capacity to help either optimisation similarly. But in general there'll be problems of similar form which aren't so simply resolved.

Here I don't like the constraint not to sacrifice a small amount of pol(P) value for a huge amount of pol(Q) value.

Hopefully that's clear. Perhaps I'm still missing something, but I don't see how assuming pol makes no big mistakes gets you very far (the AI is then free to optimise to put us on a knife-edge between chasms, and 'blame' us for falling). Once you allow pol to be a potentially catastrophically bad optimiser for some subset of S, I think you get the problems I outline in the post. I don't think strategy-stealing is much of an issue where pol can screw up badly.

That's the best I can outline my current thinking.
If I'm still not seeing things clearly, I'll have to rethink/regroup/sleep, since my brain is starting to struggle.

Comment by Joe_Collman on Non-Obstruction: A Simple Concept Motivating Corrigibility · 2021-02-03T17:14:54.420Z · LW · GW

I just saw this recently. It's very interesting, but I don't agree with your conclusions (quite possibly because I'm confused and/or overlooking something). I posted a response here.
The short version being:

Either I'm confused, or your green lines should be spikey.
Any extreme green line spikes within S will be a problem.
Pareto is a poor approach if we need to deal with default tall spikes.

Comment by Joe_Collman on A Critique of Non-Obstruction · 2021-02-03T16:42:44.140Z · LW · GW

Oh it's possible to add up a load of spikes [ETA suboptimal optimisations], many of which hit the wrong target, but miraculously cancel out to produce a flat landscape [ETA "spikes" was just wrong; what I mean here is that you could e.g. optimise for A, accidentally hit B, and only get 70% of the ideal value for A... and counterfactually optimise for B, accidentally hit C, and only get 70% of the ideal value for B... and counterfactually aim for C, hit D etc. etc. so things end up miraculously flat; this seems silly because there's no reason to expect all misses to be of similar 'magnitude', or to have the same impact on value]. It's just hugely unlikely. To expect this would seem silly.

[ETA My point is that in practice we'll make mistakes, that the kind/number/severity of our mistakes will be P dependent, and that a pol which assumes away such mistakes isn't useful (at least I don't see how it'd be useful).
Throughout I'm assuming pol(P) isn't near-optimal for all P - see my response above for details]

For non-spikiness, you don't just need a world where we never use powerful AI: you need a world where powerful [optimisers for some goal in S] of any kind don't occur. It's not clear to me how you cleanly/coherently define such a world.
The counterfactual where "this system is off" may not be easy to calculate, but it's conceptually simple.
The counterfactual where "no powerful optimiser for any P in S ever exists" is not. In particular, it's far from clear that iterated improvements of biological humans with increased connectivity don't get you an extremely powerful optimiser - which could (perhaps mistakenly) optimise for something spikey.
Ruling everything like this out doesn't seem to land you anywhere natural or cleanly defined.

Then you have the problem of continuing non-obstruction once many other AIs already exist:
You build a non-obstructive AI, X, using a baseline of no-great-P-in-S-optimisers-ever.
It allows someone else to build Y, a narrow-subset-of-S optimiser (since this outperforms the baseline too).
Y takes decisions to lock in the spike it's optimising for, using irreversible-to-humans actions.
Through non-obstruction at this moment X must switch its policy to enforce the locked-in spike, or shut down. (this is true even if X has the power to counter Y's actions)

Perhaps there's some clean way to take this approach, but I'm not seeing it.
If what you want is to outperform some moderate, flat baseline, then simply say that directly.
Trying to achieve a flat baseline by taking a convoluted counterfactual seems foolish.

Fundamentally, I think setting up an AI with an incentive to prefer (+1, 0, 0, ... , 0) over (-1, +10, +10, ..., +10), is asking for trouble. Pretty-much regardless of the baseline, a rule that says all improvement must be Pareto improvement is just not what I want.
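
To spell out the comparison, here is a small sketch. The vectors are changes in AU across goals relative to the baseline; I've filled in the elided entries with an arbitrary length of five purely for illustration, and the function name is mine.

```python
def is_pareto_improvement(change: tuple) -> bool:
    """No component gets worse and at least one gets better."""
    return all(c >= 0 for c in change) and any(c > 0 for c in change)


# Changes in AU across goals, relative to the baseline.
# Five goals is an arbitrary illustrative choice.
narrow = (+1, 0, 0, 0, 0)
broad = (-1, +10, +10, +10, +10)

print(is_pareto_improvement(narrow), sum(narrow))  # True 1
print(is_pareto_improvement(broad), sum(broad))    # False 39
```

A Pareto-only rule accepts the first change and rejects the second, despite the second being far better in total.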

Comment by Joe_Collman on Optimal play in human-judged Debate usually won't answer your question · 2021-01-31T01:25:15.103Z · LW · GW

Sure - there are many ways for debate to fail with extremely capable debaters. Though most of the more exotic mind-hack-style outcomes seem a lot less likely once you're evaluating local nodes with ~1000 characters for each debater.

However, all of this comes under my:
I’ll often omit the caveat “If debate works as intended aside from this issue…”

There are many ways for debate to fail. I'm pointing out what happens even if it works.
I.e. I'm claiming that question-ignoring will happen even if the judge is only ever persuaded of true statements, gets a balanced view of things, and is neither manipulated, nor mind-hacked (unless you believe a response of "Your house is on fire" to "What is 2 + 2?" is malign, if your house is indeed on fire).

Debate can 'work' perfectly, with the judge only ever coming to believe true statements, and your questions will still usually not be answered. (because [believing X is the better answer to the question] and [deciding X should be the winning answer, given the likely consequences] are not the same thing)

The fundamental issue is: [what the judge most wants] is not [the best direct answer to the question asked].

Comment by Joe_Collman on Optimal play in human-judged Debate usually won't answer your question · 2021-01-28T00:59:06.706Z · LW · GW

Oh no need for apologies: I'm certain the post was expressed imperfectly - I was understanding more as I wrote (I hope!). Often the most confusing parts are the most confused.

Since I'm mainly concerned with behaviour-during-training, I don't think the post-training picture is too important to the point I'm making. However, it is interesting to consider what you'd expect to happen after training in the event that the debaters' only convincing "ignore-the-question" arguments are training-signal based.

I think in that case I'd actually expect debaters to stop ignoring the question (assuming they know the training has stopped). I assume that a general, super-human question answerer must be able to do complex reasoning and generalise to new distributions. Removal of the training signal is a significant distributional shift, but one that I'd expect a general question-answerer to handle smoothly (in particular, we're assuming it can answer questions about [optimal debating tactics once training has stopped]).

[ETA: I can imagine related issues with high-value-information bribery in a single debate:
"Give me a win in this branch of the tree, and I'll give you high-value information in another branch", or the like... though it's a strange bargaining situation given that in most setups the debaters have identical information to offer. This could occur during or after training, but only in setups where the judge can give reward before the end of the debate.... Actually I'm not sure on that: if the judge always has the option to override earlier decisions with larger later rewards, then mid-debate rewards don't commit the judge in any meaningful way, so aren't really bargaining chips.
So I don't think this style of bribery would work in setups I've seen.]

Comment by Joe_Collman on Optimal play in human-judged Debate usually won't answer your question · 2021-01-27T16:56:35.437Z · LW · GW

Oh yes - I didn't mean to imply otherwise.

My point is only that there'll be many ways to slide an answer pretty smoothly between [direct answer] and [useful information]. Splitting into [Give direct answer with (k - x) bits] [Give useful information with x bits] and sliding x from 0 to k is just the first option that occurred to me.

In practice, I don't imagine the path actually followed would look like that. I was just sanity-checking by asking myself whether a discontinuous jump is necessary to get to the behaviour I'm suggesting: I'm pretty confident it's not.

Comment by Joe_Collman on Optimal play in human-judged Debate usually won't answer your question · 2021-01-27T15:33:42.567Z · LW · GW

My expectation is that they'd select the more useful result on the basis that it sends a signal to produce useful results in the future - and that a debater would specifically persuade them to do this (potentially over many steps).

I see the situation as analogous to this:
The question-creators, judge and debaters are in the same building.
The building is on fire, in imminent danger of falling off a cliff, at high risk of enraged elephant stampede...
The question-creators, judge and debaters are ignorant of or simply ignoring most such threats.
The question creators have just asked the question "What time should we have lunch?".
Alice answers "There's a fire!!...", persuades the judge that this is true, and that there are many other major threats.
Bob answers "One o'clock would be best...".

There's no need for complex/exotic decision-theoretic reasoning on the part of the judge to conclude:
"The policy which led the debater to inform me about the fire is most likely to point out other threats in future. The actual question is so unimportant relative to this that answering it is crazy. I want to send a training signal encouraging the communication of urgent, life-saving information, and discouraging the wasting of time on trivial questions while the building burns."

Or more simply the judge can just think: "The building's on fire!? Why are you still talking to me about lunch?? I'm picking the sane answer."

Of course the judge doesn't need to come up with this reasoning alone - just to be persuaded of it by a debater. I'm claiming that the kind of judge who'll favour "One o'clock would be best..." while the building burns is a very rare human (potentially non-existent?), and not one whose values we'd want having a large impact.

More fundamentally, to be confident the QIA fails and that you genuinely have a reliable question-answerer, you must be confident that there (usually) exists no compelling argument in favour of a non-answer. I happen to think the one I've given is pretty good, but it'd be a leap from doubting that one to being confident that no compelling similar argument exists.

Comment by Joe_Collman on Debate Minus Factored Cognition · 2020-12-31T01:36:40.742Z · LW · GW

Perhaps I'm misunderstanding you somewhere, but it seems that the requirement you need on defeaters is much stronger than the No Indescribable Hellworld hypothesis.

For a bad argument to be seen to be flawed by a human, we need:
1) A flaw in the argument to be describable to the human in some suitably simplified form.
2) The human to see that this simplified description actually applies to the given argument.

Something like the NIHH seems to give you (1), not (2).
I don't see any reason to think that (2) will apply in general, but you do seem to require it.

Or am I missing your point?

Comment by Joe_Collman on Covid 12/24: We’re F***ed, It’s Over · 2020-12-24T17:19:02.793Z · LW · GW

Agreed. To be fair to Zvi, he did make clear the sense in which he's talking about "value" (those who value them most, as measured by their willingness to pay) [ETA: "their willingness and ability to pay" may have been better], but I fully agree that it's not what most people mean intuitively by value.

I think what people intuitively mean is closer to:
I value X more than you if:
1) I'd pay more for X in my situation than you'd pay for X in my situation.
2) I'd pay more for X in your situation than you'd pay for X in your situation.

(more generally, you could do some kind of summation over all situations in some domain)

The trouble, of course, is that definitions along these lines don't particularly help in constructing efficient systems. (but I don't think anyone was suggesting that they do)

Comment by Joe_Collman on We desperately need to talk more about human challenge trials. · 2020-12-23T00:14:13.145Z · LW · GW

If (2) and (3) were seriously considered, then I'd think you'd particularly want to avoid using only a single vaccine. From a civilizational point of view, the largest issue isn't the expectation of the direct outcome - it's that there's a small chance you may have a bad outcome with very little variance across the population.

I'd be much less concerned about doing (2) or (3) with twenty different vaccines than with one.

Comment by Joe_Collman on Covid 12/17: The First Dose · 2020-12-17T21:47:00.112Z · LW · GW

It's also worth looking at the next table for Moderna one-dose severe-COVID-prevention efficacy:
Vaccine group: 2 / 996
Control group: 4 / 1079
Efficacy: 42.6% (-300.8, 94.8) [95% CI]
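
As a rough cross-check on those figures, here's a minimal sketch of the naive calculation: one minus the ratio of the raw attack rates in the two groups. The function name `naive_efficacy` is my own, and it uses participant counts as denominators rather than the person-time figures the reported number presumably relies on, so it shouldn't be expected to reproduce 42.6% exactly.

```python
def naive_efficacy(vaccine_cases, vaccine_n, control_cases, control_n):
    """Naive vaccine efficacy: 1 - (vaccine attack rate / control attack rate).

    Assumption: raw participant counts are used as denominators, not
    surveillance time in person-years, so this only approximates the
    reported figure.
    """
    vaccine_rate = vaccine_cases / vaccine_n
    control_rate = control_cases / control_n
    return 1 - vaccine_rate / control_rate


# Counts from the one-dose severe-COVID figures quoted above.
ve = naive_efficacy(2, 996, 4, 1079)
print(f"Naive efficacy: {ve:.1%}")
```

With only six cases in total, a single extra case in either group shifts this estimate a long way, which is what the very wide confidence interval reflects.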

Huge error bars and little data, but it certainly doesn't support a guess of ~80% efficacy at preventing severe cases. In the end it's the transmission that matters, but I suppose there's a danger based on public perception: if one dose turns out to have under 50% efficacy for severe cases, it's not going to make anyone feel safe. If the sub-50% figure applies to deaths too, then you'll have many reports of "X took the vaccine, caught Covid and then died".
I assume Moderna wouldn't be crazy about this either. Not great PR if everyone broadly remembers that vaccines stopped Covid, but specifically remembers that Moderna's failed to save their friend's granny.

While there's short supply, it doesn't particularly matter if a load of people don't want to take it. Once there's a large supply, that changes - and if there's a largely baked-in misperception that the vaccine(s) suck(s), it's likely to be unhelpful.

In some sense it's analogous to the mask situation:
[Take action likely to reduce confidence in X] ---> [Free up supply of X to allow efficient targeting] ---> [Suffer consequences of longer-term low confidence in X]

Here the confidence-reducing action wouldn't be a lie, but that's not the only consideration.

Comment by Joe_Collman on Covid 12/17: The First Dose · 2020-12-17T20:54:49.406Z · LW · GW

Ah yes, I think you're right.
To me it seems that one-dose efficacy is approx 80% from that table, and two-dose efficacy is still the old approx 95%. So it's more like an 80% to 95% upgrade than 87% to 97%.

Zvi's main point likely still stands, but the personal immunity question is less clear. [ETA: even on a population level it's somewhat less clear, once you consider the confidence intervals: given a 55% to 92% CI, one-shot efficacy could turn out to be below 70%, in which case things depend a lot on the homogeneity of populations, the precision of your targeting, and post-vaccination behaviour changes.]

Comment by Joe_Collman on Covid 12/17: The First Dose · 2020-12-17T19:40:11.199Z · LW · GW

My best guess on that table, looking at the full report (caveat: I am emphatically not an expert):
1) The VE calculations look correct: they're almost precisely what I get by division of my naïve incidence rate calculations. I assume the small discrepancy is due to the data's being discrete: if you have 7 cases out of 996, your best prediction of incidence rates won't be precisely 7/996.

2) My guess is that the numbers in brackets in the first two columns aren't percentage rates at all. Rather, they are "Surveillance time in person years for given endpoint across all participants within each group at risk for the endpoint". This description is at the bottom of the table, without any asterisk or similar. I assume that this is an error: there was supposed to be an asterisk linking that description to the bracketed numbers in the first two columns.

This seems plausible for the data: the pre-14-days numbers are under half of the post-14-days numbers, and the median follow-up time was 28 days.
But it's entirely possible that I'm wrong.

Comment by Joe_Collman on Covid 12/17: The First Dose · 2020-12-17T16:39:11.957Z · LW · GW

Thanks again for these.

Typo: " negative to administer the virus".

Comment by Joe_Collman on Small Habits Shape Identity: How I became someone who exercises · 2020-11-26T17:45:53.091Z · LW · GW

It's a good book.

"Influence: the psychology of persuasion" has some useful ideas on identity formation too. In particular, the observation that your brain is looking for explanations for your own actions. When you do X it's likely to use "I'm the kind of person who does X" only if it can't find some strong external reason for you to have done X. The stronger the external motivation, the weaker the influence on your identity.

I think this is another reason the 2-minute approach is likely to be effective. That the 2-minute version doesn't contribute significantly to the outcome is neither a bug nor irrelevant: it's a feature.

It's denying your brain the outcome-based explanation, leaving it with the identity-building explanation.

Comment by Joe_Collman on When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation · 2020-11-10T01:56:06.632Z · LW · GW

Right, but any such trash-car-for-net-win opportunity for Bob will make Alice less likely to make the deal: from her perspective, Bob taking such a win is equivalent to accident/carelessness. In the car case, I'd imagine this is a rare scenario relative to accident/carelessness; in the general case it may not be.

Perhaps a reasonable approach would be to split bills evenly, with each paying 50% and burning an extra k%, where k is given by some increasing function of the total repair cost so far.

I think this gives better incentives overall: with an increasing function, it's dangerous for Alice to hide problems, given that she doesn't know Bob will be careful. It's dangerous for Bob to be careless (or even to drive through swamps for rewards) when he doesn't know whether there are other hidden problems.
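
Here's a minimal sketch of that splitting rule, just to make the incentives concrete. The linear-with-cap form of `burn_fraction` and its parameters are purely illustrative assumptions; all the argument needs is that k increases with the total repair cost so far. I'm also assuming each party's burn is k times the current bill.

```python
def burn_fraction(total_cost_so_far, slope=0.0001, cap=0.5):
    """Hypothetical increasing function k(total): fraction of each bill to burn.

    Assumption: linear in cumulative repair cost, capped at `cap`.
    Any increasing function would serve the same purpose.
    """
    return min(cap, slope * total_cost_so_far)


def split_bill(bill, total_cost_so_far):
    """Return (amount each party pays towards the repair, amount each party burns).

    Each party pays 50% of the bill and burns an extra k * bill, where k
    depends on the total repair cost incurred before this bill.
    """
    k = burn_fraction(total_cost_so_far)
    return 0.5 * bill, k * bill


# Three successive repair bills: the burn per bill grows as costs accumulate.
total = 0.0
history = []
for bill in [1000.0, 1000.0, 1000.0]:
    share, burnt = split_bill(bill, total)
    history.append((share, burnt))
    total += bill
```

Since each bill pushes up the burn on every later bill, neither side knows how expensive the next problem will be: a hidden fault or a careless trip costs more the more trouble there has already been.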


I don't think you can use the "Or donates to a third party they don’t especially like" version: if trust doesn't exist, you can't trust Alice/Bob to tell the truth about which third parties they don't especially like.
You do seem to need to burn the money (and to hope that Alice doesn't enjoy watching piles of money burn).

Comment by Joe_Collman on Covid 10/8: October Surprise · 2020-10-08T19:02:17.954Z · LW · GW

Thanks, particularly for the aerosol FAQ link.

Mostly harmless typo: ...because ‘they don’t expect to test negative.’

Comment by Joe_Collman on Competence vs Alignment · 2020-10-05T22:22:18.550Z · LW · GW

Intuitively it's an interesting question; its meaning and precise answer will depend on how you define things.
Here are a few related thoughts you might find interesting:
Focus post from Thoughts on goal directedness
Online Bayesian Goal Inference
Vanessa Kosoy's behavioural measure of goal directed intelligence

In general, I'd expect the answer to be 'no' for most reasonable formalisations. I'd imagine it'd be more workable to say something like:
The observed behaviour is consistent with this set of capability/alignment combinations.
More than that is asking a lot.

For a simple value function, it seems like there'd be cases where you'd want to say it was clearly misaligned (e.g. a chess AI that consistently gets to winning mate-in-1 positions, then resigns, is 'clearly' misaligned with respect to winning at chess).
This gets pretty shaky for less obvious situations though - it's not clear in general what we'd mean by actions a system could have taken when its code deterministically fails to take them.

You might think in terms of the difficulty of doing what the system does vs optimising the correct value function, but then it depends on the two being difficult along the same axis. (winning at chess and getting to mate-in-one positions are 'clearly' difficult along the same axis, but I don't have a non-hand-waving way to describe such similarity in general)

Things aren't symmetric for alignment though - without other assumptions, you can't get 'clear' alignment by observing behaviour (there are always many possible motives for passing any behavioural test, most of which needn't be aligned).
Though if by "observing its behaviour" you mean in the more complete sense of knowing its full policy, you can get things like Vanessa's construction.