GPT-3: a disappointing paper 2020-05-29T19:06:27.589Z · score: 40 (39 votes)
covid-19 notes, 4/19/20 2020-04-20T05:30:01.873Z · score: 29 (11 votes)
mind viruses about body viruses 2020-03-28T04:20:02.674Z · score: 56 (21 votes)
human psycholinguists: a critical appraisal 2019-12-31T00:20:01.330Z · score: 167 (62 votes)
“embedded self-justification,” or something like that 2019-11-03T03:20:01.848Z · score: 41 (14 votes)
When does rationality-as-search have nontrivial implications? 2018-11-04T22:42:01.452Z · score: 64 (23 votes)


Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-03T03:24:24.245Z · score: 5 (3 votes) · LW · GW

On the reading of the graphs:

All I can say is "I read them differently and I don't think further discussion of the 'right' way to read them would be productive."

Something that might make my perspective clear:

  • when I first read this comment, I thought "whoa, that 'phase change' point seems fair and important, maybe I just wasn't looking for that in the graphs"
  • and then I went back and looked at the graphs and thought "oh, no, that's obviously not distinguishable from noise; that's the kind of non-monotonic bouncing around that I expect when you need more data per plotted point to get a reasonable estimate; that Squad V2 graph looks like the other 5 reading comp graphs except with more noise," etc. etc.

I don't expect this will convince you I'm right, but the distance here seems more about generic "how to interpret plots in papers" stuff than anything interesting about GPT-3.

On this:

I can't think of a coherent model where both of these claims are simultaneously true; if you have one, I'd certainly be interested in hearing what it is.

Roughly, my position is that transformer LMs are very impressive and know all sorts of things, even at small scale, although they know them "less noisily" as the scale grows.

The intended connotation of my stance that "fine-tuning will outperform few-shot" is not "haha, transformers are limited, they will never stand on their own without supervised training, boo transformers!" If anything, it's the opposite:

  • I think transformers have some limits (e.g. physical / spatial stuff). But, already at the 1.5B scale if not before, they display a very-real-if-noisy understanding of the linguistic phenomena probed by most NLP benchmarks.
  • I think fine-tuning has shown itself to be a remarkably effective way to "get at" this knowledge for downstream tasks -- even with small data sets, not far in scale from the "data sets" used in few-shot.
  • So, I don't understand what few-shot gets us in terms of ways to probe transformer understanding (we already had a great one) or as a demo of language understanding (what I see in my own generation experiments, at two orders of magnitude lower, impresses me far more than the few-shot results).

Again, I engage with this stuff foremost as someone who is very impressed by transformer LMs as text generators and has interacted with them a lot in that modality.

So, this all feels a bit like being a dog owner who reads a new paper "demonstrating dogs' capacity for empathy with humans," is unimpressed with its methodology, and finds themselves arguing over what concrete model of "dog empathy" they hold and what it predicts for the currently popular "dog empathy" proxy metrics, with a background assumption that they're some sort of dog-empathy-skeptic.

When in fact -- they believe that of course their dog empathizes with them, and they find the methodology of the paper awkwardly under-equipped to explore this complex, and very clearly real, phenomenon.

I've already seen GPT-2 display vast declarative knowledge and use words in subtle context-dependent ways, and pick up the many-faceted nuances implied in a prompt, and all those things. When I see it again, but with ~100x parameters, and in a contrived experimental setting where ~1.5B models technically fare poorly even if I've seen them do that kind of thing in real life . . . should I be impressed?

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-01T19:53:18.684Z · score: 8 (5 votes) · LW · GW

I agree with you about hype management in general, I think. The following does seem like a point of concrete disagreement:

It sounds like you expected "GPT" to mean something more like "paradigm-breaker" and so you were disappointed, but this feels like a ding on your expectations more than a ding on the paper.

If the paper had not done few-shot learning, and had just reviewed LM task performance / generation quality / zero-shot (note that zero-shot scales up well too!), I would agree with you.

However, as I read the paper, it touts few-shot as this new, exciting capability that only properly emerges at the new scale. I expected that, if any given person found the paper impressive, it would be for this purported newness and not only "LM scaling continues," and this does seem to be the case (e.g. gwern, dxu). So there is a real, object-level dispute over the extent to which this is a qualitative jump.

I'm not sure I have concrete social epistemology goals except "fewer false beliefs" -- that is, I am concerned with group beliefs, but only because they point to which truths will be most impactful to voice. I predicted people would be overly impressed with few-shot, and I wanted to counter that. Arguably I should have concentrated less on "does this deserve the title GPT-3?" and more heavily on few-shot, as I've done more recently.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-06-01T15:48:59.151Z · score: 9 (5 votes) · LW · GW
Are there bits of evidence against general reasoning ability in GPT-3? Any answers it gives that it would obviously not give if it had a shred of general reasoning ability?

In the post I gestured towards the first test I would want to do here -- compare its performance on arithmetic to its performance on various "fake arithmetics." If #2 is the mechanism for its arithmetic performance, then I'd expect fake arithmetic performance which

  • is roughly comparable to real arithmetic performance (perhaps a bit worse but not vastly so)
  • is at least far above random guessing
  • more closely correlates with the compressibility / complexity of the formal system than with its closeness to real arithmetic
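
A minimal sketch of how such a test could be set up, purely as an illustration. The family of operations used here (affine maps of the two operands, optionally reduced by a modulus), the symbol `@`, and the `model_answer` callable that stands in for the language model are all assumptions of this sketch; none of them come from the paper or the post.

```python
import random

def make_op(a, b, c, modulus=None):
    """Return the binary operation x, y -> a*x + b*y + c (optionally mod `modulus`).

    make_op(1, 1, 0) is ordinary addition; other settings give hypothetical
    "fake arithmetics" of similar formal complexity.
    """
    def op(x, y):
        value = a * x + b * y + c
        return value if modulus is None else value % modulus
    return op

def few_shot_prompt(op, symbol, n_examples, rng):
    """Build n_examples solved problems followed by one unsolved query."""
    lines = []
    for _ in range(n_examples):
        x, y = rng.randrange(100), rng.randrange(100)
        lines.append(f"{x} {symbol} {y} = {op(x, y)}")
    qx, qy = rng.randrange(100), rng.randrange(100)
    lines.append(f"{qx} {symbol} {qy} =")
    return "\n".join(lines), op(qx, qy)

def accuracy(model_answer, op, symbol, n_trials=100, n_examples=50, seed=0):
    """Fraction of queries where the model's completion matches the target.

    `model_answer` is a hypothetical callable: prompt string -> completion string.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        prompt, target = few_shot_prompt(op, symbol, n_examples, rng)
        if model_answer(prompt).strip() == str(target):
            correct += 1
    return correct / n_trials

real_add = make_op(1, 1, 0)               # real arithmetic baseline
fake_op = make_op(3, 7, 5, modulus=100)   # one arbitrary fake arithmetic
```

Comparing `accuracy(...)` on `real_add` against a range of fake operations of varying complexity would give the three comparisons listed above; the random-guessing baseline depends on the size of the answer space (for instance 1/100 when the modulus is 100).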

BTW, I want to reiterate that #2 is about non-linguistic general reasoning, the ability to few-shot learn generic formal systems with no relation to English. So the analogies and novel words results seem irrelevant here, although word scramble results may be relevant, as dmtea says.


There's something else I keep wanting to say, because it's had a large influence on my beliefs, but is hard to phrase in an objective-sounding way . . . I've had a lot of experience with GPT-2:

  • I was playing around with fine-tuning soon after 117M was released, and jumped to each of the three larger versions shortly after its release. I have done fine-tuning with at least 11 different text corpora I prepared myself.
  • All this energy for GPT-2 hobby work eventually converged into my tumblr bot, which uses a fine-tuned 1.5B with domain-specific encoding choices and a custom sampling strategy ("middle-p"), and generates 10-20 candidate samples per post which are then scored by a separate BERT model optimizing for user engagement and a sentiment model to constrain tone. It's made over 5000 posts so far and continues to make 15+ / day.
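
For readers unfamiliar with this kind of generate-then-select setup, here is a rough sketch of the selection step only. It is not the bot's actual code: the function name, the two scoring callables, the threshold, and the fallback behavior are all hypothetical, and the generation and "middle-p" sampling steps are not shown.

```python
def select_post(candidates, engagement_score, sentiment_score, min_sentiment=-0.5):
    """Pick one candidate from a batch of generated samples.

    engagement_score and sentiment_score are hypothetical callables
    (text -> float) standing in for the selector model and the sentiment model.
    """
    # Tone constraint: drop candidates the sentiment model scores too low.
    allowed = [c for c in candidates if sentiment_score(c) >= min_sentiment]
    # Assumed fallback: if the filter rejects everything, use the full batch.
    pool = allowed or candidates
    # Among what remains, take the candidate with the best predicted engagement.
    return max(pool, key=engagement_score)
```

The design choice illustrated is that the tone model acts as a hard filter while the engagement model does the ranking; a weighted combination of the two scores would be another reasonable arrangement.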

So, I think I have a certain intimate familiarity with GPT-2 -- what it "feels like" across the 4 released sizes and across numerous fine-tuning / sampling / etc strategies on many corpora -- that can't be acquired just by reading papers. And I think this makes me less impressed with arithmetic and other synthetic results than some people.

I regularly see my own GPT-2s do all sorts of cool tricks somewhat similar to these (in fact the biggest surprise here is how far you have to scale to get few-shot arithmetic!), and yet there are also difficult-to-summarize patterns of failure and ignorance which are remarkably resistant to scaling across the 117M-to-1.5B range. (Indeed, the qualitative difference across that range is far smaller than I had expected when only 117M was out.) GPT-2 feels like a very familiar "character" to me by now, and I saw that "character" persist across the staged release without qualitative jumps. I still wait for evidence that convinces me 175B is a new "character" and not my old, goofy friend with another lovely makeover.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-31T04:25:36.775Z · score: 13 (10 votes) · LW · GW
what, in my view, are the primary implications of the GPT-3 paper--namely, what it says about the viability of few-shot prediction as model capacity continues to increase

This seems like one crux of our disagreement. If I thought the paper shows a clear trend, with room to grow, toward much greater few-shot learning performance with even bigger models, I would be more impressed with "few-shot + large LM" as an approach.

I don't think it shows that. The clearest evidence on this subject, IMO, is the many plots in their Appendix H. On a large fraction of the individual downstream tasks, few-shot learning has either

  • a scaling trend with a clearly defined shape that is mostly flat by the 175B point, with a remaining gap vs. fine-tuning that seems unlikely to be closed (examples: WiC, MultiRC, ReCoRD, PhysicalQA, OpenBookQA, at least 5 of the 6 reading comprehension tasks, ANLI)
  • a very noisy trend where, due to noise, returns to scale might be large but might just as well be near zero (examples: BoolQ, CB, WSC)

The scaling trend is more encouraging on certain downstream tasks (COPA, ARC, Winogrande, many of the MT tasks), on "less downstream" tasks that essentially probe language modeling skill in a different way (cloze/completion), and on synthetic tasks.

On average, there is a trend toward slow but steady growth with scale (Fig 1.3), but this masks the great across-task variance catalogued above. The scaling picture for few-shot is very different from the scaling picture for LM loss itself, which as catalogued in another OpenAI paper is remarkably smooth and predictable, and which (as GPT-3 shows) continues smoothly to 175B.

I find it difficult to express just what I find unimpressive here without further knowledge of your position. (There is an asymmetry: "there is value in this paper" is a there-exists-an-x claim, while "there is no value in this paper" is a for-all-x claim. I'm not arguing for-all-x, only that I have not seen any x yet.)

All I can do is enumerate and strike out all the "x"s I can think of. Does few-shot learning look promising in the scaling limit?

  • As a tool for humans: no, I expect fine-tuning will always be preferred.
  • As a demonstration that transformers are very generic reasoners: no, we still see a wide spread of task performance despite smooth gains in LM loss, with some of the most distinctive deficits persisting at all scales (common sense physics, cf section 5), and some very basic capabilities only emerging at very large scale and noisily even there (arithmetic).
  • As an AGI component: no. Because few-shot learning on most tasks shows no clear scaling trend toward human level, any role of transformers in AGI will require more effective ways of querying them (such as fine-tuning controlled by another module), or non-transformer models.
Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-31T01:57:20.820Z · score: 11 (10 votes) · LW · GW

Since I'm not feigning ignorance -- I was genuinely curious to hear your view of the paper -- there's little I can do to productively continue this conversation.

Responding mainly to register (in case there's any doubt) that I don't agree with your account of my beliefs and motivations, and also to register my surprise at the confidence with which you assert things I know to be false.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-30T20:50:31.785Z · score: 4 (5 votes) · LW · GW

Perhaps I wasn't clear -- when I cited my experience as an ML practitioner, I did so in support of a claim about whether the stated capabilities of GPT-3 sound useful, not as a point about what those capabilities are.

I don't think the practical value of very new techniques is impossible to estimate. For example, the value of BERT was very clear in the paper that introduced it: it was obvious that this was a strictly better way to do supervised NLP, and it was quickly and widely adopted.

(I suppose it's conceivable that few-shot learning with a large model is "secretly useful" in some way not conveyed in the paper, but that's true of any paper, so if this proves anything then it proves too much.)

A smell test: what do you think your past experience would have predicted about the performance of a 175B-parameter model in advance?

Above I argued this question was orthogonal to my point, but to answer it anyway: I'd certainly predict better performance on LM tasks, as a simple extrapolation of the existing "biggening" research (GPT-2 at 1.5B parameters, Megatron-LM at 8.3B, T5 at 11B, T-NLG at 17B).

For downstream tasks, I'd expect similar scaling: certainly with fine-tuning (given T5's success on SuperGLUE) though GPT-3 was not fine-tuned, and also with unsupervised approaches (zero-shot, few-shot) given the reported scaling of GPT-2 zero-shot with model size (GPT-2 Fig 1).

I also would have predicted that fine-tuning still out-performs unsupervised approaches by a large margin on most tasks, a gap we observe with unsupervised GPT-3 vs. fine-tuned smaller models (presumably comparing to fine-tuned 175B models would yield an even larger gap).

I alluded to all this in the post, as did the GPT-3 authors in their paper: the results demonstrate that existing trends continue up to 175B. As Daniel Kokotajlo says, the new observation confirms an already familiar, though previously untested, prediction.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-29T21:01:55.886Z · score: 14 (13 votes) · LW · GW

It sounds like you think I'm nitpicking relatively minor points while ignoring the main significance of the paper. What do you think that main significance is?

I can see an argument that the value of few-shot LM prediction is its potential flexibility as a generic tool -- it can presumably do many tasks that are not standard benchmarks, weren't in the paper, etc.

Given my past ML experience, this just doesn't sound that promising to me, which may be our disconnect. In practical work I tend to find that a few days' work preparing a supervised dataset on my exact problem domain beats anything I can produce without that dataset. Few-shot learning apparently trades that few days of work for another non-zero time investment (finding the right prompt and output-reading methodology), generally worse performance, and (pending distillation successes) vastly larger compute requirements.

Comment by nostalgebraist on GPT-3: a disappointing paper · 2020-05-29T19:57:18.240Z · score: 14 (9 votes) · LW · GW

If one ignores the "GPT-3" terminology, then yeah, it's a perfectly decent scaling-up-transformers paper similar to the others that have come out in the last few years. (A paper with some flaws, but that's not surprising.)

But, I would be very surprised if there isn't a lot of hype about this paper -- hype largely due to the "GPT-3" term, and the inappropriate expectations it sets. People are naturally going to think "GPT-3" is as much of a step forward as "GPT-2" was, and it isn't. I take a critical tone here in an effort to cut that hype off at the pass.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-13T17:36:34.443Z · score: 3 (2 votes) · LW · GW

I should say first that I completely agree with you about the extreme data inefficiency of many systems that get enthusiastically labeled "AI" these days -- it is a big problem which calls into question many claims about these systems and their displays of "intelligence."

Especially a few years ago (the field has been getting better about this over time), there was a tendency to define performance with reference to some set collection of tasks similar to the training task without acknowledging that broader generalization capacity, and generalization speed in terms of "number of data points needed to learn the general rule," are key components of any intuitive/familiar notion of intelligence. I've written about this in a few places, like the last few sections of this post, where I talk about the "strange simpletons."

However, it's not clear to me that this limitation is inherent to neural nets or to "AI" in the way you seem to be saying. You write:

Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.

If I understand you correctly, you're taking a position that Marcus argued against in The Algebraic Mind. I'm taking Marcus' arguments there largely as a given in this post, because I agree with them and because I was interested specifically in the way Marcus' Algebraic Mind arguments cut against Marcus' own views about deep learning today.

If you want to question the Algebraic Mind stuff itself, that's fine, but if so you're disagreeing with both me and Marcus more fundamentally than (I think) Marcus and I disagree with one another, and you'll need a more fleshed-out argument if you want to bridge a gulf of this size.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T20:07:54.391Z · score: 12 (3 votes) · LW · GW

The appearance of a disagreement in this thread seems to hinge on an ambiguity in the phrase "word choice."

If "word choice" just means something narrow like "selecting which noun you want to use, given that you are picking the inhabitant of a 'slot' in a noun phrase within a structured sentence and have a rough idea of what concept you want to convey," then perhaps priming and other results about perceptions of "word similarity" might tell us something about how it is done. But no one ever thought that kind of word choice could scale up to full linguistic fluency, since you need some other process to provide the syntactic context. The idea that syntax can be eliminatively reduced to similarity-based choices on the word level is a radical rejection of linguistic orthodoxy. Nor does anyone (as far as I'm aware) believe GPT-2 works like this.

If "word choice" means something bigger that encompasses syntax, then priming experiments about single words don't tell us much about it.

I do take the point that style as such might be a matter of the first, narrow kind of word choice, in which case GPT-2's stylistic fluency is less surprising than its syntactic fluency. In fact, I think that's true -- intellectually, I am more impressed by the syntax than the style.

But the conjunction of the two impresses me to an extent greater than the sum of its parts. Occam's Razor would have us prefer one mechanism to two when we can get away with it, so if we used to think two phenomena required very different mechanisms, a model that gets both using one mechanism should make us sit up and pay attention.

It's more a priori plausible that all the distinctive things about language are products of a small number of facts about brain architecture, perhaps adapted to do only some of them with the rest arising as spandrels/epiphenomena -- as opposed to needing N architectural facts to explain N distinctive things, with none of them yielding predictive fruit beyond the one thing it was proposed to explain. So, even if we already had a (sketch of a) model of style that felt conceptually akin to a neural net, the fact that we can get good style "for free" out of a model that also does good syntax (or, if you prefer, good syntax "for free" out of a model that also does good style) suggests we might be scientifically on the right track.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T19:32:47.925Z · score: 4 (2 votes) · LW · GW

For the full argument from Marcus, read the parts about "training independence" in The Algebraic Mind ch. 2, or in the paper it draws from, "Rethinking Eliminative Connectionism."

The gist is really simple, though. First, note that if some input node is always zero during training, that's equivalent to it not being there at all: its contribution to the input of any node in the first hidden layer is the relevant weight times zero, which is zero. Likewise, the gradient of anything w/r/t that node's outgoing weights is zero (because you'll always multiply by zero when doing the chain rule), so they'll never get updated from their initial values.

Then observe that, if the node instead takes some nonzero constant value during training, its connections add a constant to the first hidden layer inputs instead of zero. But we already have a parameter for an additive constant in a hidden layer input: the "bias." So if the input node is supposed to carry some information, the network still can't learn what it is; it just thinks it's updating the bias. (Indeed, you can go the other way and rewrite the bias as an extra input node that's always constant, or as N such nodes.)
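
The two cases so far can be checked numerically on a single unit. This is only a sketch under assumptions of my own choosing (one sigmoid unit, squared-error loss, arbitrary made-up weights and inputs); the argument itself does not depend on any of these choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grads(w, b, x, y):
    """Gradients of 0.5 * (sigmoid(w.x + b) - y)**2 for a single unit.

    Returns (dL/dw as a list, dL/db).
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = sigmoid(z)
    delta = (a - y) * a * (1.0 - a)          # dL/dz via the chain rule
    return [delta * xi for xi in x], delta   # dL/dw_i = delta * x_i, dL/db = delta

w, b = [0.3, -0.8, 0.5], 0.1  # arbitrary illustrative parameters

# Case 1: the last input is always zero, so its weight gradient is zero
# and that weight never moves from its initial value.
gw_zero, gb_zero = grads(w, b, [1.2, -0.7, 0.0], y=1.0)

# Case 2: the last input is always the constant k, so its weight gradient
# is exactly k times the bias gradient: it is updated like a second bias.
k = 2.0
gw_const, gb_const = grads(w, b, [1.2, -0.7, k], y=1.0)
```

In case 2 the weight on the constant input and the bias receive proportional updates on every example, which is the sense in which the network "just thinks it's updating the bias."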

The argument for constant outputs is even simpler: the network will just set the weights and bias to something that always yields the right constant. For example, it'd work to set the weights to zero and the bias to $f^{-1}(c)$, where $f$ is the activation function and $c$ is the constant. If the output has any relationship to the input then this is wrong, but the training data plus the update rule give you no reason to reject it.
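
As a concrete instance of the constant-output case, here is a small sketch assuming a sigmoid activation, whose inverse is the logit function. The target constant 0.25 and the test inputs are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(c):
    """Inverse of the sigmoid, valid for 0 < c < 1."""
    return math.log(c / (1.0 - c))

target = 0.25             # the constant the output node always took in training
weights = [0.0, 0.0, 0.0]
bias = logit(target)      # the bias whose activation is exactly the target

def unit(x):
    """Output unit with zero weights: it ignores x and returns sigmoid(bias)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

outputs = [unit([5.0, -3.0, 100.0]), unit([0.0, 0.0, 0.0])]
```

Both entries of `outputs` equal the target up to floating-point error, whatever the input is, so this solution fits the training data perfectly while encoding no relationship between input and output.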

None of this is controversial and it does indeed become obvious once you think about it enough; this kind of idea is much of the rationale for weight sharing, which sets the weights for constant input nodes using patterns learned from non-constant ones rather than randomly/arbitrarily.

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-05T18:56:36.124Z · score: 2 (2 votes) · LW · GW

Hmm... I think you are technically right, since "compositionality" is typically defined as a property of the way phrases/sentences/etc. in a language relate to their meanings. Since language modeling is a task defined in terms of words, without involving their referents at all, GPT-2 indeed does not model/exhibit this property of the way languages mean things.

But the same applies identically to every property of the way languages mean things! So if this is really the argument, there's no reason to focus specifically on "compositionality." On the one hand, we would never expect to get compositionality out of any language model, whether a "deep learning" model or some other kind. On the other hand, the argument would fail for any deep learning model that has to connect words with their referents, like one of those models that writes captions for images.

If we read the passage I quoted from 2019!Marcus in this way, it's a trivially true point about GPT-2 that he immediately generalizes to a trivially false point about deep learning. I think when I originally read the passage, I just assumed he couldn't possibly mean this, and jumped to another interpretation: he's saying that deep learning lacks the capacity for structured representations, which would imply an inability to model compositionality even when needed (e.g. when doing image captioning as opposed to language modeling).

Fittingly, when he goes on to describe the problem, it doesn't sound like he's talking about meaning but about having flat rather than hierarchical representations:

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure.

In The Algebraic Mind, Marcus critiqued some connectionist models on the grounds that they cannot support "structured representations." Chapter 4 of the book is called "Structured Representations" and is all about this, mostly focused on meaning (he talks a lot about "structured knowledge") but not at all tied to meaning specifically. Syntax and semantics are treated as equally in need of hierarchical representations, equally impossible without them, and equally possible with them.

Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge. So when 2019!Marcus wrote about "compositionality," I assumed that he was making this argument, again, about GPT-2. In that case, GPT-2's proficiency with syntax alone is a relevant datum, because Marcus and conventional linguists believe that syntax alone requires structured representations (as against some of the connectionists, who didn't).

Comment by nostalgebraist on human psycholinguists: a critical appraisal · 2020-01-02T18:53:34.981Z · score: 5 (4 votes) · LW · GW
In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).

I think it's misleading to view "amount of inductive bias" as a one-dimensional scale, with the transformer somewhere "between" CNNs and MLPs. As I said in that post, the move from vanilla MLPs to CNNs involves the introduction of two kinds of constraints/biases at once -- weight sharing between positions, and locality -- and these are two very different things, not just two (perhaps differently sized) injections of "more bias" on our hypothetical 1D bias scale.

For example, locality without weight sharing is certainly conceivable (I can't remember if I've seen it before), but I'd imagine it would do very poorly on text data, because it relaxes the CNN constraint that's appropriate for text while keeping the one that's inappropriate. If you compare that to the transformer, you've got two different ways of relaxing the CNN biases, but one works better and one (I would imagine) works worse. This shows that a given architecture's representational aptness for a given domain isn't just a function of some 1D "amount of inductive bias" in conjunction with data/compute volume; the specific nature of the biases and the domain matter too.
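
To make the two constraints concrete, here is a small one-dimensional sketch. The function names and toy numbers are mine, and it is only meant to show that locality and weight sharing can be varied independently; it says nothing about how either layer would perform on text.

```python
def conv1d(x, kernel):
    """Locality + weight sharing: one kernel reused at every position."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]

def locally_connected1d(x, kernels):
    """Locality without weight sharing: a separate kernel per output position."""
    k = len(kernels[0])
    assert len(kernels) == len(x) - k + 1, "need one kernel per window"
    return [sum(kernels[i][j] * x[i + j] for j in range(k))
            for i in range(len(kernels))]

x = [1, 2, 3, 4]
shared = conv1d(x, [1, -1])
unshared = locally_connected1d(x, [[1, -1], [0, 0], [2, 0]])
tied = locally_connected1d(x, [[1, -1]] * 3)  # tying the kernels recovers conv1d
```

The locally connected layer has as many kernels as output positions, so whatever it learns about a pattern at one position does not transfer to the same pattern elsewhere, which is the constraint I'd expect text to need.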

As a sidenote, most pre-transformer SOTA architectures for text were RNNs, not CNNs. So, having argued above that "moving to a superset" shouldn't be simplified to "reducing some 1D 'bias' variable," I'd also say that "moving to a superset" isn't what happened anyway.

Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?

Disagree. Not that this seems deeply impossible or anything, but it's exactly what people were trying to do for many years before the introduction of the transformer; a lot of work has already gone into this, and now there's less incentive to do it.

On the general topic of transformer vs. CNN/LSTM, as well as the specific topic of my OP, I found the paper linked by steve2152 very interesting.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T08:35:43.544Z · score: 7 (5 votes) · LW · GW

Thanks, the floor/ceiling distinction is helpful.

I think "ceilings as they exist in reality" is my main interest in this post. Specifically, I'm interested in the following:

  • any resource-bound agent will have ceilings, so an account of embedded rationality needs a "theory of having good ceilings"
  • a "theory of having good ceilings" would be different from the sorts of "theories" we're used to thinking about, involving practical concerns at the fundamental desiderata level rather than as a matter of implementing an ideal after it's been specified

In more detail: it's one thing to be able to assess quick heuristics, and it's another (and better) one to be able to assess quick heuristics quickly. It's possible (maybe) to imagine a convenient situation where the theory of each "speed class" among fast decisions is compressible enough to distill down to something which can be run in that speed class and still provide useful guidance. In this case there's a possibility for the theory to tell us why our behavior as a whole is justified, by explaining how our choices are "about as good as can be hoped for" during necessarily fast/simple activity that can't possibly meet our more powerful and familiar notions of decision rationality.

However, if we can't do this, it seems like we face an exploding backlog of justification needs: every application of a fast heuristic now requires a slow justification pass, but we're constantly applying fast heuristics and there's no room for the slow pass to catch up. So maybe a stronger agent could justify what we do, but we couldn't.

I expect helpful theories here to involve distilling-into-fast-enough-rules on a fundamental level, so that "an impractically slow but working version of the theory" is actually a contradiction in terms.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T07:14:45.998Z · score: 1 (1 votes) · LW · GW

I don't understand Thing #1. Perhaps, in the passage you quote from my post, the phrase "decision procedure" sounds misleadingly generic, as if I have some single function I use to make all my decisions (big and small) and we are talking about modifications to that function.

(I don't think that is really possible: if the function is sophisticated enough to actually work in general, it must have a lot of internal sub-structure, and the smaller things it does inside itself could be treated as "decisions" that aren't being made using the whole function, which contradicts the original premise.)

Instead, I'm just talking about the ordinary sort of case where you shift some resources away from doing X to thinking about better ways to do X, where X isn't the whole of everything you do.

Re: Q/A/A1, I guess I agree that these things are (as best I can tell) inevitably pragmatic. And that, as EY says in the post you link, "I'm managing the recursion to the best of my ability" can mean something better than just "I work on exactly N levels and then my decisions at level N+1 are utterly arbitrary." But then this seems to threaten the Embedded Agency programme, because it would mean we can't make theoretically grounded assessments or comparisons involving agents as strong as ourselves or stronger.

(The discussion of self-justification in this post was originally motivated by the topic of external assessment, on the premise that if we are powerful enough to assess a proposed AGI in a given way, it must also be powerful enough to assess itself in that way. And contrapositively, if the AGI can't assess itself in a given way then we can't assess it in that way either.)

Comment by nostalgebraist on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T06:00:20.806Z · score: 9 (5 votes) · LW · GW

I don't see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. "cheap win" strategies that are easily defeated by serious players and hence never come up in serious play.

Comment by nostalgebraist on “embedded self-justification,” or something like that · 2019-11-03T05:29:41.568Z · score: 3 (2 votes) · LW · GW

It's not really about doing well/better in all domains, it's about being able to explain how you can do well at all of the things you do, even if that isn't nearly everything. And making that explanation complete enough to be convincing, as an argument about the real world assessed using your usual standards, while still keeping it limited enough to avoid self-reference problems.

Comment by nostalgebraist on AlphaStar: Impressive for RL progress, not for AGI progress · 2019-11-03T04:24:28.122Z · score: 6 (4 votes) · LW · GW

IIUC the distinction being made is about the training data, granted the assumption that you may be able to distill tree-search-like abilities into a standard NN with supervised learning if you have samples from tree search available as supervision targets in the first place.

AGZ was hooked up to a tree search in its training procedure, so its training signal allowed it to learn not just from the game trees it "really experienced" during self-play episodes but also (in a less direct way) from the much larger pool of game trees it "imagined" while searching for its next move during those same episodes. The former is always (definitionally) available in self-play, but the latter is only available if tree search is feasible.

Comment by nostalgebraist on Embedded World-Models · 2019-07-06T23:19:55.694Z · score: 10 (3 votes) · LW · GW
I think you're confusing behavior with implementation.

I'm definitely not treating these as interchangeable -- my argument is about how, in a certain set of cases, they are importantly not interchangeable.

Specifically, I'm arguing that certain characterizations of ideal behavior cannot help us explain why any given implementation approximates that behavior well or poorly.

I don't understand how the rest of your points engage with my argument. Yes, there is a good reason Solomonoff does a weighted average and not an argmax; I don't see how this affects my argument one way or the other. Yes, fully general theories can be valuable even when they're not practical to apply directly to real problems; I was arguing that a specific type of fully general theory lacks a specific type of practical value, one which people sometimes expect that type of theory to have.

Comment by nostalgebraist on When does rationality-as-search have nontrivial implications? · 2018-11-09T05:33:35.367Z · score: 7 (4 votes) · LW · GW
But it seems like the core strategy--be both doing object-level cognition and meta-level cognition about how you're doing object-level cognitive--is basically the same.
It remains unclear to me whether the right way to find these meta-strategies is something like "start at the impractical ideal and rescue what you can" or "start with something that works and build new features"; it seems like modern computational Bayesian methods look more like the former than the latter.

I'd argue that there's usually a causal arrow from practical lore to impractical ideals first, even if the ideals also influence practice at a later stage. Occam's Razor came before Solomonoff; "change your mind when you see surprising new evidence" came before formal Bayes. The "core strategy" you refer to sounds like "do both exploration and exploitation," which is the sort of idea I'd imagine goes back millennia (albeit not in those exact terms).

One of my goals in writing this post was to formalize the feeling I get, when I think about an idealized theory of this kind, that it's a "redundant step" added on top of something that already does all the work by itself -- like taking a decision theory and appending the rule "take the actions this theory says to take." But rather than being transparently vacuous, like that example, they are vacuous in a more hidden way, and the redundant steps they add tend to resemble legitimately good ideas familiar from practical experience.

Consider the following (ridiculous) theory of rationality: "do the most rational thing, and also, remember to stay hydrated :)". In a certain inane sense, most rational behavior "conforms to" this theory, since the theory is parasitic on whatever existing notion of rationality you had, and staying hydrated is generally a good idea and thus does not tend to conflict with rationality. And whenever staying hydrated is a good idea, one could imagine pointing to this theory and saying "see, there's the hydration theory of rationality at work again." But, of course, none of this should actually count in the "hydration theory's" favor: all the real work is hidden in the first step ("do the most rational thing"), and insofar as hydration is rational, there's no need to specify it explicitly. This doesn't quite map onto the schema, but it captures the way in which I think these theories tend to confuse people.

If the more serious ideals we're talking about are like the "hydration theory," we'd expect them to have the appearance of explaining existing practical methods, and of retrospectively explaining the success of new methods, while not being very useful for generating any new methods. And this seems generally true to me: there's a lot of ensemble-like or regularization-like stuff in ML that can be interpreted as Bayesian averaging/updating over some base space of models, but most of the excitement in ML is in these base spaces. We didn't get neural networks from Bayesian first principles.

Comment by nostalgebraist on Subsystem Alignment · 2018-11-07T17:02:06.988Z · score: 4 (5 votes) · LW · GW

Does "subsystem alignment" cover every instance of a Goodhart problem in agent design, or just a special class of problems that arises when the sub-systems are sufficiently intelligent?

As stated, that's a purely semantic question, but I'm concerned with a more-than-semantic issue here. When we're talking about all Goodhart problems in agent design, we're talking about a class of problems that already comes up in all sorts of practical engineering, and which can be satisfactorily handled in many real cases without needing any philosophical advances. When I make ML models at work, I worry about overfitting and about misalignments between the loss function and my true goals, but it's usually easy to place bounds on how much trouble these things can cause. Unlike humans interacting with "evolution," my models don't live in a messy physical world with porous boundaries; they can only control their output channel, and it's easy to place safety restrictions on the output of that channel, outside the model. This is like "boxing the AI," but my "AI" is so dumb that this is clearly safe. (We could get even clearer examples by looking at non-ML engineers building components that no one would call AI.)

Now, once the subsystem is "intelligent enough," maybe we have something like a boxed AGI, with the usual boxed AGI worries. But it doesn't seem obvious to me that "the usual boxed AGI worries" have to carry over to this case. Making a subsystem strikes me as a more favorable case for "tool AI" arguments than making something with a direct interface to physical reality, since you have more control over what the output channel does and does not influence, and the task may be achievable even with a very limited input channel. (As an example, one of the ML models I work on has an output channel that just looks like "show a subset of these things to the user"; if you replaced it with a literal superhuman AGI, but kept the output channel the same, not much could go wrong. This isn't the kind of output channel we'd expect to hook up to a real AGI, but that's my point: sometimes what you want out of your subsystem just isn't rich enough to make boxing fail, and maybe that's enough.)
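To make the idea of placing safety restrictions on the output channel, outside the model, a little more concrete, here is a minimal sketch. Everything in it is hypothetical (the function name `restrict_output`, the item IDs, the cap on how many items are shown); it stands in for whatever a real system would use. The only point is that the check lives outside the model, so it bounds what any model can emit, however capable.

```python
from typing import Hashable, Iterable, List, Sequence


def restrict_output(
    proposed: Iterable[Hashable],
    candidates: Sequence[Hashable],
    max_items: int,
) -> List[Hashable]:
    """Clamp a model's proposal to "a subset of these things".

    `proposed` is whatever the (untrusted) model emitted; `candidates` is the
    fixed set of items we are willing to show. Anything outside that set is
    silently dropped, duplicates are removed, and at most `max_items` survive.
    """
    allowed = set(candidates)
    seen = set()
    shown: List[Hashable] = []
    for item in proposed:
        if len(shown) >= max_items:
            break
        if item in allowed and item not in seen:
            seen.add(item)
            shown.append(item)
    return shown


# Hypothetical usage: the model tries to emit something off-menu ("zzz").
print(restrict_output(["b", "zzz", "a", "b", "c"], ["a", "b", "c"], max_items=2))
```

Whatever the model proposes, the result is always a short list drawn from `candidates`, which is the sense in which the channel "isn't rich enough" for much to go wrong.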

Comment by nostalgebraist on When does rationality-as-search have nontrivial implications? · 2018-11-06T02:02:20.647Z · score: 9 (3 votes) · LW · GW

I was not aware of these results -- thanks. I'd glanced at the papers on reflective oracles but mentally filed them as just about game theory, when of course they are really very relevant to the sort of thing I am concerned with here.

We have a remaining semantic disagreement. I think you're using "embeddedness" quite differently than it's used in the "Embedded World-Models" post. For example, in that post (text version):

In a traditional Bayesian framework, “learning” means Bayesian updating. But as we noted, Bayesian updating requires that the agent start out large enough to consider a bunch of ways the world can be, and learn by ruling some of these out.

Embedded agents need resource-limited, logically uncertain updates, which don’t work like this.

Unfortunately, Bayesian updating is the main way we know how to think about an agent progressing through time as one unified agent. The Dutch book justification for Bayesian reasoning is basically saying this kind of updating is the only way to not have the agent’s actions on Monday work at cross purposes, at least a little, to the agent’s actions on Tuesday.

Embedded agents are non-Bayesian. And non-Bayesian agents tend to get into wars with their future selves.

The 2nd and 4th paragraphs here are clearly false for reflective AIXI. And the 2nd paragraph implies that embedded agents are definitionally resource-limited. There is a true and important sense in which reflective AIXI can be "embedded" -- that was the point of coming up with it! -- but the Embedded Agency sequence seems to be excluding this kind of case when it talks about embedded agents. This strikes me as something I'd like to see clarified by the authors of the sequence, actually.

I think the difference may be that when we talk about "a theory of rationality for embedded agents," we could mean "a theory that has consequences for agents equally powerful to it," or we could mean something more like "a theory that has consequences for agents of arbitrarily low power." Reflective AIXI (as a theory of rationality) explains why reflective AIXI (as an agent) is optimally designed, but it can't explain why a real-world robot might or might not be optimally designed.

Comment by nostalgebraist on When does rationality-as-search have nontrivial implications? · 2018-11-05T20:44:08.796Z · score: 5 (3 votes) · LW · GW

My argument isn’t specialized to AIXI — note that I also used LIA as an example, which has a weaker R along with a weaker S.

Likewise, if you put AIXI in a world whose parts can do uncomputable things (like AIXI), you have the same pattern one level up. Your S is stronger, with uncomputable strategies, but by the same token, you lose AIXI’s optimality. It’s only searching over computable strategies, and you have to look at all strategies (including the uncomputable ones) to make sure you’re optimal. This leads to a rule R distinct from AIXI, just as AIXI is distinct from a Turing machine.

I guess it’s conceivable that this hits a fixed point at this level or some higher level? That would be abstractly interesting but not very relevant to embeddedness in the kind of world I think I inhabit.

Comment by nostalgebraist on Embedded World-Models · 2018-11-05T16:55:12.672Z · score: 13 (5 votes) · LW · GW
OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.

I've written previously about this kind of argument -- see here (scroll down to the non-blockquoted text). tl;dr we can often describe the same optimum in multiple ways, with each way giving us a different series that approximates the optimum in the limit. Whether any one series does well or poorly when truncated to N terms can't be explained by saying "it's a truncation of the optimum," since they all are; these truncation properties are facts about the different series, not about the optimum. I illustrate with different series expansions for .
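A small sketch of the truncation point, under an assumption worth flagging: the constant used here (π) and the two series (Leibniz and Nilakantha) are chosen purely for illustration, and are not necessarily the ones used in the linked post. Both series converge to the same limit, yet their N-term truncations behave very differently, and nothing about the limit itself tells you which truncation is the good one.

```python
import math
from typing import Tuple


def leibniz_pi(n_terms: int) -> float:
    """pi = 4 * (1 - 1/3 + 1/5 - ...), truncated to n_terms terms."""
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(n_terms))


def nilakantha_pi(n_terms: int) -> float:
    """pi = 3 + 4/(2*3*4) - 4/(4*5*6) + ..., with n_terms terms after the 3."""
    total = 3.0
    for k in range(1, n_terms + 1):
        total += (-1) ** (k + 1) * 4.0 / ((2 * k) * (2 * k + 1) * (2 * k + 2))
    return total


def truncation_errors(n_terms: int) -> Tuple[float, float]:
    """Absolute errors of the two truncations against math.pi."""
    return (
        abs(math.pi - leibniz_pi(n_terms)),
        abs(math.pi - nilakantha_pi(n_terms)),
    )


# Same limit, same number of terms, very different accuracy.
leibniz_err, nilakantha_err = truncation_errors(10)
print(f"Leibniz error:    {leibniz_err:.6f}")
print(f"Nilakantha error: {nilakantha_err:.6f}")
```

Both functions are "truncations of π" in exactly the same sense, so the gap between their errors has to be explained by the structure of each series rather than by the shared limit.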

Furthermore, it seems like there's a pattern where, the more general the algorithmic problem you want to solve is, the more your solution is compelled to resemble some sort of brute-force search.

You may be right, and there are interesting conversations to be had about when solutions will tend to look like search and when they won't. But this doesn't feel like it really addresses my argument, which is not about "what kind of algorithm should you use" but about the weirdness of the injunction to optimize over a space containing every procedure you could ever do, including all of the optimization procedures you could ever do. There is a logical / definitional weirdness here that can't be resolved by arguments about what sorts of (logically / definitionally unproblematic) algorithms are good or bad in what domains.

Comment by nostalgebraist on Embedded World-Models · 2018-11-04T21:31:22.880Z · score: 18 (9 votes) · LW · GW

This post feels quite similar to things I have written in the past to justify my lack of enthusiasm about idealizations like AIXI and logically-omniscient Bayes. But I would go further: I think that grappling with embeddedness properly will inevitably make theories of this general type irrelevant or useless, so that "a theory like this, except for embedded agents" is not a thing that we can reasonably want. To specify what I mean, I'll use this paragraph as a jumping-off point:

Embedded agents don’t have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situated agents which provides foundations that are similarly as strong as the foundations Bayesianism provides for dualistic agents.

Most "theories of rational belief" I have encountered -- including Bayesianism in the sense I think is meant here -- are framed at the level of an evaluator outside the universe, and have essentially no content when we try to transfer them to individual embedded agents. This is because these theories tend to be derived in the following way:

  • We want a theory of the best possible behavior for agents.
  • We have some class of "practically achievable" strategies S, which can actually be implemented by agents. We note that an agent's observations provide some information about the quality of different strategies s in S. So if it were possible to follow a rule R like "find the best s in S given your observations, and then follow that s," this rule would spit out very good agent behavior.
  • Usually we soften this to a performance-weighted average rather than a hard argmax, but the principle is the same: if we could search over all of S, the rule R that says "do the search and then follow what it says" can be competitive with the very best s in S. (Trivially so, since it has access to the best strategies, along with all the others.)
  • But usually R is not itself a member of S. That is, the strategy "search over all practical strategies and follow the best ones" is not a practical strategy. But we argue that this is fine, since we are constructing a theory of ideal behavior. It doesn't have to be practically implementable.

For example, in Solomonoff, S is defined by computability while R is allowed to be uncomputable. In the LIA construction, S is defined by polytime complexity while R is allowed to run slower than polytime. In logically-omniscient Bayes, finite sets of hypotheses can be manipulated in a finite universe but the full Boolean algebra over hypotheses generally cannot.

I hope the framework I've just introduced helps clarify what I find unpromising about these theories. By construction, any agent you can actually design and run is a single element of S (a "practical strategy"), so every fact about rationality that can be incorporated into agent design gets "hidden inside" the individual s in S, and the only things you can learn from the "ideal theory" R are things which can't fit into a practical strategy.

For example, suppose (reasonably) that model averaging and complexity penalties are broadly good ideas that lead to good results. But all of the model averaging and complexity penalization that can be done computably happens inside some Turing machine or other, at the level "below" Solomonoff. Thus Solomonoff only tells you about the extra advantage you can get by doing these things uncomputably. Any kind of nice Bayesian average over Turing machines that can happen computably is (of course) just another Turing machine.
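A toy sketch of the point that any search which can actually be carried out is just another practical strategy. The setup is entirely hypothetical: strategies are functions that predict the next bit of a sequence, the class `S` is a small hand-written list, and `rule_R` is a hard-argmax "follow the best so far" rule rather than a weighted average. Because this `S` is finite and cheap, the search rule has the same type as the strategies it searches over; the idealized theories get their distinctive content only when the search is too expensive to be an ordinary member of the class.

```python
from typing import Callable, List, Sequence

# A "strategy" maps the history of observed bits to a predicted next bit.
Strategy = Callable[[Sequence[int]], int]


def always_zero(history: Sequence[int]) -> int:
    return 0


def always_one(history: Sequence[int]) -> int:
    return 1


def copy_last(history: Sequence[int]) -> int:
    return history[-1] if history else 0


def alternate(history: Sequence[int]) -> int:
    return 1 - history[-1] if history else 0


# The (toy) class of practical strategies.
S: List[Strategy] = [always_zero, always_one, copy_last, alternate]


def score(strategy: Strategy, history: Sequence[int]) -> int:
    """How many past bits the strategy would have predicted correctly."""
    return sum(strategy(history[:t]) == history[t] for t in range(len(history)))


def rule_R(history: Sequence[int]) -> int:
    """Search over all of S, then follow whichever member has done best."""
    best = max(S, key=lambda s: score(s, history))
    return best(history)


# rule_R has exactly the Strategy signature, so it could itself be appended
# to a larger class S' and searched over in turn.
print(rule_R([0, 1, 0, 1, 0, 1]))
```

On the alternating history the rule ends up following `alternate`, but every call pays for running all of `S`; that cost is the only thing separating the rule from the strategy it imitates.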

This also explains why I find it misleading to say that good practical strategies constitute "approximations to" an ideal theory of this type. Of course, since R just says to follow the best strategies in S, if you are following a very good strategy s in S your behavior will tend to be close to that of R. But this cannot be attributed to any of the searching over S that R does, since you are not doing a search over S; you are executing a single member of S and ignoring the others. Any searching that can be done practically collapses down to a single practical strategy, and any that doesn't is not practical. Concretely, this talk of approximations is like saying that a very successful chess player "approximates" the rule "consult all possible chess players, then weight their moves by past performance." Yes, the skilled player will play similarly to this rule, but they are not following it, not even approximately! They are only themselves, not any other player.

Any theory of ideal rationality that wants to be a guide for embedded agents will have to be constrained in the same ways the agents are. But theories of ideal rationality usually get all of their content by going to a level above the agents they judge. So this new theory would have to be a very different sort of thing.

Comment by nostalgebraist on An Untrollable Mathematician Illustrated · 2018-11-04T18:06:45.975Z · score: 15 (6 votes) · LW · GW

This prior isn’t trollable in the original sense, but it is trollable in a weaker sense that still strikes me as important. Since must sum to 1, only finitely many sentences can have for a given . So we can choose some finite set of “important sentences” and control their oscillations in a practical sense, but if there’s any such that we think oscillations across the range are a bad thing, all but finitely many sentences can exhibit this bad behavior.

It seems especially bad that we can only prevent “up-to- trolling” for finite sets of sentences, since in PA (or whatever) there are plenty of countable sets of sentences that seem “essentially the same” (like the ones you get from an induction argument), and it feels very unnatural to choose finite subsets of these and distinguish them from the others, even (or especially?) if we pretend we have no prior knowledge beyond the axioms.

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-11-04T00:18:41.837Z · score: 3 (2 votes) · LW · GW

To quote Abram Demski in “All Mathematicians are Trollable”:

The main concern is not so much whether GLS-coherent mathematicians are trollable as whether they are trolling themselves. Vulnerability to an external agent is somewhat concerning, but the existence of misleading proof-orderings brings up the question: are there principles we need to follow when deciding what proofs to look at next, to avoid misleading ourselves?

My concern is not with the dangers of an actual adversary, it’s with the wild oscillations and extreme confidences that can arise even when logical facts arrive in a “fair” way, so long as it is still possible to get unlucky and experience a “clump” of successive observations that push P(A) way up or down.

We should expect such clumps sometimes unless the observation order is somehow specially chosen to discourage them, say via the kind of “principles” Demski wonders about.

One can also prevent observation order from mattering by doing what the Eisenstat prior does: adopt an observation model that does not treat logical observations as coming from some fixed underlying reality (so that learning “B or ~A” rules out some ways A could have been true), but as consistency-constrained samples from a fixed distribution. This works as far as it goes, but is hard to reconcile with common intuitions about how e.g. P=NP is unlikely because so many “ways it could have been true” have failed (Scott Aaronson has a post about this somewhere, arguing against Lubos Motl who seems to think like the Eisenstat prior), and more generally with any kind of mathematical intuition — or with the simple fact that the implications of axioms are fixed in advance and not determined dynamically as we observe them. Moreover, I don’t know of any way to (approximately) apply this model in real-world decisions, although maybe someone will come up with one.

This is all to say that I don’t think there is (yet) any standard Bayesian answer to the problem of self-trollability. It’s a serious problem and one at the very edge of current understanding, with only some partial stabs at solutions available.

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-08T17:03:05.132Z · score: 8 (5 votes) · LW · GW

Ah, yeah, you're right that it's possible to do this. I'm used to thinking in the Kolmogorov picture, and keep forgetting that in the Jaynesian propositional logic picture you can treat material conditionals as contingent facts. In fact, I went through the process of realizing this in a similar argument about the same post a while ago, and then forgot about it in the meantime!

That said, I am not sure what this procedure has to recommend it, besides that it is possible and (technically) Bayesian. The starting prior, with independence, does not really reflect our state of knowledge at any time, even at the time before we have "noticed" the implication(s). For, if we actually write down that prior, we have an entry in every cell of the truth table, and if we inspect each of those cells and think "do I really believe this?", we cannot answer the question without asking whether we know facts such as A => B -- at which point we notice the implication!

It seems more accurate to say that, before we consider the connection of A to B, those cells are "not even filled in." The independence prior is not somehow logically agnostic; it assigns a specific probability to the conditional, just as our posterior does, except that in the prior that probability is, wrongly, not one.

Okay, one might say, but can't this still be a good enough place to start, allowing us to converge eventually? I'm actually unsure about this, because (see below) the logical updates tend to push the probabilities of the "ends" of a logical chain further towards 0 and 1; at any finite time the distribution obeys Cromwell's Rule, but whether it converges to the truth might depend on the way in which we take the limit over logical and empirical updates (supposing we do arbitrarily many of each type as time goes on).

I got curious about this and wrote some code to do these updates with arbitrary numbers of variables and arbitrary conditionals. What I found is that as we consider longer chains A => B => C => ..., the propositions at one end get pushed to 1 or 0, and we don't need very long chains for this to get extreme. With all starting probabilities set to 0.7 and three variables 0 => 1 => 2, the probability of variable 2 is 0.95; with five variables the probability of the last one is 0.99 (see the plot below). With ten variables, the last one reaches 0.99988. We can easily come up with long chains in the California example or similar, and following this procedure would lead us to absurdly extreme confidence in such examples.

I've also given a second plot below, where all the starting probabilities are 0.5. This shows that the growing confidence does not rely on an initial hunch one way or the other; simply updating on the logical relationships from initial neutrality (plus independences) pushes us to high confidence about the ends of the chain.
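A minimal sketch of the kind of calculation described above. This is not the original code (which is not shown here) but a brute-force reconstruction under stated assumptions: each variable starts with an independent prior, and we condition on every implication in the chain x0 => x1 => ... by discarding the truth assignments that violate one. The function name `chain_posterior` is hypothetical.

```python
from itertools import product
from typing import List, Sequence


def chain_posterior(priors: Sequence[float]) -> List[float]:
    """Posterior marginals after conditioning an independence prior on the
    chain of implications x0 => x1 => ... => x_{n-1}."""
    n = len(priors)
    total = 0.0
    marginals = [0.0] * n
    for world in product([False, True], repeat=n):
        # Discard worlds in which some x_i is true but x_{i+1} is false.
        if any(world[i] and not world[i + 1] for i in range(n - 1)):
            continue
        # Prior weight of this world under independence.
        weight = 1.0
        for value, p in zip(world, priors):
            weight *= p if value else 1.0 - p
        total += weight
        for i, value in enumerate(world):
            if value:
                marginals[i] += weight
    return [m / total for m in marginals]


for n in (3, 5, 10):
    print(n, round(chain_posterior([0.7] * n)[-1], 5))
print(chain_posterior([0.5] * 3))
```

With all priors at 0.5, the surviving assignments are the n + 1 "all false up to some point, all true after it" worlds, each equally likely, so the last variable ends up at n / (n + 1) and the first at 1 / (n + 1). The drift toward the extremes comes from the chain structure alone.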

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-07T16:53:28.635Z · score: 7 (5 votes) · LW · GW

Two comments:

1. You seem to be suggesting that the standard Bayesian framework handles logical uncertainty as a special case. (Here we are not exactly "uncertain" about sentences, but we have to update on their truth from some prior that did not account for it, which amounts to the same thing.) If this were true, the research on handling logical uncertainty through new criteria and constructions would be superfluous. I haven't actually seen a proposal like this laid out in detail, but I think they've been proposed and found wanting, so I'll be skeptical at least until I'm shown the details of such a proposal.

(In particular, this would need to involve some notion of conditional probabilities like P(A | A => B), and perhaps priors like P(A => B), which are not a part of any treatment of Bayes I've seen.)

2. Even if this sort of thing does work in principle, it doesn't seem to help in the practical case at hand. We're now told to update on "noticing" A => B by using objects like P(A | A => B), but these too have to be guessed using heuristics (we don't have a map of them either), so it inherits the same problem it was introduced to solve.

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-07T02:29:30.400Z · score: 5 (3 votes) · LW · GW
You assume a creature that can't see all logical consequences of hypotheses [...] Then you make it realize new facts about logical consequences of hypotheses

This is not quite what is going on in section 7b. The agent isn't learning any new logical information. For instance, in jadagul's "US in 2100" example, all of the logical facts involved are things the agent already knows. " 'California is a US state in 2100' implies 'The US exists in 2100' " is not a new fact, it's something we already knew before running through the exercise.

My argument in 7b is not really about updating -- it's about whether probabilities can adequately capture the agent's knowledge, even at a single time.

This is in a context (typical of real decisions) where:

  • the agent knows a huge number of logical facts, because it can correctly interpret hypotheses written in a logically transparent way, like "A and B," and because it knows lots of things about subsets in the world (like US / California)
  • but, the agent doesn't have the time/memory to write down a "map" of every hypothesis connected by these facts (like a sigma-algebra). For example, you can read an arbitrary string of hypotheses "A and B and C and ..." and know that this implies "A", "A and C", etc., but you don't have in your mind a giant table containing every such construction.

So the agent can't assign credences/probabilities simultaneously to every hypothesis on that map. Instead, they have some sort of "credence generator" that can take in a hypothesis and output how plausible it seems, using heuristics. In their raw form, these outputs may not be real numbers (they will have an order, but may not have e.g. a metric).

If we want to use Bayes here, we need to turn these raw credences into probabilities. But remember, the agent knows a lot of logical facts, and via the probability axioms, these all translate to facts relating probabilities to one another. There may not be any mapping from raw credence-generator-output to probabilities that preserves all of these facts, and so the agent's probabilities will not be consistent.

To be more concrete about the "credence generator": I find that when I am asked to produce subjective probabilities, I am translating them from internal representations like

  • Event A feels "very likely"
  • Event B, which is not logically entailed by A or vice versa, feels "pretty likely"
  • Event (A and B) feels "pretty likely"

If we demand that these map one-to-one to probabilities in any natural way, this is inconsistent. But I don't think it's inconsistent in itself; it just reflects that my heuristics have limited resolution. There isn't a conjunction fallacy here because I'm not treating these representations as probabilities -- but if I decide to do so, then I will have a conjunction fallacy! If I notice this happening, I can "plug the leak" by changing the probabilities, but I will expect to keep seeing new leaks, since I know so many logical facts, and thus there are so many consequences of the probability axioms that can fail to hold. And because I expect this to happen going forward, I am skeptical now that my reported probabilities reflect my actual beliefs -- not even approximately, since I expect to keep deriving very wrong things like an event being impossible instead of likely.
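A small sketch of how a one-to-one mapping from coarse labels to probabilities produces a "leak" of this kind. The numeric values attached to each label are arbitrary assumptions made for illustration; the problem does not depend on them, only on "pretty likely" mapping to a single number.

```python
# Hypothetical one-to-one mapping from coarse labels to probabilities.
LABEL_TO_PROB = {"very likely": 0.9, "pretty likely": 0.7}

# Raw outputs of the "credence generator" for the three events above.
raw_credences = {
    "A": "very likely",
    "B": "pretty likely",
    "A and B": "pretty likely",
}

probs = {event: LABEL_TO_PROB[label] for event, label in raw_credences.items()}

# Additivity requires P(B) = P(A and B) + P(B and not A), so:
p_b_and_not_a = probs["B"] - probs["A and B"]

print(probs)
print("P(B and not A) =", p_b_and_not_a)
```

Because B and (A and B) received the same label, the mapping forces P(B and not A) = 0: it declares "B without A" impossible even though B does not entail A. Nudging one number fixes this particular leak, but every other pair of events that share a label invites the same problem.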

None of this is meant to disapprove of using probability estimates to, say, make more grounded estimates of cost/benefit in real-world decisions. I do find that useful, but I think it is useful for a non-Bayesian reason: even if you don't demand a universal mapping from raw credences, you can get a lot of value out of saying things like "this decision isn't worth it unless you think P(A) > 97%", and then doing a one-time mapping of that back onto a raw credence, and this has a lot of pragmatic value even if you know the mappings will break down if you push them too hard.

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-06T01:38:28.877Z · score: 9 (5 votes) · LW · GW

If I understand your objection correctly, it's one I tried to answer already in my post.

In short: Bayesianism is normative for problems you can actually state in its formalism. This can be used as an argument for at least trying to state problems in its formalism, and I do think this is often a good idea; many of the examples in Jaynes' book show the value of doing this. But when the information you have actually does not fit the requirements of the formalism, you can only use it if you get more information (costly, sometimes impossible) or forget some of what you know to make the rest fit. I don't think Bayes normatively tells you to do those kinds of things, or at least that would require a type of argument different from the usual Dutch Books etc.

Using the word "brain" there was probably a mistake. This is only about brains insofar as it's about the knowledge actually available to you in some situation, and the same idea applies to the knowledge available to some robot you are building, or some agent in a hypothetical decision problem (so long as it is a problem with the same property, of not fitting well into the formalism without extra work or forgetting).

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-06T01:25:16.679Z · score: 8 (5 votes) · LW · GW

I don't disagree with any of this. But if I understand correctly, you're only arguing against a very strong claim -- something like "Bayes-related results cannot possibly have general relevance for real decisions, even via 'indirect' paths that don't rely on viewing the real decisions in a Bayesian way."

I don't endorse that claim, and would find it very hard to argue for. I can imagine virtually any mathematical result playing some useful role in some hypothetical framework for real decisions (although I would be more surprised in some cases than others), and I can't see why Bayesian stuff should be less promising in that regard than any arbitrarily chosen piece of math. But "Bayes might be relevant, just like p-adic analysis might be relevant!" seems like damning with faint praise, given the more "direct" ambitions of Bayes as advocated by Jaynes and others.

Is there a specific "indirect" path for the relevance of Bayes that you have in mind here?

Comment by nostalgebraist on nostalgebraist - bayes: a kinda-sorta masterpost · 2018-09-05T20:22:03.257Z · score: 19 (7 votes) · LW · GW

I disagree that this answers my criticisms. In particular, my section 7 argues that it's practically unfeasible to even write down most practical belief / decision problems in the form that the Bayesian laws require, so "were the laws followed?" is generally not even a well-defined question.

To be a bit more precise, the framework with a complete hypothesis space is a bad model for the problems of interest. As I detailed in section 7, that framework assumes that our knowledge of hypotheses and the logical relations between hypotheses are specified "at the same time," i.e. when we know about a hypothesis we also know all its logical relations to all other hypotheses, and when we know (implicitly) about a logical relation we also have access (explicitly) to the hypotheses it relates. Not only is this false in many practical cases, I don't even know of any formalism that would allow us to call it "approximately true," or "true enough for the optimality theorems to carry over."

(N.B. as it happens, I don't think logical inductors fix this problem. But the very existence of logical induction as a research area shows that this is a problem. Either we care about the consequences of lacking logical omniscience, or we don't -- and apparently we do.)

It's sort of like quoting an optimality result given access to some oracle, when talking about a problem without access to that oracle. If the preconditions of a theorem are not met by the definition of a given decision problem, "meet those preconditions" cannot be part of a strategy for that problem. "Solve a different problem so you can use my theorem" is not a solution to the problem as stated.

Importantly, this is not just an issue of "we can't do perfect Bayes in practice, but if we were able, it'd be better." Obtaining the kind of knowledge representation assumed by the Bayesian laws has computational / resource costs, and in any real decision problem, we want to minimize these. If we're handed the "right" knowledge representation by a genie, fine, but if we are talking about choosing to generate it, that in itself is a decision with costs.

As a side point, I am also skeptical of some of the optimality results.

Comment by nostalgebraist on Optimization Amplifies · 2018-08-19T20:18:19.576Z · score: 2 (2 votes) · LW · GW

I agree. When I think about the "mathematician mindset" I think largely about the overwhelming interest in the presence or absence, in some space of interest, of "pathological" entities like the Weierstrass function. The truth or falsehood of "for all / there exists" statements tends to turn on these pathologies or their absence.

How does this relate to optimization? Optimization can make pathological entities more relevant, if

(1) they happen to be optimal solutions, or

(2) an algorithm that ignores them will be, for that reason, insecure / exploitable.

But this is not a general argument about optimization, it's a contingent claim that is only true for some problems of interest, and in a way that depends on the details of those problems.

And one can make a separate argument that, when conditions like 1-2 do not hold, a focus on pathological cases is unhelpful: if a statement "fails in practice but works in theory" (say by holding except on a set of sufficiently small measure as to always be dominated by other contributions to a decision problem, or only for decisions that would be ruled out anyway for some other reason, or over the finite range relevant for some calculation but not in the long or short limit), optimization will exploit its "effective truth" whether or not you have noticed it. And statements about "effective truth" tend to be mathematically pretty uninteresting; try getting an audience of mathematicians to care about a derivation that rocket engineers can afford to ignore gravitational waves, for example.

Comment by nostalgebraist on Hero Licensing · 2017-12-01T23:24:12.896Z · score: 10 (4 votes) · LW · GW

I think the arguments here apply much better to the AGI alignment case than to the case of HPMOR. The structure of the post suggests (? not sure) that HPMOR is meant to be the "easier" case, the one in which the reader will assent to the arguments more readily, but it didn't work that way on me.

In both cases, we have some sort of metric for what it would mean to succeed, and (perhaps competing) inside- and outside-view arguments for how highly we should expect to score on that metric. (More precisely, what probabilities we should assign to achieving different scores.) In both cases, this post tends to dismiss facts which involve social status as irrelevant to the outside view.

But what if our success metric depends on some facts which involve social status? Then we definitely shouldn't ignore these facts, (even) in the inside view. And this is the situation we are in with HPMOR, at least, if perhaps less so with AGI alignment.

There are some success metrics for HPMOR mentioned in this post which can be evaluated largely without reference to status stuff (like "has it conveyed the experience of being rational to many people?"). But when specific successes -- known to have been achieved in the actual world -- come up, many of them are clearly related to status. If you want to know whether your fic will become one of the most reviewed HP fanfics on a fanfiction site, then it matters how it will be received by the sorts of people who review HP fanfics on those sites -- including their status hierarchies. (Of course, this will be less important if we expect most of the review-posters to be people who don't read HP fanfic normally and have found out about the story through another channel, but its importance is always nonzero, and very much so for some hypothetical scenarios.)

TBH, I don't understand why so much of this post focuses on pure popularity metrics for HPMOR, ones that don't capture whether it is having the intended effect on readers. (Even something like "many readers consider it the best book they've ever read" does not tell you much without specifying more about the readership; consider that if you were optimizing for this metric, you would have an incentive to select for readers who have read as few books as possible.)

I guess the idea may be that it is possible to surprise someone like Pat by hitting a measurable indicator of high status (because Pat thinks that's too much of a status leap relative to the starting position), where Pat would be less surprised by HPMOR hitting idiosyncratic goals that are not common in HP fanfiction (and thus are not high status to him). But this pattern of surprise levels seems obviously correct to me! If you are trying to predict an indicator of status in a community, you should use information about the status system in that community in your inside view. (And likewise, if the indicator is unrelated to status, you may be able to ignore status information.)

In short, this post condemns using status-related facts for forecasting, even when they are relevant (because we are forecasting other status-related facts). I don't mean the next statement as Bulverism, but as a hopefully useful hypothesis: it seems possible that the concept of status regulation has encouraged this confusion, by creating a pattern to match to ("argument involving status and the existing state of a field, to the effect that I shouldn't expect to be capable of something"), even when some arguments matching that pattern are good arguments.

Comment by nostalgebraist on Against Shooting Yourself in the Foot · 2017-11-30T00:33:11.429Z · score: 20 (10 votes) · LW · GW

I just wrote a long post on my tumblr about this sequence, which I am cross-posting here as a comment on the final post. (N.B. my tone is harsher and less conversational than it would have been if I had thought of it as a comment while writing.)

I finally got around to reading these posts.  I wasn’t impressed with them.

The basic gist is something like:

“There are well-established game-theoretic reasons why social systems (governments, academia, society as a whole, etc.) may not find, or not implement, good ideas even when they are easy to find/implement and the expected benefits are great.  Therefore, it is sometimes warranted to believe you’ve come up with a good, workable idea which ‘experts’ or ‘society’ have not found/implemented yet.  You should think about the game-theoretic reasons why this might or might not be possible, on a case-by-case basis; generalized maxims about ‘how much you should trust the experts’ and the like are counterproductive.”

I agree with this, although it also seems fairly obvious to me.  It’s possible that Yudkowsky is really pinpointing a trend (toward an extreme “modest epistemology”) that sounds obviously wrong once it’s pinned down, but is nonetheless pervasive; if so, I guess it’s good to argue against it, although I haven’t encountered it myself.

But the biggest reason I was not impressed is that Yudkowsky mostly ignores an issue which strikes me as crucial.  He makes a case that, given some hypothetically good idea, there are reasons why experts/society might not find and implement it.  But as individuals, what we see are not ideas known to be good.

What we see are ideas that look good, according to the models and arguments we have right now.  There is some cost (in time, money, etc.) associated with testing each of these ideas.  Even if there are many untried good ideas, it might still be the case that these are a vanishingly small fraction of ideas that look good before they are tested.  In that case, the expected value of “being an experimenter” (i.e. testing lots of good-looking ideas) could easily be negative, even though there are many truly good, untested ideas.
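To make the arithmetic concrete, here is a minimal sketch with entirely made-up numbers. The function names, the cost, and the payoff are illustrative assumptions, not anything taken from the posts under discussion. It only shows that the sign of the expected value turns on the fraction of good-looking ideas that are actually good.

```python
def expected_value_per_test(p_good, payoff_if_good, cost_per_test):
    """Expected net value of testing one good-looking idea."""
    return p_good * payoff_if_good - cost_per_test


def break_even_fraction(payoff_if_good, cost_per_test):
    """Fraction of good-looking ideas that must be good for testing to break even."""
    return cost_per_test / payoff_if_good


# Hypothetical numbers, chosen only for illustration.
COST = 600.0        # cost of testing one idea
PAYOFF = 20_000.0   # value of an idea that turns out to be good

for p_good in (0.10, 0.01):
    ev = expected_value_per_test(p_good, PAYOFF, COST)
    print(f"P(good | looks good) = {p_good:.2f} -> EV per test = {ev:+.0f}")

print("break-even fraction:", break_even_fraction(PAYOFF, COST))
```

With these assumed figures, testing pays off when one good-looking idea in ten is good and loses money when only one in a hundred is, no matter how many good untested ideas exist in absolute terms.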

To me, this seems like the big determining factor for whether individuals can expect to regularly find and exploit low-hanging fruit.

The closest Yudkowsky comes to addressing this topic is in sections 4-5 of the post “Living in an Inadequate World.”  There, he’s talking about the idea that even if many things are suboptimal, you should still expect a low base rate of exploitable suboptimalities in any arbitrarily/randomly chosen area.  He analogizes this to finding exploits in computer code:

Computer security professionals don’t attack systems by picking one particular function and saying, “Now I shall find a way to exploit these exact 20 lines of code!” Most lines of code in a system don’t provide exploits no matter how hard you look at them. In a large enough system, there are rare lines of code that are exceptions to this general rule, and sometimes you can be the first to find them. But if we think about a random section of code, the base rate of exploitability is extremely low—except in really, really bad code that nobody looked at from a security standpoint in the first place.
Thinking that you’ve searched a large system and found one new exploit is one thing. Thinking that you can exploit arbitrary lines of code is quite another.

This isn’t really the same issue I’m talking about – in the terms of this analogy, my question is “when you think you have found an exploit, but you can’t costlessly test it, how confident should you be that there is really an exploit?”

But he goes on to say something that seems relevant to my concern, namely that most of the time you think you have found an exploit, you won’t be able to usefully act on it:

Similarly, you do not generate a good startup idea by taking some random activity, and then talking yourself into believing you can do it better than existing companies. Even where the current way of doing things seems bad, and even when you really do know a better way, 99 times out of 100 you will not be able to make money by knowing better. If somebody else makes money on a solution to that particular problem, they’ll do it using rare resources or skills that you don’t have—including the skill of being super-charismatic and getting tons of venture capital to do it.
To believe you have a good startup idea is to say, “Unlike the typical 99 cases, in this particular anomalous and unusual case, I think I can make a profit by knowing a better way.”
The anomaly doesn’t have to be some super-unusual skill possessed by you alone in all the world. That would be a question that always returned “No,” a blind set of goggles. Having an unusually good idea might work well enough to be worth trying, if you think you can standardly solve the other standard startup problems. I’m merely emphasizing that to find a rare startup idea that is exploitable in dollars, you will have to scan and keep scanning, not pursue the first “X is broken and maybe I can fix it!” thought that pops into your head.
To win, choose winnable battles; await the rare anomalous case of, “Oh wait, that could work.”

The problem with this is that many people already include “pick your battles” as part of their procedure for determining whether an idea seems good.  People are more confident in their new ideas in areas where they have comparative advantages, and in areas where existing work is especially bad, and in areas where they know they can handle the implementation details (“the other standard startup problems,” in EY’s example).

Let’s grant that all of that is already part of the calculus that results in people singling out certain ideas as “looking good” – which seems clearly true, although doubtlessly many people could do better in this respect.  We still have no idea what fraction of good-looking ideas are actually good.

Or rather, I have some ideas on the topic, and I’m sure Yudkowsky does too, but he does not provide any arguments to sway anyone who is pessimistic on this issue.  Since optimism vs. pessimism on this issue strikes me as the one big question about low-hanging fruit, this leaves me feeling that the topic of low-hanging fruit has not really been addressed.

Yudkowsky mentions some examples of his own attempts to act upon good-seeming ideas.  To his credit, he mentions a failure (his ketogenic meal replacement drink recipe) as well as a success (stringing up 130 light bulbs around the house to treat his wife’s Seasonal Affective Disorder).  Neither of these were costless experiments.  He specifically mentions the monetary cost of testing the light bulb hypothesis:

The systematic competence of human civilization with respect to treating mood disorders wasn’t so apparent to me that I considered it a better use of resources to quietly drop the issue than to just lay down the ~$600 needed to test my suspicion.

His wife has very bad SAD, and the only other treatment that worked for her cost a lot more than this.  Given that the hypothesis worked, it was clearly a great investment.  But not all hypotheses work.  So before I do the test, how am I to know whether it’s worth $600?  What if the cost is greater than that, or the expected benefit less?  What does the right decision-making process look like, quantitatively?

Yudkowsky’s answer is that you can tell when good ideas in an area are likely to have been overlooked by analyzing the “adequacy” of the social structures that generate, test, and implement ideas.  But this is only one part of the puzzle.  At best, it tells us P(society hasn’t done it yet | it’s good).  But what we need is P(it’s good | society hasn’t done it yet).  And to get to one from the other, we need the prior probability of “it’s good,” as a function of the domain, my own abilities, and so forth.  How can we know this?  What if there are domains where society is inadequate yet good ideas are truly rare, and domains where society is fairly adequate but good ideas are so plentiful as to dominate the calculation?
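The step from one conditional probability to the other is just Bayes' theorem, and a small sketch shows how much the prior matters. Every number below is a hypothetical I picked to illustrate the two kinds of domain described above, and `posterior_good` is my own helper name.

```python
def posterior_good(prior_good, p_untried_given_good, p_untried_given_bad):
    """P(it's good | society hasn't done it yet), via Bayes' theorem."""
    numerator = p_untried_given_good * prior_good
    denominator = numerator + p_untried_given_bad * (1.0 - prior_good)
    return numerator / denominator


# Domain 1 (hypothetical): society is inadequate, so 90% of good ideas
# remain untried, but only 1 idea in 1000 is good to begin with.
inadequate_but_rare = posterior_good(
    prior_good=0.001, p_untried_given_good=0.9, p_untried_given_bad=0.99
)

# Domain 2 (hypothetical): society is fairly adequate, so only 20% of good
# ideas remain untried, but 30% of ideas are good to begin with.
adequate_but_plentiful = posterior_good(
    prior_good=0.3, p_untried_given_good=0.2, p_untried_given_bad=0.99
)

print(f"inadequate society, rare good ideas:    {inadequate_but_rare:.4f}")
print(f"adequate society, plentiful good ideas: {adequate_but_plentiful:.4f}")
```

Under these assumptions the "adequate" domain is the better place to look for untried good ideas, which is the scenario an adequacy analysis alone cannot rule out.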

In an earlier conversation about low-hanging fruit, tumblr user @argumate brought up the possibility that low-hanging fruit are basically impossible to find beforehand, but that society finds them by funding many different attempts and collecting on the rare successes.  That is, every individual attempt to pluck fruit is EV-negative given risk aversion, but a portfolio of such attempts (such as a venture capitalist’s portfolio) can be net-positive given risk aversion, because with many attempts the probability of one big success that pays for the rest (a “unicorn”) goes up.  It seems to me like this is plausible.
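Here is a rough sketch of how that could work, under assumptions that are mine rather than @argumate's. Risk aversion is modeled as log utility of wealth, and the "portfolio" is modeled as many people pooling one attempt each and splitting the payoffs equally. The success probability, payoff, and wealth figures are invented.

```python
import math


def p_at_least_one_success(n_attempts, p_success):
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p_success) ** n_attempts


def expected_log_wealth(n_pooled, p_success, wealth=10.0, cost=1.0, payoff=50.0):
    """Expected log-wealth of one participant when n_pooled people each fund
    one attempt and split all payoffs equally (n_pooled=1 means going solo)."""
    total = 0.0
    for k in range(n_pooled + 1):  # k = number of successful attempts
        prob_k = (
            math.comb(n_pooled, k)
            * p_success**k
            * (1.0 - p_success) ** (n_pooled - k)
        )
        total += prob_k * math.log(wealth - cost + payoff * k / n_pooled)
    return total


P_SUCCESS = 0.03                  # hypothetical per-attempt success rate
baseline = math.log(10.0)         # log-wealth from not experimenting at all
solo = expected_log_wealth(1, P_SUCCESS)
pooled = expected_log_wealth(100, P_SUCCESS)

print("P(at least one success in 100):", p_at_least_one_success(100, P_SUCCESS))
print("do nothing:", baseline, "solo:", solo, "pooled:", pooled)
```

With these numbers the solo attempt is worse than doing nothing in expected log-wealth, while a share in a pool of 100 attempts is better, even though each attempt has the same odds and the same positive raw expected payoff.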

Let me end on a positive note, though.  Even if the previous paragraph is accurate, it is a good thing for society if more individuals engage in experimentation (although it is a net negative for each of those individuals).  Because of this, the individual’s choice to experiment can still be justified on other terms – as a sort of altruistic expenditure, say, or as a way of kindling hope in the face of personal maladies like SAD (in which case it is like a more prosocial version of gambling).

Certainly there is something emotionally and aesthetically appealing about a resurgence of citizen science – about ordinary people looking at the broken, p-hacked, perverse-incentived edifice of Big Science and saying “empiricism is important, dammit, and if The Experts won’t do it, we will.” (There is precedent for this, and not just as a rich man’s game – there is a great chapter in The Intellectual Life of the British Working Classes about widespread citizen science efforts in the 19th C working class.)  I am pessimistic about whether my experiments, or yours, will bear fruit often enough to make the individual cost-benefit analysis work out, but that does not mean they should not be done.  Indeed, perhaps they should.

Comment by nostalgebraist on The Craft & The Community - A Post-Mortem & Resurrection · 2017-11-05T18:08:29.294Z · score: 3 (1 votes) · LW · GW

Thanks -- this is informative and I think it will be useful for anyone trying to decide what to make of your project.

I have disagreements about the "individual woman" example but I'm not sure it's worth hashing it out, since it gets into some thorny stuff about persuasion/rhetoric that I'm sure we both have strong opinions on.

Regarding MIRI, I want to note that although the organization has certainly become more competently managed, the more recent OpenPhil review included some very interesting and pointed criticism of the technical work, which I'm not sure enough people saw, as it was hidden in a supplemental PDF. Clearly this is not the place to hash out those technical issues, but they are worth noting, since the reviewer objections were more "these results do not move you toward your stated goal in this paper" than "your stated goal is pointless or quixotic," so if true they are identifying a rationality failure.

Comment by nostalgebraist on Normative assumptions: answers, emotions, and narratives · 2017-11-04T21:22:39.917Z · score: 6 (2 votes) · LW · GW

This seems like a very good perspective to me.

It made me think about the way that classic biases are often explained by constructing money pumps. A money pump is taken to be a clear, knock-down demonstration of irrationality, since "clearly" no one would want to lose arbitrarily large amounts of money. But in fact any money pump could be rational if the agent just enjoyed making the choices involved. If I greatly enjoyed anchoring on numbers presented to me, I might well pay a lot of extra money to get anchored; this would be like buying a kind of enjoyable product. Likewise someone might just get a kick out of making choices in intransitive loops, or hyperbolic discounting, or whatever. (In the reverse direction, if you didn't know I enjoyed some consumer good, you might think I was getting "money pumped" by paying for it again and again.)
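A toy version of the point, with a hypothetical agent and invented numbers: the money pump below extracts the same fees either way, and whether that counts as a loss depends entirely on an `enjoyment_per_trade` term that the trading data alone does not reveal.

```python
# Hypothetical agent with cyclic (intransitive) preferences:
# holding A it will pay to swap to B, holding B to C, holding C to A.
WILL_PAY_TO_SWAP_TO = {"A": "B", "B": "C", "C": "A"}


def run_money_pump(start_item, fee, n_trades, enjoyment_per_trade=0.0):
    """Walk the agent around its preference cycle, charging a fee per trade.

    Returns (final item, money paid, net welfare), where net welfare counts
    whatever the agent gets out of the act of trading itself.
    """
    item = start_item
    money_paid = 0.0
    for _ in range(n_trades):
        item = WILL_PAY_TO_SWAP_TO[item]
        money_paid += fee
    net_welfare = n_trades * enjoyment_per_trade - money_paid
    return item, money_paid, net_welfare


# Same observable behavior, two different readings of it.
item, paid, welfare_bias = run_money_pump("A", fee=1.0, n_trades=30)
_, _, welfare_hobby = run_money_pump(
    "A", fee=1.0, n_trades=30, enjoyment_per_trade=2.0
)

print(item, paid, welfare_bias, welfare_hobby)  # prints: A 30.0 -30.0 30.0
```

Both agents end up holding the item they started with and 30 units poorer; only the assumed enjoyment term separates "exploited" from "bought 30 enjoyable trades."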

So there is a missing step here, and to supply the step we need psychology. The reason these biases are biases and not values is "those aren't the sort of things we care about," but to formalize that, we need an account of "the sort of things we care about" which, as you say, can't be solved for from policy data alone.

Comment by nostalgebraist on The Craft & The Community - A Post-Mortem & Resurrection · 2017-11-04T19:58:13.996Z · score: 23 (8 votes) · LW · GW

This is somewhat closer to what I was asking for, but still mostly about group dynamics rather than engineering (or rather, the analogue of "engineering" on the other side of the rocket analogy). But I take your point that it's hard to talk about engineering if the team culture was so bad that no one ever tried any engineering.

I do think that it would be very helpful for you to give more specifics, even if they're specifics like "this person/organization is doing something that is stupid and irrelevant for these reasons." (If the engineers spent all their time working on their crackpot perpetual motion machine, describe that.)

Basically, I'm asking you to name names (of people/orgs), and give many more specifics about the names you have named (CFAR). This would require you to be less diplomatic than you have been so far, and may antagonize some people. But look, you're trying to get people to move to a different city (in most cases, a different country) to be part of your new project. You've mostly motivated that project by saying, in broad terms, that currently existing rationalists are doing most everything wrong. Moving to a different country is already a risky move, and the EV goes down sharply once some currently-existing-rationalist considers that the founder may well fundamentally disagree with their goals and assumptions. The only way to make this look positive-EV to very many individuals would be to show much more explicitly which sorts of currently-existing-rationalists you disapprove of, so that individuals are able to say "okay, that's not me."

See e.g. this bit from one of your other comments

This depends entirely on how you measure it. If I was to throw all other goals under the bus for the sake of proving you wrong, I'm pretty sure I could find enough women to nod along to a watered down version. If instead we're going for rationalist Rationalists then a lot of the fandom people wouldn't make the cut and I suspect if we managed to outdo tech, we would be beating The Bay.

Consider an individual woman trying to decide whether to move to Manchester from Berkeley. As it stands, you have a not explicitly stated theory of what makes the good kind of rationalist, such that many/most female Berkeley rats do not qualify. Without further information, the woman will conclude that she probably does not qualify, in which case she's definitely not going to move. The only way to fix this is to articulate the theory directly so individuals can check themselves against it.

Comment by nostalgebraist on The Craft & The Community - A Post-Mortem & Resurrection · 2017-11-04T01:33:29.091Z · score: 17 (6 votes) · LW · GW

"Most startups fail" is important to keep in mind here, and I now realize my previous comment implied more focus on why MetaMed specifically failed than is warranted. I still stand by the "build a rocket" analogy, but it should be applied broadly, and not just to the highest-profile projects (and not just to projects that actually materialized enough to fail).

Is there some sort of list of Craft-and-Community-relevant projects that have been attempted since 2009, besides just the meetup groups? If not, should there be one?

Comment by nostalgebraist on The Craft & The Community - A Post-Mortem & Resurrection · 2017-11-03T17:41:25.376Z · score: 26 (9 votes) · LW · GW

I am glad that this topic is being discussed. But IMO, this post contains too much about external factors that might have impeded the Craft-and-Community project, and not enough on what project work was done, why that work didn't succeed, and/or why it wasn't enough.

There have been a number of rationalist-branded organizations that tried to spread, develop, or apply LW-rationality. The main examples I have in mind are MetaMed and CFAR. My ideal postmortem would include a lot of words about why MetaMed failed, whether CFAR failed, whether the results in either case are surprising in hindsight, etc. This post doesn't mention MetaMed, and only mentions CFAR briefly. And while I share your negative assessment of CFAR, you don't talk in any detail about how you came to this assessment or what lessons might be learned from it.

While it may be true that cultural, economic, etc. factors indirectly caused the Craft-and-Community project to fail, there is an intermediate causal layer, the actual things people did (or avoided doing) to further the project. If your boss tells you to build a rocket by next month, and next month there is no rocket, and your boss asks why, you can say things like "we were thinking about the problem in the wrong way" or "our team has a dysfunctional dynamic," and these may well be true facts, but it is also important to address what you actually did, and why it didn't produce the desired rocket.

I wrote more about this on tumblr, but my own suspicion is that the biggest challenges to face in building a world-improving community are organizational, and the Craft-and-Community project failed because LW-rationality was almost entirely about improving individual judgment. Hence, we shouldn't have expected to get results back then, and we shouldn't expect results now, (at least) until someone does promising work collecting/systemizing/testing management advice, or something in that general area.

Comment by nostalgebraist on One-Magisterium Bayes · 2017-06-30T21:39:27.555Z · score: 6 (6 votes) · LW · GW

As the author of the post you linked in the first paragraph, I may be able to provide some useful context, at least for that particular post.

Arguments for and against Strong Bayesianism have been a pet obsession of mine for a long time, and I've written a whole bunch about them over the years. (Not because I thought it was especially important to do so, just because I found it fun.) The result is that there are a bunch of (mostly) anti-Bayes arguments scattered throughout several years of posts on my tumblr. For quite a while, I'd had "put a bunch of that stuff in a single place" on my to-do list, and I wrote that post just to check that off my to-do list. Almost none of the material in there is new, and nothing in there would surprise anyone who had been keeping up with the Bayes-related posts on my tumblr. Writing the post was housekeeping, not nailing 95 theses on a church door.

As you might expect, I disagree with a number of the more specific/technical claims you've made in this post, but I am with you in feeling like these arguments are retreading old ground, and I'm at the point where writing more words on the internet about Bayes has mostly stopped being fun.

It's also worth noting that my relation to the rationalist community is not very goal-directed. I like talking to rationalists, I do it all the time on tumblr and discord and sometimes in meatspace, and I find all the big topics (including AGI stuff) fun to talk about. I am not interested in pushing the rationalist community in one direction or another; if I argue about Bayes or AGI, it's in order to have fun and/or because I value knowledge and insight (etc.) in general, not because I am worried that rationalists are "wasting time" on those things when they could be doing some other great thing I want them to do. Stuff like "what does it even mean to be a non-Bayesian rationalist?" is mostly orthogonal to my interests, since to me "rationalists" just means "a certain group of people whose members I often enjoy talking to."

Comment by nostalgebraist on In praise of gullibility? · 2015-06-19T16:47:50.429Z · score: 6 (6 votes) · LW · GW

I'm not sure if I'm understanding you correctly, but the reason why climate forecasts and meteorological forecasts have different temporal ranges of validity is not that the climate models are coarser, it's that they're asking different questions.

Climate is (roughly speaking) the attractor on which the weather chaotically meanders on short (e.g. weekly) timescales. On much longer timescales (1-100+ years) this attractor itself shifts. Weather forecasts want to determine the future state of the system itself as it evolves chaotically, which is impossible in principle after ~14 days because the system is chaotic. Climate forecasts want to track the slow shifts of the attractor. To do this, they run ensembles with slightly different initial conditions and observe the statistics of the ensemble at some future date, which is taken (via an ergodic assumption) to reflect the attractor at that date. None of the ensemble members are useful as "weather predictions" for 2050 or whatever, but their overall statistics are (it is argued) reliable predictions about the attractor on which the weather will be constrained to move in 2050 (i.e. "the climate in 2050").

It's analogous to the way we can precisely characterize the attractor in the Lorenz system, even if we can't predict the future of any given trajectory in that system because it's chaotic. (For a more precise analogy, imagine a version of the Lorenz system in which the attractor slowly changes over long time scales.)
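For anyone who wants to see the Lorenz analogy numerically, here is a minimal pure-Python sketch. It is a simplification of what I described: instead of a real ensemble it compares just two runs whose starting points differ by 1e-8, and it uses the long-run time average of z as a stand-in for "statistics of the attractor." The parameter values are the standard textbook ones; the step size, run length, and helper names are my own choices.

```python
import math


def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical Runge-Kutta (RK4) step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz(state)
    k2 = lorenz(shifted(state, k1, dt / 2))
    k3 = lorenz(shifted(state, k2, dt / 2))
    k4 = lorenz(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def run(initial, dt=0.01, n_steps=12000, n_discard=2000):
    """Integrate, drop the initial transient, return (path, time-mean of z)."""
    state, path = initial, []
    for step in range(n_steps):
        state = rk4_step(state, dt)
        if step >= n_discard:
            path.append(state)
    mean_z = sum(p[2] for p in path) / len(path)
    return path, mean_z


path_a, mean_z_a = run((1.0, 1.0, 1.0))
path_b, mean_z_b = run((1.0, 1.0, 1.0 + 1e-8))  # tiny perturbation

# "Weather": the largest distance between the two runs at the same time.
max_gap = max(math.dist(p, q) for p, q in zip(path_a, path_b))

print("largest gap between the two trajectories:", max_gap)
print("time-mean of z, run A:", mean_z_a, " run B:", mean_z_b)
```

The two runs end up in completely different places at any given time, so neither is a usable "forecast" of the other, yet their long-run averages come out close to each other, because both are sampling the same attractor.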

A simple way to explain the difference is that you have no idea what the weather will be in any particular place on June 19, 2016, but you can be pretty sure that in the Northern Hemisphere it will be summer in June 2016. This has nothing to do with differences in numerical model properties (you aren't running a numerical model in your head), it's just a consequence of the fact that climate and weather are two different things.

Apologies if you know all this. It just wasn't clear to me if you did from your comment, and I thought I might spell it out since it might be valuable to someone reading the thread.

Comment by nostalgebraist on Beautiful Probability · 2012-12-27T22:43:08.175Z · score: 2 (4 votes) · LW · GW

Sure, the likelihoods are the same in both cases, since A and B's probability distributions assign the same probability to any sequence that is in both of their supports. But the distributions are still different, and various functionals of them are still different -- e.g., the number of tails, the moments (if we convert heads and tails to numbers), etc.

If you're a Bayesian, you think any hypothesis worth considering can predict a whole probability distribution, so there's no reason to worry about these functionals when you can just look at the probability of your whole data set given the hypothesis. If (as in actual scientific practice, at present) you often predict functionals but not the whole distribution, then the difference in the functionals matters. (I admit that the coin example is too basic here, because in any theory about a real coin, we really would have a whole distribution.)

My point is just that there are differences between the two cases. Bayesians don't think these differences could possibly matter to the sort of hypotheses they are interested in testing, but that doesn't mean that in principle there can be no reason to differentiate between the two.

Comment by nostalgebraist on Beautiful Probability · 2012-12-27T17:51:10.895Z · score: 2 (4 votes) · LW · GW

Incidentally, Eliezer, I don't think you're right about the example at the beginning of the post. The two frequentist tests are asking distinct questions of the data, and there is not necessarily any inconsistency when we ask two different questions of the same data and get two different answers.

Suppose A and B are tossing coins. A and B both get the same string of results -- a whole bunch of heads (let's say 9999) followed by a single tail. But A got this by just deciding to flip a coin 10000 times, while B got it by flipping a coin until the first tail came up. Now suppose they each ask the question "what is the probability that, when doing what I did, one will come up with at most the number of tails I actually saw?"

In A's case the answer is of course very small; most strings of 10000 flips have many more than one tail. In B's case the answer is of course 1; B's method ensures that exactly one tail is seen, no matter what happens. The data was the same, but the questions were different, because of the "when doing what I did" clause (since A and B did different things). Frequentist tests are often like this -- they involve some sort of reasoning about hypothetical repetitions of the procedure, and if the procedure differs, the question differs.
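The two answers can be computed exactly. This sketch assumes a fair coin under the hypothesis being tested (the example above doesn't fix the hypothesis, so that part is my assumption), and the function name is mine.

```python
from fractions import Fraction
from math import comb


def p_at_most_k_tails_fixed_n(n_flips, k, p_tail=Fraction(1, 2)):
    """P(at most k tails) when the plan is to flip exactly n_flips times."""
    return sum(
        comb(n_flips, j) * p_tail**j * (1 - p_tail) ** (n_flips - j)
        for j in range(k + 1)
    )


# A's procedure: flip 10000 times no matter what, then count tails.
p_a = p_at_most_k_tails_fixed_n(10000, 1)

# B's procedure: flip until the first tail. Every completed run contains
# exactly one tail, so "at most one tail" has probability 1.
p_b = Fraction(1)

print("A:", float(p_a))  # vanishingly small; underflows to 0.0 as a float
print("B:", float(p_b))
print("A exactly equals 10001 / 2**10000:", p_a == Fraction(10001, 2**10000))
```

Same data, same fair-coin hypothesis, and the answers are 10001/2^10000 versus 1, purely because "when doing what I did" refers to different procedures.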

If we wanted to restate this in Bayesian terms, we'd have to note that the interpreter knows what the method is, not just what the data is, and the distributions used by a Bayesian interpreter should reflect this. For instance, one would be a pretty dumb Bayesian if one's predictive distribution for B's method didn't say you'd get one tail with probability one. The observation that's causing us to update isn't "string of data," it's "string of data produced by a given physical process," where the process is different in the two cases.

(I apologize if this has all been mentioned before -- I didn't carefully read all the comments above.)

Comment by nostalgebraist on Beautiful Probability · 2012-12-24T10:20:20.911Z · score: 4 (4 votes) · LW · GW

"Bayesianism's coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains)."

I've never understood why I should be concerned about dynamic Dutch books (which are the justification for conditionalization, i.e., the Bayesian update). I can understand how static Dutch books are relevant to finding out the truth: I don't want my description of the truth to be inconsistent. But a dynamic Dutch book (in the gambling context) is a way that someone can exploit the combination of my belief at time (t) and my belief at time (t+1) to get something out of me, which doesn't seem like it should carry over to the context of trying to find out the truth. When I want to find the truth, I simply want to have the best possible belief in the present -- at time (t+1) -- so why should "money" I've "lost" at time (t) be relevant?

Perhaps I simply want to avoid getting screwed in life by falling into the equivalents of Dutch books in real, non-gambling-related situations. But if that's the argument, it should depend on how frequently such situations actually crop up -- the mere existence of a Dutch book shouldn't matter if life is never going to make me take it. Why should my entire notion of rationality be based on avoiding one particular -- perhaps rare -- type of misfortune? On the other hand, if the argument is that falling for dynamic Dutch books constitutes "irrationality" in some direct intuitive sense (the same way that falling for static Dutch books does), then I'm not getting it.
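For concreteness, here is a minimal Python sketch of one standard way a dynamic Dutch book of the kind discussed above can be constructed. All the numbers and names are hypothetical. The assumptions: at time (t) the agent has credence `e` in some evidence E and conditional credence `q` in H given E, but plans to adopt credence `r` in H, with `r < q`, after learning E. Each bet is priced at what the agent regards as fair at the moment it is made.

```python
from fractions import Fraction


def agent_net_payoffs(e, q, r):
    """Agent's net payoff in each possible world, for three individually fair bets.

    e: credence in E at time t
    q: conditional credence in H given E at time t
    r: credence in H the agent plans to hold after learning E
       (r != q means the agent does not conditionalize)
    """
    d = q - r
    payoffs = {}
    for E in (False, True):
        for H in (False, True):
            # Bet 1 (time t): agent pays q for $1 if H; called off and refunded if not E.
            bet1 = ((1 if H else 0) - q) if E else 0
            # Bet 2 (time t): agent pays e*d for a ticket that pays d if E.
            bet2 = (d if E else 0) - e * d
            # Bet 3 (time t+1, offered only if E): agent sells a $1 bet on H for r.
            bet3 = (r - (1 if H else 0)) if E else 0
            payoffs[(E, H)] = bet1 + bet2 + bet3
    return payoffs


# Hypothetical credences: P(E) = 1/2, P(H | E) = 3/4, planned post-E credence 1/2.
payoffs = agent_net_payoffs(Fraction(1, 2), Fraction(3, 4), Fraction(1, 2))

# With r == q (conditionalization) the same three bets net out to zero everywhere.
payoffs_conditionalizer = agent_net_payoffs(
    Fraction(1, 2), Fraction(3, 4), Fraction(3, 4)
)
```

With these numbers the agent's net payoff is `-e * (q - r)`, i.e. -1/8, in every possible world, even though no single bet looked unfair when it was accepted. Note that the loss only materializes if someone actually offers all three bets across both times, which is exactly the kind of situation whose frequency is in question above.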