The First Sample Gives the Most Information

post by Mark Xu (mark-xu) · 2020-12-24T20:39:04.936Z · LW · GW · 8 comments

This is a link post for https://markxu.com/first-sample

I originally heard this point made by Ben Pace in Episode 126 of the Bayesian Conspiracy Podcast. Ben claimed that he learned this from the book How to Measure Anything, but when I found what I think is the relevant section, the point wasn't made explicitly there.

Suppose that I came up to you and asked you for a 90% confidence interval for the weight of a wazlot. I'm guessing you would not really know where to start. However, suppose that I randomly sampled a wazlot and told you it weighed 142 grams. I'm guessing you would now have a much better idea of your 90% confidence interval (although you still wouldn't have that good a guess at the width).

In general, if you are very ignorant about something, the first instance of that thing will tell you what domain you're operating in. If you have no idea how much something weighs, knowing the weight of a single instance tells you what the reasonable orders of magnitude are. Things that sometimes weigh 142 grams don't typically also sometimes weigh 12 solar masses. Similarly, things that take 5 minutes don't typically also take 5 days, and things that are 5 cm long aren't typically also 5 km long.

For more abstract concepts, having a single sample allows you to locate the concept in concept space by anchoring it to thing space. "Redness" cannot be properly understood until it is known that "apples are red". "Functions" are incomprehensible until you know "adding one to a number" is a function. "Resources" are vague until you learn that "money is a resource".

In reality, the first sample often gives you more information than a random sample. If I ask a friend for an example of a snack, they're not going to randomly sample a snack and tell me about it; they're probably going to pick a snack that is at the center of the space of all snacks, like potato chips.

From an information-theoretic perspective, the expected amount of information gained is highest for the first sample. If the samples are independent and identically distributed, the second sample is expected to be more predictable than the first, because you already know the first. There is some chance that the first sample is misleading, but the more misleading a sample would be, the less likely you are to draw it, so you don't expect the first sample to be misleading. If you're very ignorant, your best guess for the mean of a distribution is pretty close to the mean of the samples you have, even if you only have one.
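
As a minimal sketch of the information-theoretic claim, assume a toy model that is an illustration only and not something taken from the podcast or the book: the unknown mean has a broad Gaussian prior with variance `prior_var`, and each sample is the mean plus Gaussian noise with known variance `noise_var`. Under those assumptions, the information the n-th sample adds about the mean is half the log of the ratio of posterior precisions after and before that sample. The function name and parameter values below are hypothetical.

```python
import math


def info_gain_nats(n, prior_var, noise_var):
    """Information (in nats) that the n-th sample adds about the mean.

    Assumed toy model: mean ~ Normal(0, prior_var), and each sample is
    Normal(mean, noise_var) with noise_var known. In this conjugate model
    the posterior precision is 1/prior_var + (number of samples)/noise_var.
    """
    precision_before = 1.0 / prior_var + (n - 1) / noise_var
    precision_after = 1.0 / prior_var + n / noise_var
    # The gain is the drop in differential entropy of the posterior,
    # which for Gaussians is half the log of the precision ratio.
    return 0.5 * math.log(precision_after / precision_before)


# A very broad prior stands in for "being very ignorant".
gains = [info_gain_nats(n, prior_var=1e6, noise_var=1.0) for n in range(1, 6)]

for n, gain in enumerate(gains, start=1):
    print(f"sample {n}: {gain:.3f} nats")
```

With these illustrative numbers the first sample is worth roughly 6.9 nats and the second roughly 0.35, and every later sample is worth less than the one before it. The broader the prior, the more lopsided the comparison becomes.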

This is one perspective on why asking for examples is so powerful; they typically give you the first sample, which contains the most information.

8 comments

Comments sorted by top scores.

comment by Austin Chen (austin-chen) · 2020-12-25T20:41:11.334Z · LW(p) · GW(p)

This is a really powerful concept; I can immediately think of at least two fields this applies to:

  • When you're not sure how to build a software user interface, you might think "let's run an A/B test on 1000 people and see which performs better". But you'll get 90 percent of the value just by showing it to one or two users and watching them use it, live.

  • When you're learning to cook, one of the first things they teach you is to sample your food throughout. The first sip or bite will immediately tell you how to adjust the recipe (e.g., add more salt, something spicy, or a dash of vinegar).

comment by kpreid · 2020-12-29T18:09:41.088Z · LW(p) · GW(p)

I like this post and am not intending to argue against its point by the following:

I read the paragraph about orders of magnitude and immediately started thinking about whether there are good counterexamples. Here are two: wires are used in lengths from nanometers to kilometers, and computer programs as a category run for times from milliseconds to weeks (even considering only those which are intended to have a finite task and not to continue running until cancelled).

Common characteristics of these two examples are that they are one-dimensional (no “square-cube law” limits scaling) and that they are arguably in some sense the most extensible solutions to their problem domains (a wire is the form that arbitrary length electrical conductors take, and most computer programs are written in Turing-complete languages).

Perhaps the caveat is merely that “some things scale freely such that the order of magnitude is no new information and you need to look at different properties of the thing”.

comment by Ben Pace (Benito) · 2020-12-24T20:43:37.878Z · LW(p) · GW(p)

Well now I have a post to link to for this point. Thanks! :)

comment by adamShimi · 2020-12-26T12:03:49.930Z · LW(p) · GW(p)

Nice neat little post.

Maybe a caveat I would add is that when your friend gives you a sample, they probably give one from the center of their own concept space for the subject. Theirs is probably quite similar to most others', but there might be differences. Note that this isn't a problem when giving examples to clarify some of your points, because there the whole point is to transmit your concept space.

comment by noggin-scratcher · 2020-12-24T23:39:05.085Z · LW(p) · GW(p)

Game idea: give one player a category, and see how many (misleadingly non-central) examples they can provide, without giving away what the category is to the guesses of the rest of the group.

Replies from: mark-xu
comment by Mark Xu (mark-xu) · 2020-12-24T23:41:26.058Z · LW(p) · GW(p)

The gameplay in Decrypto, Chameleon, and Spyfall is similar to the game you just suggested.

comment by AllAmericanBreakfast · 2021-04-09T00:51:52.267Z · LW(p) · GW(p)

More precisely, the first sample gives the most information about the mean. Learning one person's income tells you a lot about incomes in general, even though incomes are heavy-tailed.

Imagine you had no prior knowledge of how wealthy people are on Earth, or even how to think about the concept of "wealth." For you, the meaning of the term is as inscrutable as the term "flargibargh." You might sample a very poor person, and think everybody's living in poverty. You might sample a middle-class person, and miss the existence of the very rich and poor. You might (unlikely) sample a billionaire and think everybody's incredibly wealthy.

However, those samples help you avoid the mistakes of thinking that wealth is commonly extremely negative, or of a gigantic magnitude (i.e. on the order of Avogadro's number). Any one of them gets you vastly closer to the mean than you might land if you had absolutely zero knowledge of what the concept of "wealth" refers to, and didn't even know that it's a word to measure something relevant to humans (in which domain manageable numbers are common).

However, the first sample gives you no information about the shape of the distribution. As the problem above illustrates, sampling one person tells you nothing about whether wealth is distributed on a bell curve, is heavy-tailed, is exactly even, is linear, or takes some other form.

It's very important to gain the skill of "get a sample or example" when dealing with new territory. At the same time, you need to understand what that sample does or does not tell you. Mistakenly thinking that a sample gives you information about X can lead you to make decisions based on that illusory "information," when if you'd known your ignorance better you might not have acted.

And then, of course, it's important to make sure that your sample is actually a sample of what you think it is...

comment by Liron · 2020-12-26T18:19:18.128Z · LW(p) · GW(p)

Agree. Not only is asking “what’s an example” generally highly productive, it’s about 80% as productive as asking “what are two examples”.