# How to Measure Anything

post by lukeprog · 2013-08-07T04:05:58.366Z · score: 64 (60 votes) · LW · GW · Legacy · 50 comments

## Contents

- Basic Ideas
- Applied Information Economics
- Step 1: Define a decision problem and the relevant variables
- Step 2: Determine what you know
  - Uncertainty and calibration
  - Simulations
- Step 3: Pick a variable, and compute the value of additional information for that variable
- Step 4: Apply the relevant measurement instrument(s) to the high-information-value variable
  - Selecting a measurement method
  - Sampling reality
  - Bayes
  - Other methods
- Step 5: Make a decision and act on it
- Phase 0: Project Preparation
- Phase 1: Decision Modeling
- Phase 2: Optimal measurements
- Phase 3: Decision optimization and the final recommendation
- Final thoughts

Douglas Hubbard’s *How to Measure Anything* is one of my favorite how-to books. I hope this summary inspires you to buy the book; it’s worth it.

The book opens:

Anything can be measured. If a thing can be observed in any way at all, it lends itself to some type of measurement method. No matter how “fuzzy” the measurement is, it’s still a measurement if it tells you more than you knew before. And those very things most likely to be seen as immeasurable are, virtually always, solved by relatively simple measurement methods.

The sciences have many established measurement methods, so Hubbard’s book focuses on the measurement of “business intangibles” that are important for decision-making but tricky to measure: things like management effectiveness, the “flexibility” to create new products, the risk of bankruptcy, and public image.

### Basic Ideas

A *measurement* is an observation that quantitatively reduces uncertainty. Measurements might not yield precise, certain judgments, but they *do* reduce your uncertainty.

To be measured, the *object of measurement* must be described clearly, in terms of observables. A good way to clarify a vague object of measurement like “IT security” is to ask “What is IT security, and why do you care?” Such probing can reveal that “IT security” means things like a reduction in unauthorized intrusions and malware attacks, which the IT department cares about because these things result in lost productivity, fraud losses, and legal liabilities.

*Uncertainty* is the lack of certainty: the true outcome/state/value is not known.

*Risk* is a state of uncertainty in which some of the possibilities involve a loss.

Much pessimism about measurement comes from a lack of experience making measurements. Hubbard, who is *far* more experienced with measurement than his readers, says:

- Your problem is not as unique as you think.
- You have more data than you think.
- You need less data than you think.
- An adequate amount of new data is more accessible than you think.

### Applied Information Economics

Hubbard calls his method “Applied Information Economics” (AIE). It consists of 5 steps:

1. Define a decision problem and the relevant variables. (Start with the decision you need to make, then figure out which variables would make your decision easier if you had better estimates of their values.)
2. Determine what you know. (Quantify your uncertainty about those variables in terms of ranges and probabilities.)
3. Pick a variable, and compute the value of additional information for that variable. (Repeat until you find a variable with reasonably high information value. If no remaining variables have enough information value to justify the cost of measuring them, skip to step 5.)
4. Apply the relevant measurement instrument(s) to the high-information-value variable. (Then go back to step 3.)
5. Make a decision and act on it. (When you’ve done as much uncertainty reduction as is economically justified, it’s time to act!)

These steps are elaborated below.

### Step 1: Define a decision problem and the relevant variables

Hubbard illustrates this step by telling the story of how he helped the Department of Veterans Affairs (VA) with a measurement problem.

The VA was considering seven proposed IT security projects. They wanted to know “which… of the proposed investments were justified and, after they were implemented, whether improvements in security justified further investment…” Hubbard asked his standard questions: “What do you mean by ‘IT security’? Why does it matter to you? What are you observing when you observe improved IT security?”

It became clear that *nobody* at the VA had thought about the details of what “IT security” meant to them. But after Hubbard’s probing, it became clear that by “IT security” they meant a reduction in the frequency and severity of some undesirable events: agency-wide virus attacks, unauthorized system access (external or internal), unauthorized physical access, and disasters affecting the IT infrastructure (fire, flood, etc.). And each undesirable event was on the list because of specific costs associated with it: productivity losses from virus attacks, legal liability from unauthorized system access, etc.

Now that the VA knew what they meant by “IT security,” they could measure specific variables, such as the number of virus attacks per year.

### Step 2: Determine what you know

#### Uncertainty and calibration

The next step is to determine your level of uncertainty about the variables you want to measure. To do this, you can express a “confidence interval” (CI). A 90% CI is a range of values that is 90% likely to contain the correct value. For example, the security experts at the VA were 90% confident that each agency-wide virus attack would affect between 25,000 and 65,000 people.

Unfortunately, few people are well-calibrated estimators. For example, in some studies, the true value lay in subjects’ 90% CIs only 50% of the time! These subjects were overconfident. For a well-calibrated estimator, the true value will lie in her 90% CI roughly 90% of the time.

Luckily, “assessing uncertainty is a general skill that can be taught with a measurable improvement.”

Hubbard uses several methods to calibrate each client’s value estimators, such as the security experts at the VA who needed to estimate the frequency of security breaches and their likely costs.

His first technique is the *equivalent bet test*. Suppose you’re asked to give a 90% CI for the year in which Newton published the universal laws of gravitation, and you can win $1,000 in one of two ways:

1. You win $1,000 if the true year of publication falls within your 90% CI. Otherwise, you win nothing.
2. You spin a dial divided into two “pie slices,” one covering 10% of the dial, and the other covering 90%. If the dial lands on the small slice, you win nothing. If it lands on the big slice, you win $1,000.

If you find yourself preferring option #2, then you must think spinning the dial has a higher chance of winning you $1,000 than option #1. That suggests your stated 90% CI isn’t really your 90% CI. Maybe it’s your 65% CI or your 80% CI instead. By preferring option #2, your brain is trying to tell you that your originally stated 90% CI is overconfident.

If instead you find yourself preferring option #1, then you must think there is *more* than a 90% chance your stated 90% CI contains the true value. By preferring option #1, your brain is trying to tell you that your original 90% CI is underconfident.

To make a better estimate, adjust your 90% CI until option #1 and option #2 seem equally good to you. Research suggests that even *pretending* to bet money in this way will improve your calibration.

Hubbard’s second method for improving calibration is simply *repetition and feedback*. Make lots of estimates and then see how well you did. For this, play CFAR’s Calibration Game.
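As a rough illustration of the "see how well you did" step, here is a minimal Python sketch that scores a batch of stated 90% CIs against the true answers. The function name `hit_rate`, the intervals, and the true values are all made up for illustration; they are not from Hubbard's book.

```python
def hit_rate(intervals, truths):
    """Fraction of (low, high) intervals that contain the matching true value."""
    hits = sum(low <= truth <= high for (low, high), truth in zip(intervals, truths))
    return hits / len(truths)

# Hypothetical 90% CIs from one estimator, paired with hypothetical true answers.
stated_intervals = [(1600, 1700), (10, 50), (200, 400), (5, 8), (1000, 3000)]
true_values = [1687, 62, 310, 9, 2500]

rate = hit_rate(stated_intervals, true_values)
print(f"True value fell inside the stated 90% CI {rate:.0%} of the time")
```

A well-calibrated estimator would score close to 90% over many questions. A score far below that, as in this toy data, points to overconfidence. Five questions are far too few to judge anyone, so treat this only as the shape of the bookkeeping.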

Hubbard also asks people to identify reasons why a particular estimate might be right, and why it might be wrong.

He also asks people to look more closely at each bound (upper and lower) on their estimated range. A 90% CI “means there is a 5% chance the true value could be greater than the upper bound, and a 5% chance it could be less than the lower bound. This means the estimators must be 95% sure that the true value is less than the upper bound. If they are not that certain, they should increase the upper bound… A similar test is applied to the lower bound.”

#### Simulations

Once you determine what you know about the uncertainties involved, how can you use that information to determine what you know about the *risks* involved? Hubbard summarizes:

…all risk in any project… can be expressed by one method: the ranges of uncertainty on the costs and benefits, and probabilities on events that might affect them.

The simplest tool for measuring such risks accurately is the Monte Carlo (MC) simulation, which can be run by Excel and many other programs. To illustrate this tool, suppose you are wondering whether to lease a new machine for one step in your manufacturing process.

The one-year lease [for the machine] is $400,000 with no option for early cancellation. So if you aren’t breaking even, you are still stuck with it for the rest of the year. You are considering signing the contract because you think the more advanced device will save some labor and raw materials and because you think the maintenance cost will be lower than the existing process.

Your pre-calibrated estimators give their 90% CIs for the following variables:

- Maintenance savings (MS): $10 to $20 per unit
- Labor savings (LS): -$2 to $8 per unit
- Raw materials savings (RMS): $3 to $9 per unit
- Production level (PL): 15,000 to 35,000 units per year

Thus, your annual savings will equal (MS + LS + RMS) × PL.

When measuring risk, we don’t just want to know the “average” risk or benefit. We want to know the probability of a huge loss, the probability of a small loss, the probability of a huge savings, and so on. That’s what Monte Carlo can tell us.

An MC simulation uses a computer to randomly generate thousands of possible values for each variable, based on the ranges we’ve estimated. The computer then calculates the outcome (in this case, the annual savings) for each generated combination of values, and we’re able to see how often different kinds of outcomes occur.

To run an MC simulation we need not just the 90% CI for each variable but also the *shape* of each distribution. In many cases, the normal distribution will work just fine, and we’ll use it for all the variables in this simplified illustration. (Hubbard’s book shows you how to work with other distributions).

To make an MC simulation of a normally distributed variable in Excel, we use this formula:

=norminv(rand(), mean, standard deviation)

So the formula for the maintenance savings variable should be:

=norminv(rand(), 15, (20-10)/3.29)

Suppose you enter this formula on cell A1 in Excel. To generate (say) 10,000 values for the maintenance savings value, just (1) copy the contents of cell A1, (2) enter “A1:A10000” in the cell range field to select cells A1 through A10000, and (3) paste the formula into all those cells.

Now we can follow this process in other columns for the other variables, including a column for the “total savings” formula. To see how many rows made a total savings of $400,000 or more (break-even), use Excel’s countif function. In this case, you should find that about 14% of the scenarios resulted in a savings of less than $400,000 – a loss.

We can also make a histogram (see right) to show how many of the 10,000 scenarios landed in each $100,000 increment (of total savings). This is even more informative, and tells us a great deal about the distribution of risk and benefits we might incur from investing in the new machine. (Download the full spreadsheet for this example here.)
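For readers who prefer code to spreadsheets, the same simulation can be sketched in Python. This is a minimal sketch under the same simplifying assumption as above (every variable is normally distributed and independent); the helper names `draw` and `simulate_savings` and the fixed random seed are my assumptions, not part of Hubbard's spreadsheet.

```python
import random
from statistics import NormalDist

# A 90% CI on a normal distribution spans about 3.29 standard deviations
# (2 * 1.645), which is where the 3.29 in the Excel formula comes from.
CI90_WIDTH_IN_SDS = 2 * NormalDist().inv_cdf(0.95)

def draw(rng, low, high):
    """Draw one value from a normal distribution whose 90% CI is (low, high)."""
    mean = (low + high) / 2
    sd = (high - low) / CI90_WIDTH_IN_SDS
    return rng.gauss(mean, sd)

def simulate_savings(trials=10_000, seed=0):
    """Return a list of simulated annual savings, one per scenario."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        ms = draw(rng, 10, 20)          # maintenance savings per unit
        ls = draw(rng, -2, 8)           # labor savings per unit
        rms = draw(rng, 3, 9)           # raw materials savings per unit
        pl = draw(rng, 15_000, 35_000)  # production level, units per year
        results.append((ms + ls + rms) * pl)
    return results

LEASE_COST = 400_000
savings = simulate_savings()
loss_fraction = sum(s < LEASE_COST for s in savings) / len(savings)
print(f"Scenarios that fail to break even: {loss_fraction:.1%}")
```

The exact percentage varies slightly from run to run and seed to seed, but it should land near the figure quoted above. Bucketing `savings` into $100,000 increments gives the histogram.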

The simulation concept can (and in high-value cases *should*) be carried beyond this simple MC simulation. The first step is to learn how to use a greater variety of distributions in MC simulations. The second step is to deal with correlated (rather than independent) variables by generating correlated random numbers or by modeling what the variables have in common.

A more complicated step is to use a Markov simulation, in which the simulated scenario is divided into many time intervals. This is often used to model stock prices, the weather, and complex manufacturing or construction projects. Another more complicated step is to use an agent-based model, in which independently-acting agents are simulated. This method is often used for traffic simulations, in which each vehicle is modeled as an agent.

### Step 3: Pick a variable, and compute the value of additional information for that variable

Information can have three kinds of value:

- Information can affect people’s behavior (e.g. common knowledge of germs affects sanitation behavior).
- Information can have its own market value (e.g. you can sell a book with useful information).
- Information can reduce uncertainty about important decisions. (This is what we’re focusing on here.)

When you’re uncertain about a decision, this means there’s a chance you’ll make a non-optimal choice. The cost of a “wrong” decision is the difference between the wrong choice and the choice you would have made with perfect information. But it’s too costly to acquire perfect information, so instead we’d like to know which decision-relevant variables are the *most* valuable to measure more precisely, so we can decide which measurements to make.

Here’s a simple example:

Suppose you could make $40 million profit if [an advertisement] works and lose $5 million (the cost of the campaign) if it fails. Then suppose your calibrated experts say they would put a 40% chance of failure on the campaign.

The expected opportunity loss (EOL) for a choice is the probability of the choice being “wrong” times the cost of it being wrong. So for example the EOL if the campaign is approved is $5M × 40% = $2M, and the EOL if the campaign is rejected is $40M × 60% = $24M.
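The arithmetic is simple enough to check in a few lines of Python. The function name below is just a label I chose for the definition in the previous paragraph.

```python
def expected_opportunity_loss(p_wrong, cost_if_wrong):
    """EOL = chance the choice turns out wrong, times what being wrong costs."""
    return p_wrong * cost_if_wrong

p_fail = 0.40  # calibrated experts' chance that the campaign fails

# Approving is "wrong" if the campaign fails: we lose the $5M campaign cost.
eol_approve = expected_opportunity_loss(p_fail, 5_000_000)

# Rejecting is "wrong" if the campaign would have worked: we forgo $40M profit.
eol_reject = expected_opportunity_loss(1 - p_fail, 40_000_000)

print(f"EOL if approved: ${eol_approve:,.0f}")
print(f"EOL if rejected: ${eol_reject:,.0f}")
```

Approving has the much smaller EOL, so it is the better default choice, and the remaining EOL on that choice is what a measurement could reduce.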

The difference between EOL before and after a measurement is called the “expected value of information” (EVI).

In most cases, we want to compute the EVI for a range of values rather than a binary succeed/fail. So let’s tweak the advertising campaign example and say that a calibrated marketing expert’s 90% CI for sales resulting from the campaign was from 100,000 units to 1 million units. The risk is that we don’t sell enough units from this campaign to break even.

Suppose we profit by $25 per unit sold, so we’d have to sell at least 200,000 units from the campaign to break even (on a $5M campaign). To begin, let’s calculate the expected value of *perfect* information (EVPI), which will provide an upper bound on how much we should spend to reduce our uncertainty about how many units will be sold as a result of the campaign. Here’s how we compute it:

1. Slice the distribution of our variable into thousands of small segments.
2. Compute the EOL for each segment: the opportunity loss at the segment’s midpoint times the segment’s probability.
3. Sum the products from step 2 for all segments.

Of course, we’ll do this with a computer. For the details, see Hubbard’s book and the Value of Information spreadsheet from his website.

In this case, the EVPI turns out to be about $337,000. This means that we shouldn’t spend more than $337,000 to reduce our uncertainty about how many units will be sold as a result of the campaign.
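The slicing procedure can be sketched in Python as follows. This is a sketch under stated assumptions, not a reproduction of Hubbard's spreadsheet: I assume sales are normally distributed with the 90% CI given above, that the default decision is to approve the campaign, and that each unit short of break-even costs $25. The function name and the choice to start slicing six standard deviations below the mean are mine.

```python
from statistics import NormalDist

def evpi_below_threshold(low, high, threshold, loss_per_unit, segments=10_000):
    """Approximate EVPI when the default choice loses money below `threshold`.

    Assumes (low, high) is a 90% CI on a normal distribution. Note that a
    normal model assigns a little probability to negative sales; that is a
    limitation of the assumed distribution shape.
    """
    mean = (low + high) / 2
    sd = (high - low) / 3.29
    dist = NormalDist(mean, sd)
    start = mean - 6 * sd  # far enough into the tail to be negligible
    width = (threshold - start) / segments
    total = 0.0
    for i in range(segments):
        left = start + i * width
        right = left + width
        midpoint = (left + right) / 2
        probability = dist.cdf(right) - dist.cdf(left)
        opportunity_loss = (threshold - midpoint) * loss_per_unit
        total += opportunity_loss * probability
    return total

evpi = evpi_below_threshold(100_000, 1_000_000, 200_000, 25)
print(f"Approximate EVPI: ${evpi:,.0f}")
```

Only the part of the distribution below the break-even threshold contributes, because above it the default decision was the right one and the opportunity loss is zero. The result should be of the same order as the figure quoted above, though it need not match it exactly, since the number depends on the assumed distribution and on how the slicing is done.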

And in fact, we should probably spend much less than $337,000, because no measurement we make will give us *perfect* information. For more details on how to measure the value of *imperfect* information, see Hubbard’s book and these three LessWrong posts: (1) VoI: 8 Examples, (2) VoI: Four Examples, and (3) 5-second level case study: VoI.

I do, however, want to quote Hubbard’s comments about the “measurement inversion”:

By 1999, I had completed the… Applied Information Economics analysis on about 20 major [IT] investments… Each of these business cases had 40 to 80 variables, such as initial development costs, adoption rate, productivity improvement, revenue growth, and so on. For each of these business cases, I ran a macro in Excel that computed the information value for each variable… [and] I began to see this pattern:

- The vast majority of variables had an information value of zero…
- The variables that had high information values were routinely those that the client had never measured…
- The variables that clients [spent] the most time measuring were usually those with a very low (even zero) information value…

…since then, I’ve applied this same test to another 40 projects, and… [I’ve] noticed the same phenomena arise in projects relating to research and development, military logistics, the environment, venture capital, and facilities expansion.

Hubbard calls this the “Measurement Inversion”:

In a business case, the economic value of measuring a variable is usually inversely proportional to how much measurement attention it usually gets.

Here is one example:

A stark illustration of the Measurement Inversion for IT projects can be seen in a large UK-based insurance client of mine that was an avid user of a software complexity measurement method called “function points.” This method was popular in the 1980s and 1990s as a basis of estimating the effort for large software development efforts. This organization had done a very good job of tracking initial estimates, function point estimates, and actual effort expended for over 300 IT projects. The estimation required three or four full-time persons as “certified” function point counters…

But a very interesting pattern arose when I compared the function point estimates to the initial estimates provided by project managers… The costly, time-intensive function point counting did change the initial estimate but, on average, it was no closer to the actual project effort than the initial effort… Not only was this the single largest measurement effort in the IT organization, it literally added *no* value since it didn’t reduce uncertainty at all. Certainly, more emphasis on measuring the benefits of the proposed projects – or almost anything else – would have been better money spent.

Hence the importance of calculating EVI.

### Step 4: Apply the relevant measurement instrument(s) to the high-information-value variable

If you followed the first three steps, then you’ve defined a variable you want to measure in terms of the decision it affects and how you observe it, you’ve quantified your uncertainty about it, and you’ve calculated the value of gaining additional information about it. Now it’s time to reduce your uncertainty about the variable – that is, to measure it.

Each scientific discipline has its own specialized measurement methods. Hubbard’s book describes measurement methods that are often useful for reducing our uncertainty about the “softer” topics often encountered by decision-makers in business.

#### Selecting a measurement method

To figure out which category of measurement methods is appropriate for a particular case, we must ask several questions:

- Decomposition: Which parts of the thing are we uncertain about?
- Secondary research: How has the thing (or its parts) been measured by others?
- Observation: How do the identified observables lend themselves to measurement?
- Measure just enough: How much do we need to measure it?
- Consider the error: How might our observations be misleading?

##### Decomposition

Sometimes you’ll want to start by decomposing an uncertain variable into several parts to identify which observables you can most easily measure. For example, rather than directly estimating the cost of a large construction project, you could break it into parts and estimate the cost of each part of the project.

In Hubbard’s experience, decomposition itself – even without making any new measurements – often reduces one’s uncertainty about the variable of interest.

##### Secondary research

Don’t reinvent the world. In almost all cases, someone has already invented the measurement tool you need, and you just need to find it. Here are Hubbard’s tips on secondary research:

- If you’re new to a topic, start with Wikipedia rather than Google. Wikipedia will give you a more organized perspective on the topic at hand.
- Use search terms often associated with quantitative data. E.g. don’t just search for “software quality” or “customer perception” – add terms like “table,” “survey,” “control group,” and “standard deviation.”
- Think of internet research in two levels: general search engines and topic-specific repositories (e.g. the CIA World Factbook).
- Try multiple search engines.
- If you find marginally related research that doesn’t directly address your topic of interest, check the bibliography for more relevant reading material.

I’d also recommend my post Scholarship: How to Do It Efficiently.

##### Observation

If you’re not sure how to measure your target variable’s observables, ask these questions:

- Does it leave a trail? Example: longer waits on customer support lines cause customers to hang up and not call back. Maybe you can also find a correlation between customers who hang up after long waits and reduced sales to those customers.
- Can you observe it directly? Maybe you haven’t been tracking how many of the customers in your parking lot show an out-of-state license, but you could start. Or at least, you can observe a sample of these data.
- Can you create a way to observe it indirectly? Amazon.com added a gift-wrapping feature in part so they could better track how many books were being purchased as gifts. Another example is when consumers are given coupons so that retailers can see which newspapers their customers read.
- Can the thing be forced to occur under new conditions which allow you to observe it more easily? E.g. you could implement a proposed returned-items policy in some stores but not others and compare the outcomes.

##### Measure just enough

Because initial measurements often tell you quite a lot, and also change the value of continued measurement, Hubbard often aims for spending 10% of the EVPI on a measurement, and sometimes as little as 2% (especially for very large projects).

##### Consider the error

It’s important to be conscious of some common ways in which measurements can mislead.

Scientists distinguish two types of measurement error: systemic and random. Random errors are random variations from one observation to the next. They can’t be individually predicted, but they fall into patterns that can be accounted for with the laws of probability. Systemic errors, in contrast, are consistent. For example, the sales staff may routinely overestimate the next quarter’s revenue by 50% (on average).

We must also distinguish precision and accuracy. A “precise” measurement tool has low random error. E.g. if a bathroom scale gives the exact same displayed weight every time we set a particular book on it, then the scale has high precision. An “accurate” measurement tool has low systemic error. The bathroom scale, while precise, might be inaccurate if the weight displayed is systemically biased in one direction – say, eight pounds too heavy. A measurement tool can also have low precision but good accuracy, if it gives inconsistent measurements but they average to the true value.
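To make the distinction concrete, here is a small Python simulation of two hypothetical bathroom scales: one precise but biased eight pounds heavy, the other unbiased but noisy. The true weight, noise levels, and function name are invented for illustration.

```python
import random
from statistics import mean, stdev

def weigh(rng, true_weight, bias, noise_sd, readings=1000):
    """Simulate repeated readings: true weight + constant bias + random noise."""
    return [true_weight + bias + rng.gauss(0, noise_sd) for _ in range(readings)]

rng = random.Random(1)
TRUE_WEIGHT = 150.0  # hypothetical true weight in pounds

precise_but_biased = weigh(rng, TRUE_WEIGHT, bias=8.0, noise_sd=0.1)
accurate_but_noisy = weigh(rng, TRUE_WEIGHT, bias=0.0, noise_sd=5.0)

for name, data in [("precise but biased", precise_but_biased),
                   ("accurate but noisy", accurate_but_noisy)]:
    print(f"{name}: mean = {mean(data):.2f}, spread (sd) = {stdev(data):.2f}")
```

The first scale's readings cluster tightly around the wrong number; the second scale's readings scatter widely but average out near the true weight. Averaging more readings helps with the second problem and does nothing for the first.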

Random error tends to be easier to handle. Consider this example:

For example, to determine how much time sales reps spend in meetings with clients versus other administrative tasks, they might choose a complete review of all time sheets… [But] if a complete review of 5,000 time sheets… tells us that sales reps spend 34% of their time in direct communication with customers, we still don’t know how far from the truth it might be. Still, this “exact” number seems reassuring to many managers. Now, suppose a sample of direct observations of randomly chosen sales reps at random points in time finds that sales reps were in client meetings or on client phone calls only 13 out of 100 of those instances. (We can compute this without interrupting a meeting by asking as soon as the rep is available.) As we will see [later], in the latter case, we can statistically compute a 90% CI to be 7.5% to 18.5%. Even though this random sampling approach gives us only a range, we should prefer its findings to the census audit of time sheets. The census… gives us an exact number, but we have no way to know by how much and in which direction the time sheets err.
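The quote does not say how the interval was computed. One common method that lands on roughly the same range is the normal approximation for a sample proportion, sketched below; treating this as the method behind the quoted numbers is my assumption.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(hits, n, confidence=0.90):
    """Normal-approximation confidence interval for a population proportion."""
    p = hits / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.645 for 90%
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = proportion_ci(13, 100)
print(f"90% CI: {low:.1%} to {high:.1%}")
```

This approximation gets unreliable when the sample is small or the proportion is close to 0% or 100%, so it is a starting point rather than a general recipe.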

Systemic error is also called a “bias.” Based on his experience, Hubbard suspects the three most important to avoid are:

- Confirmation bias: people see what they want to see.
- Selection bias: your sample might not be representative of the group you’re trying to measure.
- Observer bias: the very act of observation can affect what you observe. E.g. in one study, researchers found that worker productivity improved no matter *what* they changed about the workplace. The workers seem to have been responding merely to the *fact* that they were being observed in *some* way.

##### Choose and design the measurement instrument

After following the above steps, Hubbard writes, “the measurement instrument should be almost completely formed in your mind.” But if you still can’t come up with a way to measure the target variable, here are some additional tips:

- *Work through the consequences*. If the value is surprisingly high, or surprisingly low, what would you expect to see?
- *Be iterative*. Start with just a few observations, and then recalculate the information value.
- *Consider multiple approaches*. Your first measurement tool may not work well. Try others.
- *What’s the really simple question that makes the rest of the measurement moot?* First see if you can detect *any* change in research quality before trying to measure it more comprehensively.

#### Sampling reality

In most cases, we’ll estimate the values in a population by measuring the values in a small sample from that population. And for reasons discussed in chapter 7, a very small sample can often offer large reductions in uncertainty.

There are a variety of tools we can use to build our estimates from small samples, and which one we should use often depends on how outliers are distributed in the population. In some cases, outliers are very close to the mean, and thus our estimate of the mean can converge quickly on the true mean as we look at new samples. In other cases, outliers can be several orders of magnitude away from the mean, and our estimate converges very slowly or not at all. Here are some examples:

- Very quick convergence, only 1–2 samples needed: cholesterol level of your blood, purity of public water supply, weight of jelly beans.
- Usually quick convergence, 5–30 samples needed: Percentage of customers who like the new product, failure loads of bricks, age of your customers, how many movies people see in a year.
- Potentially slow convergence: Software project cost overruns, factory downtime due to an accident.
- Maybe non-convergent: Market value of corporations, individual levels of income, casualties of wars, size of volcanic eruptions.

Below, I survey just a few of the many sampling methods Hubbard covers in his book.

##### Mathless estimation

When working with a quickly converging phenomenon and a symmetric distribution (uniform, normal, camel-back, or bow-tie) for the population, you can use the t-statistic to develop a 90% CI even when working with very small samples. (See the book for instructions.)

Or, even easier, make use of the *Rule of Five*: “There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.”

The Rule of Five has another advantage over the t-statistic: it works for any distribution of values in the population, including ones with slow convergence or no convergence at all! It can do this because it gives us a confidence interval for the *median* rather than the *mean*, and it’s the mean that is far more affected by outliers.

Hubbard calls this a “mathless” estimation technique because it doesn’t require us to take square roots or calculate standard deviation or anything like that. Moreover, this mathless technique extends beyond the Rule of Five: If we sample 8 items, there is a 99.2% chance that the median of the population falls within the largest and smallest values. If we take the *2nd* largest and smallest values (out of 8 total values), we get something close to a 90% CI for the median. Hubbard generalizes the tool with this handy reference table:

And if the distribution is symmetrical, then the mathless table gives us a 90% CI for the mean as well as for the median.
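The percentages quoted in this section follow from a short binomial argument: each random sample is equally likely to fall above or below the population median, so the median escapes the range only when too few samples land on one side. The sketch below computes that coverage for any sample size; it is my own illustration of the arithmetic, not a reproduction of Hubbard's table, and the function name is hypothetical.

```python
from math import comb

def median_coverage(n, k=1):
    """Chance that the population median lies between the k-th smallest and
    k-th largest values in a random sample of n (continuous population)."""
    # The median is missed on one side if fewer than k samples fall on that
    # side of it; each sample falls on a given side with probability 1/2.
    one_tail = sum(comb(n, i) for i in range(k)) / 2 ** n
    return 1 - 2 * one_tail

print(f"n=5, smallest/largest:          {median_coverage(5):.2%}")
print(f"n=8, smallest/largest:          {median_coverage(8):.2%}")
print(f"n=8, 2nd smallest/2nd largest:  {median_coverage(8, k=2):.2%}")
```

Because the argument only uses "above or below the median," it makes no assumption about the shape of the population distribution, which is why the technique still works for slowly converging quantities.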

##### Catch-recatch

How does a biologist measure the number of fish in a lake? She catches and tags a sample of fish – say, 1000 of them – and then releases them. After the fish have had time to spread amongst the rest of the population, she’ll catch another sample of fish. Suppose she caught 1000 fish again, and 50 of them were tagged. This would mean 5% of the fish were tagged, and thus that there were about 20,000 fish in the entire lake. (See Hubbard’s book for the details on how to calculate the 90% CI.)
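The point estimate in the fish example is a one-line proportion calculation. A minimal sketch follows; the function and parameter names are mine, and it gives only the point estimate, not the 90% CI.

```python
def estimate_population(tagged_first, caught_second, tagged_in_second):
    """Estimate population size from a tag-and-recapture sample.

    If the second catch is representative, the tagged share in it matches the
    tagged share of the whole population:
        tagged_in_second / caught_second = tagged_first / population
    """
    if tagged_in_second == 0:
        raise ValueError("No tagged animals recaptured; cannot estimate.")
    return tagged_first * caught_second / tagged_in_second

fish = estimate_population(tagged_first=1000, caught_second=1000, tagged_in_second=50)
print(f"Estimated fish in the lake: {fish:,.0f}")
```

The estimate leans on the assumption stated in the paragraph above: tagged fish must have mixed back into the population before the second catch.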

##### Spot sampling

The fish example was a special case of a common problem: population proportion sampling. Often, we want to know what proportion of a population has a particular trait. How many registered voters in California are Democrats? What percentage of your customers prefer a new product design over the old one?

Hubbard’s book discusses how to solve the general problem, but for now let’s just consider another special case: spot sampling.

In spot sampling, you take random snapshots of things rather than tracking them constantly. What proportion of their work hours do employees spend on Facebook? To answer this, you “randomly sample people through the day to see what they were doing *at that moment*. If you find that in 12 instances out of 100 random samples” employees were on Facebook, you can guess they spend about 12% of their time on Facebook (the 90% CI is 8% to 18%).

##### Clustered sampling

Hubbard writes:

“Clustered sampling” is defined as taking a random sample of groups, then conducting a census or a more concentrated sampling within the group. For example, if you want to see what share of households has satellite dishes… it might be cost effective to randomly choose several city blocks, then conduct a complete census of everything in a block. (Zigzagging across town to individually selected households would be time consuming.) In such cases, we can’t really consider the number of [households] in the groups… to be the number of random samples. Within a block, households may be very similar… [and therefore] it might be necessary to treat the effective number of random samples as the number of blocks…

##### Measure to the threshold

For many decisions, one decision is required if a value is above some threshold, and another decision is required if that value is below the threshold. For such decisions, you don’t care as much about a measurement that reduces uncertainty in general as you do about a measurement that tells you which decision to make based on the threshold. Hubbard gives an example:

Suppose you needed to measure the average amount of time spent by employees in meetings that could be conducted remotely… If a meeting is among staff members who communicate regularly and for a relatively routine topic, but someone has to travel to make the meeting, you probably can conduct it remotely. You start out with your calibrated estimate that the median employee spends between 3% to 15% traveling to meetings that could be conducted remotely. You determine that if this percentage is actually over 7%, you should make a significant investment in tele meetings. The [EVPI] calculation shows that it is worth no more than $15,000 to study this. According to our rule of thumb for measurement costs, we might try to spend about $1,500…

Let’s say you sampled 10 employees and… you find that only 1 spends less time in these activities than the 7% threshold. Given this information, what is the chance that the median time spent in such activities is actually below 7%, in which case the investment would not be justified? One “common sense” answer is 1/10, or 10%. Actually… the real chance is much smaller.

Hubbard shows how to derive the real chance in his book. The key point is that “the uncertainty about the threshold can fall much faster than the uncertainty about the quantity in general.”
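One way to see why the chance is so small is a back-of-the-envelope binomial check (this is not Hubbard's derivation). If the true median sat exactly at the 7% threshold, each randomly sampled employee would fall below it with probability 0.5, and a median below the threshold would make so lopsided a sample even less likely. The hypothetical helper `prob_at_most` computes how often 10 samples would then show at most 1 employee below the threshold.

```python
from math import comb


def prob_at_most(k, n, p=0.5):
    """P(at most k of n independent samples fall below the threshold)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


# 10 employees sampled, only 1 below the 7% threshold.
print(prob_at_most(1, 10))  # 11/1024, roughly 1%
```

This is the probability of the observation under one hypothesis, not yet the probability of the hypothesis; turning it into the chance that the median is below 7% also requires the prior, which is what the book's derivation handles.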

##### Regression modeling

What if you want to figure out the cause of something that has many possible causes? One method is to perform a *controlled experiment*, and compare the outcomes of a test group to a control group. Hubbard discusses this in his book (and yes, he’s a Bayesian, and a skeptic of p-value hypothesis testing). For this summary, I’ll instead mention another method for isolating causes: regression modeling. Hubbard explains:

If we use regression modeling with historical data, we may not need to conduct a controlled experiment. Perhaps, for example, it is difficult to tie an IT project to an increase in sales, but we might have lots of data about how something *else* affects sales, such as faster time to market of new products. If we know that faster time to market is possible by automating certain tasks, that this IT investment eliminates certain tasks, and those tasks are on the critical path in the time-to-market, we can make the connection.

Hubbard’s book explains the basics of linear regressions, and of course gives the caveat that correlation does not imply causation. But, he writes, “you should conclude that one thing causes another only if you have some *other* good reason besides the correlation itself to suspect a cause-and-effect relationship.”
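For a sense of what a simple linear regression computes, here is a minimal least-squares sketch. The data points are made up purely for illustration (hypothetical weeks of time-to-market against sales), and `fit_line` is a hypothetical helper, not something taken from the book.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up illustration data: weeks to market vs. first-year sales ($M).
weeks = [10, 14, 18, 22, 26]
sales = [9.1, 8.2, 7.4, 6.1, 5.3]
slope, intercept = fit_line(weeks, sales)
print(f"sales ≈ {slope:.3f} * weeks + {intercept:.2f}")
```

A negative slope here would only describe the correlation in the sample; as the caveat above says, reading it causally needs some independent reason.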

#### Bayes

Hubbard’s 10th chapter opens with a tutorial on Bayes’ Theorem. For an online tutorial, see here.

Hubbard then zooms out to a big-picture view of measurement, and recommends the “instinctive Bayesian approach”:

- Start with your calibrated estimate.
- Gather additional information (polling, reading other studies, etc.)
- Update your calibrated estimate subjectively, without doing any additional math.

Hubbard says a few things in support of this approach. First, he points to some studies (e.g. El-Gamal & Grether (1995)) showing that people often reason in roughly-Bayesian ways. Next, he says that in his experience, people become better intuitive Bayesians when they (1) are made aware of the base rate fallacy, and when they (2) are better calibrated.

Hubbard says that once these conditions are met,

[then] humans seem to be mostly logical when incorporating new information into their estimates along with the old information. This fact is extremely useful because a human can consider qualitative information that does not fit in standard statistics. For example, if you were giving a forecast for how a new policy might change “public image” – measured in part by a reduction in customer complaints, increased revenue, and the like – a calibrated expert should be able to update current knowledge with “qualitative” information about how the policy worked for other companies, feedback from focus groups, and similar details. Even with sampling information, the calibrated estimator – who has a Bayesian instinct – can consider qualitative information on samples that most textbooks don’t cover.

He also offers a chart (not reproduced here) showing how a pure Bayesian estimator compares to other estimators.

Also, Bayes’ Theorem allows us to perform a “Bayesian inversion”:

Given a particular observation, it may seem more obvious to frame a measurement by asking the question “What can I conclude from this observation?” or, in probabilistic terms, “What is the probability X is true, given my observation?” But Bayes showed us that we could, instead, start with the question, “What is the probability of this observation if X were true?”

The second form of the question is useful because the answer is often more straightforward and it leads to the answer to the other question. It also forces us to think about the likelihood of different observations given a particular hypothesis and what that means for interpreting an observation.

[For example] if, hypothetically, we know that only 20% of the population will continue to shop at our store, then we can determine the chance [that] exactly 15 out of 20 would say so… [The details are explained in the book.] Then we can invert the problem with Bayes’ theorem to compute the chance that only 20% of the population will continue to shop there given [that] 15 out of 20 said so in a random sample. We would find that chance to be very nearly zero…
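The two directions of the question can be sketched numerically. The chance of exactly 15 "yes" answers out of 20 at a true rate of 20% is a plain binomial probability. Inverting it needs a prior, and the sketch below assumes a uniform prior over a grid of candidate rates (0%, 1%, ..., 100%); that prior is an assumption made here for illustration, not the book's setup.

```python
from math import comb


def likelihood(k, n, p):
    """P(exactly k of n say yes | true rate p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


rates = [i / 100 for i in range(101)]      # candidate true rates
prior = [1 / len(rates)] * len(rates)      # assumed uniform prior
joint = [likelihood(15, 20, p) * w for p, w in zip(rates, prior)]
evidence = sum(joint)                      # P(15 of 20) averaged over the prior
posterior = [j / evidence for j in joint]  # Bayes' theorem on the grid

print(f"P(15 of 20 | rate = 20%)  = {likelihood(15, 20, 0.20):.2e}")
print(f"P(rate = 20% | 15 of 20)  = {posterior[20]:.2e}")
print(f"P(rate <= 20% | 15 of 20) = {sum(posterior[:21]):.2e}")
```

Under this assumed prior, both posterior figures come out very close to zero, in line with the quoted conclusion.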

#### Other methods

Other chapters discuss other measurement methods, for example prediction markets, Rasch models, methods for measuring preferences and happiness, methods for improving the subjective judgments of experts, and many others.

### Step 5: Make a decision and act on it

The last step will make more sense if we first “bring the pieces together.” Hubbard now organizes his consulting work with a firm into 3 phases, so let’s review what we’ve learned in the context of his 3 phases.

#### Phase 0: Project Preparation

- *Initial research*: Interviews and secondary research to get familiar with the nature of the decision problem.
- *Expert identification*: Usually 4–5 experts who provide estimates.

#### Phase 1: Decision Modeling

- *Decision problem definition*: Experts define the problem they’re trying to analyze.
- *Decision model detail*: Using an Excel spreadsheet, the AIE analyst elicits from the experts all the factors that matter for the decision being analyzed: costs and benefits, ROI, etc.
- *Initial calibrated estimates*: First, the experts undergo calibration training. Then, they fill in the values (as 90% CIs or other probability distributions) for the variables in the decision model.

#### Phase 2: Optimal measurements

- *Value of information analysis*: Using Excel macros, the AIE analyst runs a value of information analysis on every variable in the model.
- *Preliminary measurement method designs*: Focusing on the few variables with highest information value, the AIE analyst chooses measurement methods that should reduce uncertainty.
- *Measurement methods*: Decomposition, random sampling, Bayesian inversion, controlled experiments, and other methods are used (as appropriate) to reduce the uncertainty of the high-VoI variables.
- *Updated decision model*: The AIE analyst updates the decision model based on the results of the measurements.
- *Final value of information analysis*: The AIE analyst runs a VoI analysis on each variable again. As long as this analysis shows information value much greater than the cost of measurement for some variables, measurement and VoI analysis continues in multiple iterations. Usually, though, only one or two iterations are needed before the VoI analysis shows that no further measurements are justified.

#### Phase 3: Decision optimization and the final recommendation

- *Completed risk/return analysis*: A final MC simulation shows the likelihood of possible outcomes.
- *Identified metrics procedures*: Procedures are put in place to measure some variables (e.g. about project progress or external factors) continually.
- *Decision optimization*: The final business decision recommendation is made (this is rarely a simple “yes/no” answer).

#### Final thoughts

Hubbard’s book includes two case studies in which Hubbard describes how he led two fairly different clients (the EPA and U.S. Marine Corps) through each phase of the AIE process. Then, he closes the book with the following summary:

- If it’s really that important, it’s something you can define. If it’s something you think exists at all, it’s something you’ve already observed somehow.
- If it’s something important and something uncertain, you have a cost of being wrong and a chance of being wrong.
- You can quantify your current uncertainty with calibrated estimates.
- You can compute the value of additional information by knowing the “threshold” of the measurement where it begins to make a difference compared to your existing uncertainty.
- Once you know what it’s worth to measure something, you can put the measurement effort in context and decide on the effort it should take.
- Knowing just a few methods for random sampling, controlled experiments, or even merely improving on the judgments of experts can lead to a significant reduction in uncertainty.

## 50 comments

Comments sorted by top scores.

A measurement is an observation that quantitatively reduces uncertainty.

A measurement reduces expected uncertainty. Some particular measurement results increase uncertainty. E.g. you start out by assigning 90% probability that a binary variable landed heads and then you see evidence with a likelihood ratio of 1:9 favoring tails, sending your posterior to 50-50. However the *expectation* of the entropy of your probability distribution after seeing the evidence, is always evaluated to be lower than its current value in advance of seeing the evidence.
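The numbers in this example can be checked directly. The sketch below assumes one specific evidence model consistent with the stated 1:9 likelihood ratio: the observation E occurs with probability 0.1 if the coin landed heads and 0.9 if it landed tails. Those two likelihoods are assumptions chosen for illustration; only their ratio comes from the comment.

```python
from math import log2


def entropy(p):
    """Shannon entropy in bits of a binary variable with P(heads) = p."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))


prior = 0.9
p_e_heads, p_e_tails = 0.1, 0.9     # assumed likelihoods (ratio 1:9)

p_e = prior * p_e_heads + (1 - prior) * p_e_tails   # chance of seeing E
post_e = prior * p_e_heads / p_e                    # P(heads | E)
post_not_e = prior * (1 - p_e_heads) / (1 - p_e)    # P(heads | not E)

expected = p_e * entropy(post_e) + (1 - p_e) * entropy(post_not_e)
print(f"prior entropy:      {entropy(prior):.3f} bits")
print(f"entropy after E:    {entropy(post_e):.3f} bits")
print(f"expected posterior: {expected:.3f} bits")
```

Seeing E raises the entropy (the posterior moves to 50-50), yet the entropy expected before looking is below the prior entropy, because the more probable outcome, not-E, sharpens the estimate.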

Just FYI, I think Hubbard knows this and wrote "A measurement is an observation that quantitatively reduces uncertainty" because he was trying to simplify and avoid clunky sentences. E.g. on p. 146 he writes:

It is even possible for an additional sample to sometimes increase the size of the [confidence] interval... before the next sample makes it narrower again. But, on average, the increasing sample size will decrease the size of the [confidence] interval.

I'm reminded also of Russell's comment:

A book should have either intelligibility or correctness; to combine the two is impossible, but to lack both is to be unworthy.

The technical term for this is conditional entropy.

The conditional entropy will always be lower *unless* the evidence is independent of your hypothesis (in this case the conditional entropy will be equal to the prior entropy).

Is there a section on "How To Not Fool Yourself That You're Measuring X When You're Actually Measuring Z"?

This is a very important concern that I have too. I have not read the book, and it might be a very interesting read, but when it starts with:

No matter how “fuzzy” the measurement is, it’s still a measurement if it tells you more than you knew before.

It concerns me. Because business is already full of dubious metrics that actually do harm. For instance, in programming, source lines of code (SLOC) per month is one metric that is used to gauge 'programmer productivity', but has come under extreme and rightful skepticism.

Scientific methods are powerful when used properly, but a little knowledge can be a dangerous thing.

Yes he is all over this.

In the TQM world this comes under the heading "any metric used to reward people will become corrupt". W. Edwards Deming was writing about this issue in the 1960s or earlier. For this reason he advocated separating data collection used to run the business from data collection used to reward people. Too often, people decide this is "inefficient" and combine the two, with predictable results. Crime statistics in the US are one terrible example of this.

From my recollection of the book I think he would say that SLOC is not actually a terrible metric and can be quite useful. I personally use it myself on my own projects - but I have no incentive to game the system. If you start paying people for SLOC you are going to get a lot of SLOCs!

Because of the history, you need to go overboard to reassure people you will not use metrics against them. They are going to assume you will use them against them, until proven otherwise.

Not a dedicated section; this advice is scattered throughout the book. E.g. there's a section (p. 174-176) explaining why p-value hypothesis testing doesn't measure what the reader might think it measures (and thus Hubbard doesn't use p-value hypothesis testing).

Wow, this is really exciting. I thought at first, "Man, quantifying my progress on math research sounds really difficult. I don't know how to make it more than a measure of how happy I feel about what I've done."

But I'm only through step one of this post, and I've already pinned down the variables defining "progress on math research" such that measuring these periodically will almost certainly keep me directly on track toward the answer. I can probably make it better (suggestions welcome), but even this first pass saves me lots of grief. Wasted motion is probably my biggest problem with learning math right now, so this totally rocks. Check it out!

Progress is reduction of expected work remaining.

-Number of new things I understand (by proposition). -Change in degree to which I understand old things. -How central these things are to the problem. -Number of things (namely propositions, definitions, rules/instructions) I’ve written down that seem likely to be useful reference later. -Probability that they will be important. -Amount of material produced (in propositions or subsections of proof) that, if correct, will actually be part of my answer in the end. -Number of actions I’ve taken that will increase the values of the above variables. -Degree to which they’ll increase those values.

Thanks, Luke!

Progress is reduction of expected work remaining.

No it isn't. Those things are often correlated but not equivalent. New information can be gained that increases the expected work remaining despite additional valuable work having been done.

Progress is reduction of expected work remaining compared to your *revised* expectation of how much work remained yesterday.

That seems to be a better fit for the *impression of progress*. You wouldn't tend, in retrospect, to call it progress if you realised you'd been going in completely the wrong direction.

This would fit with progress simply being the reduction of work remaining.

Right. I think this is more an operationalization than a strict definition.

Yes, what RobbBB said.

|New information can be gained that increases the expected work remaining despite additional valuable work having been done.

That's progress.

|New information can be gained that increases the expected work remaining despite additional valuable work having been done.

That's progress.

Yes. That is the point.

- The variables that had high information values were routinely those that the client had never measured…
- The variables that clients [spent] the most time measuring were usually those with a very low (even zero) information value…

This seems very unlikely to be a coincidence. Any theories about what's going on?

We run into this all the time at my job.

My usual interpretation is that actual measurements with high information value can destabilize the existing system (e.g., by demonstrating that people aren't doing their jobs, or that existing strategies aren't working or are counterproductive), and are therefore dangerous. Low-information measurements are safer.

It's not that they're measuring the wrong variables, it's most likely that those organizations have already made the decisions based on variables they already measure. In the "Function Points" example, I would bet there were a few obvious learnings early on that spread throughout the organizations, and once the culture had changed any further effort didn't help at all.

Another example: I took statistics on how my friends played games that involved bidding, such as Liar's Poker. I found that they typically would bid too much. Therefore a measurement of how many times someone had the winning bid was a strong predictor of how they would perform in the game: people who bid high would typically lose.

Once I shared this information, behavior changed and people started using a much more rational bidding scheme. And the old measurement of "how often someone bid high" was no longer very predictive. It simply meant that they'd had more opportunities where bidding high made sense. Other variables such as "the player to your left" started becoming much more predictive.

One possibility is that there are a very large number of things they could measure, most of which have low information value. If they chose randomly we might expect to see an effect like this, and never notice all the low information possibilities they chose not to measure.

I'm not suggesting that they actually do choose randomly, but it might be they chose, say, the easiest to measure, and that these are neither systematically good or bad, so it looks similar to random in terms of the useful information.

In the many cases I've seen this, it's because (generally) the things being collected are the things that are easiest to collect. Often little thought was put into it, and sometimes these things were collected by accident. Generally, the things that are easiest to collect offer the least insight (if it's easy to collect, it's already part of your existing business process).

If there are generally decreasing returns to measurement of a single variable, I think this is more what we would expect to see. If you've already put effort into measurement of a given variable it will have lower information value on the margin. If you add in enough costs for switching measurements, then even the optimal strategy might spend a serious amount of time/effort pursuing lower value measurements.

Further, if they hadn't even thought of some measurements they couldn't have pursued them, so they wouldn't have suffered any declining returns.

I don't think this is the primary reason, but may contribute, especially in conjunction with reasons from sibling comments.

Great book, Luke. I'm looking forward to the reviews of your other favorites! :)

Thanks for the great summary of the book. I would like to create and publish a mindmap of the book and reuse some comments you made in this post.

Would it be OK for you as the author of the post if I will create a mindmap of the book with reference to your article and publish it? The link to this post on LessWrong will be included in the mindmap.

Best Regards,

Andrey K. imankey@gmail.com

So... how do you measure programmer productivity? ;)

What do you mean by "programmer productivity" and why do you care about it? What are you observing when you observe increased programmer productivity?

Implementation of desired things per unit time.

I haven't read the article so I could be full of shit, but essentially:

If you have the list of desired things ready, there should be an ETA on the work time necessary for each desired thing as well as confidence on that estimate. Confidence varies with past data and expected competence, e.g. how easily you believe you can debug the feature if you begin to draft it. Or such. Then you have a set of estimates for each implementable feature.

Then you put in time on that feature over the day tracked by some passive monitoring program like ManicTime or something like it.

The ratio of time spent on work that counted towards your features over the work that didn't is your productivity metric. As time goes on your confidence is calibrated in your feature-implementation work time estimates.

How do you measure desired things?

It's not very hard.

Just recently the task before me was to implement an object selection feature in an app I'm working on.

I implemented it. Now, the app lets the user select objects. Before, it didn't.

Prior to that, the task before me was to fix a data corruption bug in a different app.

I fixed it. Now, the data does not get corrupted when the user takes certain actions. Before, it did.

You see? Easy.

So, I agree that you accomplished these desired things. However, *before* you accomplished them, how accurately did you know how much time they would take, or how useful they would be?

For that matter, if someone told you, "That wasn't one desired thing I just implemented; it was three," is it possible to disagree?

(My point is that "desired thing" is not well-defined, and so "desired things per unit time" cannot be a measurement.)

However, before you accomplished them, how accurately did you know how much time they would take

I didn't. I never said I did.

or how useful they would be?

Uh, pretty accurately. Object selection is a critical feature; the entire functionality of the app depends on it. The usefulness of not having your data be corrupted is also obvious. I'm not really sure what you mean by asking whether I know in advance how useful a feature or bug fix will be. Of course I know. How could I not know? I always know.

For that matter, if someone told you, "That wasn't one desired thing I just implemented; it was three," is it possible to disagree?

Ah, now this is a different matter. Yes, "desired thing" is not a uniform unit of accomplishment. You have to compare to other desired things, and other people who implement them. You can also group desired things into classes (is this a bug fix or a feature addition? how big a feature? how much code must be written or modified to implement it? how many test cases must be run to isolate this bug?).

Yes, "desired thing" is not a uniform unit of accomplishment.

Right! So, "implementation of desired things per unit time" is not a measure of programmer productivity, since you can't really use it to compare the work of one programmer and another.

There are obvious cases, of course, where you *can* — here's someone who pounds out a reliable map-reduce framework in a weekend; there's someone who can't get a quicksort to compile. But if two *moderately successful* (and moderately reasonable) programmers *disagree* about their productivity, this candidate measurement doesn't help us resolve that disagreement.

Well, if your goal is comparing two programmers, then the most obvious thing to do is to give them both the same set of diverse tasks, and see how long they take (on each task and on the whole set).

If your goal is gauging the effectiveness of this or that approach (agile vs. waterfall? mandated code formatting style or no? single or pair programming? what compensation structure? etc.), then it's slightly less trivial, but you can use some "fuzzy" metrics: for instance, classify "desired things" into categories (feature, bug fix, compatibility fix, etc.), and measure *those* per unit time.

As for disagreeing whether something is one desired thing or three — well, like I said, you categorize. But also, it really won't be the case that one programmer says "I just implemented a feature", and another goes "A feature?! You just moved one parenthesis!", and a third goes "A feature?! You just wrote the entire application suite!".

Well, if your goal is comparing two programmers, then the most obvious thing to do is to give them both the same set of diverse tasks, and see how long they take (on each task and on the whole set).

That might work in an academic setting, but doesn't work in a real-life business setting where you're not going to tie up two programmers (or two teams, more likely) reimplementing the same stuff just to satisfy your curiosity.

And of course programming is diverse enough to encompass a wide variety of needs and skillsets. Say, programmer A is great at writing small self-contained useful libraries, programmer B has the ability to refactor a mess of spaghetti code into something that's clear and coherent, programmer C writes weird chunks of code that look strange but consume noticeably less resources, programmer D is a wizard at databases, programmer E is clueless about databases but really groks Windows GUI APIs, etc. etc. How are you going to compare their productivity?

That might work in an academic setting, but doesn't work in a real-life business setting where you're not going to tie up two programmers (or two teams, more likely) reimplementing the same stuff just to satisfy your curiosity.

Maybe that's one reason to have colleges that hand out computer science degrees? ;)

And of course programming is diverse enough to encompass a wide variety of needs and skillsets. Say, programmer A is great at writing small self-contained useful libraries, programmer B has the ability to refactor a mess of spaghetti code into something that's clear and coherent, programmer C writes weird chunks of code that look strange but consume noticeably less resources, programmer D is a wizard at databases, programmer E is clueless about databases but really groks Windows GUI APIs, etc. etc. How are you going to compare their productivity?

Very carefully.

In seriousness, the answer is that you wouldn't compare them. Comparing programmer productivity across problem domains like you describe is rarely all that useful.

You really only care about comparing programmer productivity within a domain, as well as comparing the same programmers' productivity across time.

How are you going to compare their productivity?

I'm going to look at the total desirability of what Adam does, at the total desirability of what Bob does...

And in the end I'm going to have to make difficult calls, like how desirable it is for us to have weird chunks of code that look strange but consume noticeably fewer resources.

Each of them is better at different things, so as a manager I need to take that into account; I wouldn't use a carpenter to paint while the painter is doing framing, but I *might* set things up so that painter helps with the framing and the carpenter assists with the painting. I certainly wouldn't spend a lot of time optimizing to hire only carpenters and tell them to build the entire house.

It's a two-step process, right? First, you measure how long a specific type of feature takes to implement; from a bunch of historic examples or something. Then, you measure how long a programmer (or all the programmers using a particular methodology or language, whatever you're measuring), take to implement a new feature of the same type.

Hubbard writes about performance measurement in chapter 11. He notes that management typically knows what the relevant performance metrics are. However, it has trouble prioritizing between them. Hubbard's proposal is to let the management create utility charts of the required trade-offs. For instance, one curve for a programmer could have on-time completion rate on one axis and error-free rate on the other (page 214). Thus the management is required to document how much one must increase to compensate for a drop in the other. The end product of the charts should be a single index for measuring employee performance.

**[deleted]**· 2013-09-22T15:48:44.111Z · score: 2 (2 votes) · LW · GW

The Applied Information Economics ideas are very reminiscent of decision tree algorithms. Would it be useful to try to extend the analogy and see if there's an extension of AIE that is like random forests?

Hanson's homo hypocritus idea may also be relevant. Perhaps, even subconsciously, people avoid measuring the dimensions or directions that will add a lot of info because they want to both (a) vociferously claim that they did measure stuff and the measures didn't help and (b) avoid any culpability for implementing changes they don't politically control, such as changes indicated by measuring very informative directions.

Just saying, a lot of people want to appear like they are productively exploring measures that yield changes and progress while tacitly sabotaging that very activity to retain political control over status quos.

Thanks, I liked this post.

However, I was initially a bit confused by the section on EVPI. I think it is important, but it could be a lot clearer.

The expected opportunity loss (EOL) for a choice is the probability of the choice being “wrong” times the cost of it being wrong. So for example the EOL if the campaign is approved is $5M × 40% = $2M, and the EOL if the campaign is rejected is $40M × 60% = $24M.

The difference between EOL before and after a measurement is called the “expected value of information” (EVI).

It seems quite unclear what's meant by "the difference between EOL before and after a measurement" (EOL of which option? is this in expectation?).

I think what must be intended is: your definition is for the EOL of an option. Now the EOL of a choice is the EOL of the option we choose given current beliefs. Then EVI is the expected reduction in EOL upon measurement.

Even this is more confusing than it often needs to be. At heart it's the expected amount better you'll do with the information. Sometimes you can factor out the EOL calculation entirely. For example say you're betting $10 at even odds on a biased coin. You currently think there's a 70% chance of it landing heads; more precisely you know it was either from a batch which lands heads 60% of the time, or from a batch which lands heads 80% of the time, but these are equiprobable. You could take a measurement to find out which batch it was from. Then you are certain that this measurement will change the EOL, but if you do it carefully the expected gain is equal to the expected loss, so there is no EVI. We could spot this directly because we know that whatever the answer is, we'll bet on heads.
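The coin example can be worked through exactly. The sketch below uses exact fractions, takes the $10 stake, the even odds, and the 60%/80% batches from the comment, and compares the expected winnings of the best bet before and after learning the batch. The helper `best_bet_value` is a hypothetical name.

```python
from fractions import Fraction

STAKE = 10  # dollars, even odds


def best_bet_value(p_heads):
    """Expected winnings from betting on the more favourable side."""
    bet_heads = STAKE * p_heads - STAKE * (1 - p_heads)
    return max(bet_heads, -bet_heads)  # betting tails has the opposite payoff


# P(heads) for each batch, mapped to the prior weight of that batch.
batches = {Fraction(6, 10): Fraction(1, 2), Fraction(8, 10): Fraction(1, 2)}

p_now = sum(p * w for p, w in batches.items())   # blended belief, 7/10
value_without = best_bet_value(p_now)
value_with = sum(w * best_bet_value(p) for p, w in batches.items())

print(f"expected value without measuring: ${float(value_without):.2f}")
print(f"expected value after measuring:   ${float(value_with):.2f}")
print(f"EVI: ${float(value_with - value_without):.2f}")
```

Both expected values are the same because the best bet is heads whichever batch the coin came from, so the measurement cannot change the decision.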

I think it might be useful to complete your simple example for EVPI (as in, this would have helped me to understand it faster, so may help others too): Currently you'll run the campaign, with EOL of $2M. With perfect information, you always choose the right option, so you expect the EOL to go down to 0. Hence the EVPI is $2M (this comes from the 40% of the time that the information stops you running the campaign and saves you $5M).
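The arithmetic of that simple example, as a minimal sketch using only the figures quoted from the post (a 40% chance of losing $5M if the campaign is approved, a 60% chance of forgoing $40M if it is rejected); the function name is hypothetical.

```python
def expected_opportunity_loss(p_wrong, cost_if_wrong):
    """EOL of an option: chance it is the wrong choice times the cost of that."""
    return p_wrong * cost_if_wrong


eol_approve = expected_opportunity_loss(0.40, 5_000_000)   # campaign flops
eol_reject = expected_opportunity_loss(0.60, 40_000_000)   # campaign would have worked

# The option with the smaller EOL is chosen. Perfect information drives the
# EOL to zero, so the EVPI equals the EOL of the option chosen without it.
evpi = min(eol_approve, eol_reject)

print(f"EOL if approved: ${eol_approve:,.0f}")
print(f"EOL if rejected: ${eol_reject:,.0f}")
print(f"EVPI:            ${evpi:,.0f}")
```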

Then in the section on the more advanced model:

In this case, the EVPI turns out to be about $337,000. This means that we shouldn’t spend more than $337,000 to reduce our uncertainty about how many units will be sold as a result of the campaign.

Does this figure come from the book? It doesn't come from the spreadsheet you linked to. By the way, there's a mistake in the spreadsheet: when it assumes a uniform distribution it uses different bounds for two different parts of the calculation.

I like the coin example. In my experience the situation with a clear choice is typical in small businesses. It often isn't worth honing the valuation models for projects very long when it is very improbable that the presumed second best choice would turn out to be the best.

I guess the author is used to working for bigger companies that do everything in larger scale and thus have generally more options to choose from. Nothing untrue in the chapter but this point could have been pointed out.

Before I embark on this seemingly Sisyphean endeavor, has anyone attempted to measure "philosophical progress"? It seems that no philosophical problem I know of is apparently fully solved, and no general methods are known which reliably give true answers to philosophical problems. Despite this we definitely have made progress: e.g. we can chart human progress on the problem of Induction, of which an *extremely* rough sketch looks like Epicurus --> Occam --> Hume --> Bayes --> Solomonoff, or something. I don't really know, but there seem to be issues with Solomonoff's formalization of Induction.

I'm thinking of "philosophy" as something like "pre-mathematics/progressing on confusing questions that no reliable methods exist yet to give truthy answers/forming a concept of something and formalizing it". Also it's not clear to me "philosophy" exists independent of the techniques it's spawned historically, but there are some problems for which the label of "philosophical problem" seems appropriate, e.g. "how do uncertainties work in a universe where infinite copies of you exist?" and like, all of moral philosophy, etc.

Thanks Luke, this is a great post. It seems like it applies in a very broad range of cases - except the one I'm most interested in, unfortunately, which is judging how bad it is to violate someone's rights. We regularly value a life in CBA calculations ($1-10 million in the US), but how bad is it to be murdered, *holding constant that you die?*

This cost should be internalised by the murderer, but many people seem ignorant of the cost, leading to an over-supply of murder. It'd be good to know how big the market failure is (so we can judge various preventative policies).

Obviously the same question applies to theft, oath-breaking, and any other rights violations you might think of.

Applying the same question to theft produces the result that if I steal your car and I get more utility out of having your car than you lose by not having it + the utility that you lose from psychological harm due to theft, insurance premiums rising, etc., I can internalize the cost and still come out ahead, so this sort of theft is not in oversupply.

Of course, we normally don't consider the fact that the criminal gains utility to be relevant. Saying that it's not a market failure if the criminal is willing to internalize the cost implies that we consider the gain in the criminal's utility to be relevant.

Does the book address the issue of stale data?

Most statistics assumes that the underlying process is stable: if you're sampling from a population, you're sampling from the *same population* every time. If you estimated some parameters of a model, the assumption is that these parameters will be applicable for the forecast period.

Unfortunately, in real life underlying processes tend to be unstable. For a trivial example of a known-to-not-be-stable process, consider weather. Let's say I live outside the tropics and I measure air temperature over, say, 60 days. Will my temperature estimates provide a good forecast for the next month? No, they won't, because the year has seasons and my "population" of days changes with time.
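To make the seasonal point concrete, here is a rough Python sketch. The sinusoidal temperature curve and all of its numbers are assumptions made up for illustration, not real weather data; the point is only that the average of the last 60 days can be a poor forecast for the following month.

```python
import math

def temperature(day):
    # Hypothetical seasonal curve in degrees C: annual mean 10, amplitude 12.
    return 10 + 12 * math.sin(2 * math.pi * (day - 110) / 365)

observed = [temperature(d) for d in range(60, 120)]   # 60 days of measurements
upcoming = [temperature(d) for d in range(120, 150)]  # the month we want to forecast

forecast = sum(observed) / len(observed)  # naive forecast: mean of the sample
actual = sum(upcoming) / len(upcoming)    # what the next month really averages
gap = actual - forecast

print(f"forecast {forecast:.1f}, actual {actual:.1f}, gap {gap:.1f}")
```

Even with zero measurement noise, the naive forecast is off by several degrees, because the sample and the forecast period come from different parts of the cycle.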

Or take an example from the book, catch-recatch. Imagine that a considerable period of time passed between the original "catch" and the "recatch". Does the estimation procedure still work? Well, not really -- you now need estimates of the mortality and birth rates; you need to know how your population changed between the first and the second measurements.
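A small sketch of how this bias could show up, using the standard two-sample (Lincoln-Petersen) estimate. The population size, survival rate and birth count below are hypothetical, and the sketch assumes tagged and untagged animals die at the same rate and are equally likely to be recaught.

```python
def lincoln_petersen(tagged, second_catch, tagged_in_second):
    """Two-sample estimate: N = tagged * second_catch / recaptured."""
    return tagged * second_catch / tagged_in_second

def expected_estimate(pop, tagged, second_catch, survival=1.0, births=0):
    """Expected estimate when the population changes between the two catches.

    Assumes (hypothetically) equal survival for tagged and untagged animals.
    Returns (estimate, actual population at the time of the second catch).
    """
    tagged_alive = tagged * survival
    pop_now = pop * survival + births
    expected_recaptures = second_catch * tagged_alive / pop_now
    return lincoln_petersen(tagged, second_catch, expected_recaptures), pop_now

# Stable population: the estimate matches the true size.
print(expected_estimate(20000, 1000, 1000))
# 30% mortality plus 6000 untagged births: same true size, inflated estimate.
print(expected_estimate(20000, 1000, 1000, survival=0.7, births=6000))
```

In the second case the tags are diluted by deaths and untagged newcomers, so the estimate overshoots unless mortality and births are modeled separately.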

I have difficulty understanding EVPI and linking it to the return rate, NPV, and risk of a project. I have input variables and initial guesses, and I have run a Monte Carlo simulation to determine NPV. Now I am a bit lost as to how to go ahead with EVPI and link it to NPV, risk, etc.
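One possible way to connect the two, sketched in Python: for a simple go/no-go decision where rejecting the project pays 0, EVPI can be computed directly from the Monte Carlo NPV samples as the expected payoff with perfect foresight minus the expected payoff of the best choice made now. The NPV model below (units, margin, fixed cost) is entirely hypothetical, and this is a sketch of the general expected-value-of-perfect-information idea rather than the book's exact spreadsheet procedure.

```python
import random

def evpi_go_no_go(npv_samples):
    """EVPI for a go/no-go choice where 'no-go' is assumed to pay 0."""
    n = len(npv_samples)
    expected_npv = sum(npv_samples) / n
    # Best we can do today: go if the expected NPV is positive, else don't.
    value_now = max(expected_npv, 0.0)
    # With perfect information we would go only in scenarios where NPV > 0.
    value_with_perfect_info = sum(max(v, 0.0) for v in npv_samples) / n
    return value_with_perfect_info - value_now

rng = random.Random(1)
# Hypothetical model: NPV = units sold * margin per unit - fixed cost.
samples = [rng.gauss(1000, 300) * rng.gauss(50, 10) - 45000 for _ in range(20000)]

print("mean NPV:", sum(samples) / len(samples))
print("P(NPV < 0):", sum(v < 0 for v in samples) / len(samples))
print("EVPI:", evpi_go_no_go(samples))
```

Under these assumptions, the risk shows up as the share of scenarios with negative NPV, and EVPI is an upper bound on what it would be worth paying to remove the uncertainty before deciding.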

Thanks! Lots of useful stuff in there...

=norminv(rand(), 15, (20-10)/3.29)

Where did **3.29** come from?
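A likely answer, assuming the 10 to 20 range in the formula is meant as a 90% confidence interval: for a normal distribution, 90% of the probability lies within about ±1.645 standard deviations of the mean, so the interval is about 3.29 standard deviations wide. A quick Python check of that arithmetic:

```python
from statistics import NormalDist

# z-score that leaves 5% in the upper tail, so that +/- z covers 90%.
z = NormalDist().inv_cdf(0.95)
sd_per_90ci = 2 * z  # width of a 90% CI, measured in standard deviations

lower, upper = 10, 20  # bounds taken from the quoted formula
sigma = (upper - lower) / sd_per_90ci  # standard deviation implied by the CI

print(round(sd_per_90ci, 2), round(sigma, 2))
```

Dividing the width of the interval by 3.29 therefore converts a 90% CI into the standard deviation that `norminv` expects as its third argument.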

In some cases, outliers are very close to the mean, and thus our estimate of the mean can converge quickly on the true mean as we look at new samples. In other cases, outliers can be several orders of magnitude away from the mean, and our estimate converges very slowly or not at all.

I think this passage confuses several different things. Let me try to untangle it.

First, all outliers, *by definition*, are rare and are "far away from the mean" (compared to the rest of the data points).

Second, whether your data points are "close" to the mean or "several orders of magnitude" away from the mean is a function of the width (or dispersion or variance or standard deviation or volatility) of the underlying distribution. The width affects how precise your mean estimate from a fixed-size sample will be, but it does not affect the *speed* of the convergence.

The speed of the convergence is a function of what your underlying distribution is. If it's normal (Gaussian), your mean estimate will converge at the same speed regardless of how high or low the variance of the distribution is. If it's, say, a Cauchy distribution then the mean estimate will never converge.
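Here is a small simulation sketch of that claim in Python. The sample sizes, trial count and seed are arbitrary choices, and the "typical error" is taken to be the median absolute sample mean across trials, with every distribution centered at 0.

```python
import math
import random
import statistics

def typical_error(draw, n, trials=200, seed=0):
    """Median |sample mean| over many trials; the true center is 0."""
    rng = random.Random(seed)
    errors = [abs(sum(draw(rng) for _ in range(n)) / n) for _ in range(trials)]
    return statistics.median(errors)

def normal(sigma):
    return lambda rng: rng.gauss(0.0, sigma)

def cauchy(rng):
    # Standard Cauchy draw via the inverse CDF: tan(pi * (U - 0.5)).
    return math.tan(math.pi * (rng.random() - 0.5))

results = {}
for n in (100, 1600):
    results[n] = (
        typical_error(normal(1.0), n),
        typical_error(normal(100.0), n),
        typical_error(cauchy, n),
    )
    print(n, results[n])
```

With 16 times more data, both normal columns should shrink by roughly a factor of 4 (the usual 1/sqrt(n) rate), and the wide normal is simply 100 times worse at every sample size. The Cauchy column should not shrink at all.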

Also, in small samples you generally don't expect to get any outliers. If you do, your small-sample estimate is likely to be way out of whack and actually misleading.