What is an appropriate sample size when surveying billions of data points?
post by Blake (blake-1) · 2024-08-23T21:54:21.386Z · LW · GW · 1 comment
This is a question post.
I am surveying web accessibility on the Internet. We're hoping to draw big conclusions about how accessible the Internet is to people with disabilities. What is the appropriate sample size we should draw in order to start making our grand conclusions?
I went down a Google/Perplexity rabbit hole trying to figure out an appropriate answer. I am left with more questions than when I started, so I turned to the LW community.
Answers
FWIW, here is the best survey of ideas on sample size I found: https://cadcommunity.pbworks.com/f/what%20sample%20size.pdf
1 comment
comment by gwern · 2024-08-23T22:39:22.075Z · LW(p) · GW(p)
You mention 'billions of data points', but you say your goal is to learn 'how accessible the Internet is to people with disabilities', a question for which the sample size should be more like hundreds to thousands. Given that mismatch, you may need to think seriously about what the purpose of your survey is and how it will be used. Planning sample size is the least of your problems.
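To make the "hundreds to thousands" figure concrete, here is a minimal sketch of the textbook sample-size formula for estimating a proportion, with a finite population correction. The 95% confidence level (z = 1.96), the worst-case p = 0.5, and the margins of error are illustrative assumptions, not numbers from the question, and the formula only covers sampling error under simple random sampling.

```python
import math

def sample_size(margin, population=None, p=0.5, z=1.96):
    """Sample size needed to estimate a proportion within +/- margin.

    z=1.96 corresponds to roughly 95% confidence and p=0.5 is the
    worst case (largest variance). Both defaults are assumptions
    chosen for illustration.
    """
    # Required size if the population were infinite.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is None:
        return math.ceil(n0)
    # Finite population correction: shrinks n0 when the population is small.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(0.05))                      # infinite population
print(sample_size(0.05, population=10**9))    # a billion pages
print(sample_size(0.05, population=10_000))   # a small population
print(sample_size(0.03))                      # tighter margin
```

The required size depends on the margin of error you want, and barely at all on whether the population is ten thousand or a billion.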
It sounds like you think you can just take some dataset like Common Crawl, crunch numbers about 'the top million domains', and come up with a conclusion like 'X% of the Internet is unusable', so that all you need to know is how many domains to analyze before you turn the crank and see what pops out with p < 0.05. But that's not the case. For datasets like this, you will find many parameters to be "statistically significant" because you are doing near-population-level analysis: your sampling error is tiny, and essentially all of your error will be the (unknown and usually impossible to measure) systematic error & bias, which doesn't go away with more data (although Meng 2014 is an interesting discussion of how much systematic error goes away when you are sampling a large fraction of the entire population). At scale, all your results may tell you is something about the many serious flaws and biases in these sorts of Internet datasets. They may be all we have, but one shouldn't fool oneself into thinking that they are any good. (As Cohen put it, a burning desire for an answer doesn't mean that a given dataset or survey methodology will be able to provide it.)
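A small simulation can illustrate the point about sampling error versus systematic error. Everything in it is hypothetical: the population of one million sites, the split into "popular" and "long tail" sites, and the 30% and 70% failure rates are made-up numbers chosen only to show the effect, and the "top domains" frame is a stand-in for a biased dataset, not a model of Common Crawl.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = site fails an accessibility check, 0 = passes.
# Popular sites (10% of this toy web) fail 30% of the time, the long tail 70%.
N_POPULAR, N_TAIL = 100_000, 900_000
popular = [1 if random.random() < 0.30 else 0 for _ in range(N_POPULAR)]
tail = [1 if random.random() < 0.70 else 0 for _ in range(N_TAIL)]
true_rate = (sum(popular) + sum(tail)) / (N_POPULAR + N_TAIL)

def estimate(sample, z=1.96):
    """Return (proportion, half-width of a ~95% normal-approximation CI)."""
    p = sum(sample) / len(sample)
    return p, z * math.sqrt(p * (1 - p) / len(sample))

# Biased frame: a "top domains" list that only ever contains popular sites.
biased_p, biased_hw = estimate(random.sample(popular, 50_000))

# Small simple random sample drawn from the whole population.
srs_p, srs_hw = estimate(random.sample(popular + tail, 1_000))

print(f"true rate        : {true_rate:.3f}")
print(f"biased, n=50,000 : {biased_p:.3f} +/- {biased_hw:.3f}")
print(f"random, n=1,000  : {srs_p:.3f} +/- {srs_hw:.3f}")
```

In this toy setup the large biased sample reports a very narrow interval around the wrong number, while the much smaller random sample has a wider interval that sits near the true rate. Growing the biased sample only tightens the interval around the wrong answer.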