How much is a representative sample?

When you check whether your soup has enough salt, provided that it’s sufficiently stirred, you don’t need to taste the whole bowl – you assume that the small amount you try contains approximately the same proportion of salt molecules as does the rest of the soup.

In a similar vein, let’s imagine that a large website wishes to determine what proportion $\theta$ of its users are bots. It samples 300 of them at random and finds that 15 of them are bots. How sure should we be of this estimate of about 5% for $\theta$?

Sampling with replacement (or from a very large population)

Sampling with replacement is akin to coin flipping, i.e. in this setup, our problem is equivalent to flipping a coin 300 times, obtaining 15 “heads” and determining the likely bias $\theta$ of the coin (with $\theta = 50\%$ being a fair coin, $\theta = 0\%$ being all “tails”, and $\theta = 100\%$ being all “heads”).

We will solve this using Bayes’ theorem (but as you’ll see in a minute, we won’t need to carry out all the calculations ourselves). Let’s denote: $D = \text{“15 ‘heads’ out of 300 flips”}$

We then have:

\begin{equation*} \overbrace{p(\theta \mid D)}^{\substack{\text{posterior} \\ \text{probability} \\ \text{density}}} = \frac{\overbrace{p(D \mid \theta)}^\text{likelihood}}{\underbrace{p(D)}_{\substack{\text{marginal} \\ \text{probability}}}} \cdot \overbrace{p(\theta)}^{\substack{\text{prior} \\ \text{probability} \\ \text{density}}} \end{equation*}

Where the likelihood of our outcome $D$ is:

\begin{equation*} p(D \mid \theta) = {300 \choose 15} \cdot \theta^{15} \cdot (1 - \theta)^{300 - 15} \end{equation*}

Which is the probability mass function of a binomial distribution with parameters $(n = 300, p = \theta)$, evaluated at $k = 15$.
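To get a feel for this likelihood, here is a minimal sketch that evaluates it for a few candidate values of $\theta$. The function name `binomial_likelihood` and the choice of candidate values are illustrative assumptions, not part of the original analysis.

```python
from math import comb

def binomial_likelihood(theta: float, n: int = 300, k: int = 15) -> float:
    """p(D | theta): probability of k successes in n independent trials."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Arbitrary candidate values, chosen only for illustration.
for theta in (0.01, 0.03, 0.05, 0.07, 0.10):
    print(f"theta = {theta:.2f}  ->  p(D | theta) = {binomial_likelihood(theta):.3e}")
```

The likelihood is largest near the observed proportion of 5% and falls off on either side; on its own, though, it is not yet a probability distribution over $\theta$.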

Conjugate priors

It just so happens that when the likelihood and prior have certain forms, the posterior distribution can be expressed in terms of the prior distribution with changed parameters. (See Wikipedia’s table.)

One such case is when the likelihood is binomial (as it is for us) with parameters $n$ (number of trials) and $k$ (number of “successes”) and the prior is a beta distribution with parameters $\alpha$ and $\beta$. The posterior is then also a beta distribution, with parameters $\alpha + k$ and $\beta + (n - k)$.
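The conjugacy claim can be checked numerically. The sketch below applies Bayes’ theorem on a grid (likelihood times prior, then normalisation) and compares the result with the closed-form beta density. The Beta(2, 5) prior, the grid size and the function names are arbitrary assumptions made for illustration only.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at 0 < x < 1, computed via log-gamma for stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

def grid_posterior(a, b, n, k, steps=20_000):
    """Normalise likelihood * prior on a midpoint grid over (0, 1)."""
    xs = [(i + 0.5) / steps for i in range(steps)]
    # The binomial coefficient does not depend on theta, so it cancels
    # when we normalise and can be left out.
    unnorm = [exp(k * log(x) + (n - k) * log(1 - x)) * beta_pdf(x, a, b) for x in xs]
    total = sum(unnorm) / steps  # midpoint-rule estimate of the normalising constant
    return xs, [u / total for u in unnorm]

a, b, n, k = 2.0, 5.0, 300, 15  # Beta(2, 5) is an arbitrary illustrative prior
xs, numeric = grid_posterior(a, b, n, k)
max_gap = max(abs(p - beta_pdf(x, a + k, b + (n - k))) for x, p in zip(xs, numeric))
print(f"largest gap to Beta({a + k}, {b + (n - k)}): {max_gap:.2e}")
```

The gap should be at the level of numerical integration error, which is what lets us skip the integral in $p(D)$ altogether.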

If our prior probability distribution for $\theta$ is that all possible values are equally likely, we can express that as a uniform distribution between 0 and 1, but we can also, more conveniently for our purposes, express it as a beta distribution $\operatorname{Beta}(1, 1)$.

Our posterior distribution, then, is $\operatorname{Beta}(1 + 15, 1 + (300 - 15))$, i.e. $\operatorname{Beta}(16, 286)$, which looks like this:

Plot of the posterior probability distribution for the proportion of bots, showing a moderately sharp peak around 5%

That is, with a uniform prior on $\theta$ (which may or may not be reasonable depending on what is actually known), observing 15 “heads” out of 300 flips (or 15 bots out of 300 users) should make us 95% sure that $\theta$ is between 2.9% and 7.9%.
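An interval like this can be approximated from the $\operatorname{Beta}(16, 286)$ density with a simple grid. The text does not say which kind of 95% interval it uses, so this sketch computes both an equal-tailed interval and a highest-density interval; the grid resolution and function names are assumptions.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at 0 < x < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

def beta_intervals(a, b, mass=0.95, steps=100_000):
    """Equal-tailed and highest-density intervals from a midpoint grid."""
    h = 1.0 / steps
    xs = [(i + 0.5) * h for i in range(steps)]
    ws = [beta_pdf(x, a, b) * h for x in xs]  # approximate probability per cell

    # Equal-tailed: cut (1 - mass) / 2 of the probability from each side.
    tail, acc, lo, hi = (1 - mass) / 2, 0.0, None, None
    for x, w in zip(xs, ws):
        acc += w
        if lo is None and acc >= tail:
            lo = x
        if hi is None and acc >= 1 - tail:
            hi = x

    # Highest density: collect cells from most to least probable
    # until they cover the requested mass.
    acc, chosen = 0.0, []
    for w, x in sorted(zip(ws, xs), reverse=True):
        acc += w
        chosen.append(x)
        if acc >= mass:
            break
    return (lo, hi), (min(chosen), max(chosen))

equal_tailed, hdi = beta_intervals(16, 286)
print(f"equal-tailed:    [{equal_tailed[0]:.1%}, {equal_tailed[1]:.1%}]")
print(f"highest density: [{hdi[0]:.1%}, {hdi[1]:.1%}]")
```

Because the posterior is skewed to the right, the two intervals differ slightly, with the highest-density one sitting a little lower and being a little narrower.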

Sampling without replacement

If we sample without replacement from a finite population, intuition suggests that we should be even more sure about our result – indeed, in the limit where we have sampled every single element, we are fully certain about it. Even before that point, having sampled 9 out of all 10 users of a website (it’s not a very big website) and found 8 of them to be real, the proportion of real users can only be either 80% (if the last one is not real) or 90% (if it is, which seems likely in light of the previous outcomes). Let’s quantify this.

Let’s denote the population size as $N$, and the (unknown) total number of “successes” in it as $K$. We sample $n$ members without replacement, and find $k$ successes among them.

The likelihood for this outcome is described by the so-called “hypergeometric” distribution, with parameters $(N, K, n)$, evaluated at $k$.

The corresponding conjugate result is as follows: if our prior for $K$ is a beta-binomial distribution with parameters $(N, \alpha, \beta)$, i.e. a binomial distribution with $N$ trials in which the probability of success is itself drawn from a beta distribution with parameters $(\alpha, \beta)$, then, having sampled $n$ members and observed $k$ successes, the posterior distribution for the number of successes among the $(N - n)$ not-yet-sampled members is a beta-binomial distribution with parameters $(N - n, \alpha + k, \beta + (n - k))$. If we want the total number of successes $K$, we then need to add the $k$ successes that we observed. Finally, if we want the proportion of successes, and not their absolute number, we can divide that by $N$.
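This result can be checked by brute force: apply Bayes’ theorem directly over every possible $K$, using the hypergeometric likelihood and a beta-binomial prior, and compare with the closed form. The sketch below does so for the 10-user example and for a 300-out-of-500 sample, with a uniform prior ($\alpha = \beta = 1$) assumed in both cases; all function names are illustrative.

```python
from math import comb, exp, lgamma

def log_beta_fn(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(x, m, a, b):
    """P(X = x) for X ~ BetaBin(m, a, b)."""
    if not 0 <= x <= m:
        return 0.0
    return comb(m, x) * exp(log_beta_fn(x + a, m - x + b) - log_beta_fn(a, b))

def hypergeom_pmf(k, N, K, n):
    """P(k successes in n draws without replacement | K successes among N)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def brute_force_posterior(N, n, k, a, b):
    """Posterior over the total K by direct application of Bayes' theorem."""
    unnorm = [hypergeom_pmf(k, N, K, n) * betabinom_pmf(K, N, a, b) for K in range(N + 1)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def closed_form_posterior(N, n, k, a, b):
    """k + BetaBin(N - n, a + k, b + (n - k)), tabulated over K = 0..N."""
    return [betabinom_pmf(K - k, N - n, a + k, b + (n - k)) for K in range(N + 1)]

for N, n, k in [(10, 9, 8), (500, 300, 15)]:
    brute = brute_force_posterior(N, n, k, 1, 1)
    closed = closed_form_posterior(N, n, k, 1, 1)
    gap = max(abs(p - q) for p, q in zip(brute, closed))
    print(f"N={N}, n={n}, k={k}: largest gap = {gap:.2e}")
```

In the 10-user case, under this uniform prior, the posterior puts probability 9/11 on the last user being real and 2/11 on it being a bot.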

So, let’s repeat our original scenario, in which we sample 300 users and find that 15 of them are bots, but now assume that the sampling took place without replacement out of 500 total users, and that a priori we considered all possible numbers of bots equally plausible ($\operatorname{BetaBin}(500, 1, 1)$). We then find that the plausible total number of bots out of 500 is described by $15 + \operatorname{BetaBin}(500 - 300, 1 + 15, 1 + (300 - 15))$, i.e. $15 + \operatorname{BetaBin}(200, 16, 286)$, which looks like this:

Plot of the posterior probability distribution for the absolute number of bots, showing a moderate peak around 25

(Strictly speaking, the probability is non-zero up to and including 215, because it is theoretically possible that all remaining users are bots, but it quickly becomes vanishingly implausible given the previous data and the prior information.)
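The probability mass function behind that plot can be tabulated directly. This is a minimal sketch using only the standard library; `betabinom_pmf` is a hand-written helper, not a library function.

```python
from math import comb, exp, lgamma

def log_beta_fn(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(x, m, a, b):
    """P(X = x) for X ~ BetaBin(m, a, b)."""
    if not 0 <= x <= m:
        return 0.0
    return comb(m, x) * exp(log_beta_fn(x + a, m - x + b) - log_beta_fn(a, b))

N, n, k = 500, 300, 15
# Posterior over the total number of bots K = k + (bots among the unsampled users).
posterior = {k + x: betabinom_pmf(x, N - n, 1 + k, 1 + (n - k)) for x in range(N - n + 1)}

mean = sum(K * p for K, p in posterior.items())
mode = max(posterior, key=posterior.get)
print(f"posterior mean of K: {mean:.2f}")
print(f"most probable K:     {mode}")
print(f"P(K = 215):          {posterior[215]:.3e}")
```

The value at $K = 215$ is strictly positive but astronomically small, which is the sense in which the right tail is "vanishingly implausible".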

If we then divide this by 500, the total number of users, we find the probability mass function for the various discrete values that the proportion could take:

Plot of the posterior probability distribution for the proportion, showing it to be narrower than in the case of sampling with replacement

Superimposed is the shape of the (continuous) probability distribution that we obtained when we assumed sampling with replacement or from a huge population.

In this case, knowing that we were sampling without replacement from a population of 500 has reduced the uncertainty in our estimate of $\theta$, with 95% of the probability mass now being encompassed by just $[3.6\%, 6.6\%]$ instead of requiring $[2.9\%, 7.9\%]$.

As we increase $N$ (the size of the population), the uncertainty increases and the estimate approaches that obtained using sampling with replacement. Here is $N = 1000$:

Plot of the posterior probability distribution for the proportion if the population size is 1000

And here is 5000:

Plot of the posterior probability distribution for the proportion if the population size is 5000; the distribution starts to approach the continuous one

As can be seen, for a fixed number of samples, even though the uncertainty on the absolute number of successes in the whole population grows unboundedly as the number of unsampled members increases, the uncertainty on the relative proportion doesn’t; it hits a limit.
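This limit can be made concrete with the closed-form variance of the beta-binomial distribution. The sketch below compares the posterior standard deviation of the count $K$ and of the proportion $K / N$ for growing $N$, against the standard deviation of the $\operatorname{Beta}(16, 286)$ posterior; the population sizes beyond 5000 and the function names are illustrative assumptions.

```python
from math import sqrt

def posterior_sds(N, n=300, k=15, a=1.0, b=1.0):
    """Posterior sd of the total count K and of the proportion K / N."""
    m = N - n                           # members not yet sampled
    a_post, b_post = a + k, b + (n - k)
    s = a_post + b_post
    # Variance of BetaBin(m, a_post, b_post); adding the constant k does not change it.
    var_count = m * a_post * b_post * (s + m) / (s**2 * (s + 1))
    return sqrt(var_count), sqrt(var_count) / N

def beta_sd(n=300, k=15, a=1.0, b=1.0):
    """sd of the Beta posterior obtained when sampling with replacement."""
    a_post, b_post = a + k, b + (n - k)
    s = a_post + b_post
    return sqrt(a_post * b_post / (s**2 * (s + 1)))

limit = beta_sd()
for N in (500, 1000, 5000, 50_000, 5_000_000):
    count_sd, prop_sd = posterior_sds(N)
    print(f"N = {N:>9,}: sd(K) = {count_sd:10.2f}   sd(K/N) = {prop_sd:.5f}   limit = {limit:.5f}")
```

The standard deviation of the count keeps growing with $N$, while that of the proportion increases towards the with-replacement value and never exceeds it.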

Sampling bias

The previous discussion assumes the absence of sampling bias, i.e. it assumes that a “success” and a “failure” have equal chances of being considered for sampling. (This is not the same as saying that we are equally likely to sample either.)

Sampling bias, or selection bias, occurs when this assumption is not true, either from direct causation (e.g. someone with an unsatisfactory experience may be more likely to go out of their way to write a review) or from other forms of correlation. (This is what happens when the soup mentioned at the start of the post is not sufficiently stirred.)

To some extent, we can try to account for this in our estimates. Let’s look at the case of sampling with replacement / from a huge population. Suppose that each sample that we could potentially get has an unknown probability $\gamma_\text{success}$ of making it to the sampling process if it’s a success, or $\gamma_\text{failure}$ if it’s a failure.

That means that the final probability for a sample we do see to be a success is:

$$\frac{\gamma_\text{success} \cdot \theta}{\gamma_\text{success} \cdot \theta + \gamma_\text{failure} \cdot (1 - \theta)}$$

Or, if we denote the ratio $\frac{\gamma_\text{success}}{\gamma_\text{failure}}$ as $r_\gamma$:

$$\frac{\theta}{\theta + \frac{1 - \theta}{r_\gamma}}$$
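As a small sanity check, both forms of this expression can be written as functions and compared. The function names and the example values of $\gamma_\text{success}$ and $\gamma_\text{failure}$ below are made up for illustration.

```python
def observed_success_probability(theta, gamma_success, gamma_failure):
    """Probability that a sample we actually see is a success (first form)."""
    seen_success = gamma_success * theta
    seen_failure = gamma_failure * (1 - theta)
    return seen_success / (seen_success + seen_failure)

def observed_success_probability_from_ratio(theta, r_gamma):
    """Same quantity, written in terms of the ratio r_gamma only (second form)."""
    return theta / (theta + (1 - theta) / r_gamma)

theta = 0.05
# Hypothetical selection probabilities; only their ratio matters.
for gamma_success, gamma_failure in [(0.5, 0.5), (0.8, 0.4), (0.2, 0.6)]:
    r_gamma = gamma_success / gamma_failure
    p1 = observed_success_probability(theta, gamma_success, gamma_failure)
    p2 = observed_success_probability_from_ratio(theta, r_gamma)
    print(f"r_gamma = {r_gamma:.3f}: observed success probability = {p1:.4f} (ratio form: {p2:.4f})")
```

Only the ratio of the two selection probabilities enters the result, which is why a single parameter $r_\gamma$ is enough to describe the bias.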

If this ratio is known to be 1, then this “transformed” probability of success is simply $\theta$. If, on the other hand, we want to express “complete ignorance” of $r_\gamma$, we can use Jeffreys’ prior for scale parameters (that is, the improper prior $p(r_\gamma) \propto \frac{1}{r_\gamma}$, or equivalently, a flat prior on $\log r_\gamma$) and then conduct inference with a PyMC program like the following:

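import pymc as pm

with pm.Model() as model:
    # Improper flat prior on log(rγ), i.e. Jeffreys' prior p(rγ) ∝ 1/rγ
    log_rγ = pm.Flat('log_rγ')
    rγ = pm.Deterministic('rγ', pm.math.exp(log_rγ))

    # Uniform prior on the true proportion θ
    θ = pm.Beta('θ', 1, 1)
    # Probability that a sample we actually see is a success
    biased_θ = pm.Deterministic('biased_θ', θ / (θ + (1 - θ) / rγ))

    # Likelihood: 15 successes observed among 300 samples
    pm.Binomial('observed', p=biased_θ, n=300, observed=15)

    # Draw posterior samples using the NumPyro NUTS backend
    trace = pm.sample(100_000, nuts_sampler='numpyro', cores=16)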

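We find that the posterior probability distribution for $\theta$ is the same as its prior. In other words, we have learned nothing about it, which makes sense: with $r_\gamma$ completely unconstrained, any observed proportion can be explained by the bias alone, whatever the value of $\theta$.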

Having confirmed that the model seems to behave as expected in those extreme cases, we can experiment with intermediate cases. For example, let’s see what happens if we consider it 95% probable that $r_\gamma$ deviates from a “fair” ratio of 1 by no more than a factor of 1.1 (about 10%) in either direction. We will express this by a Gaussian prior on $\log_{10}(r_\gamma)$ with mean 0 and standard deviation $\frac{\log_{10}(1.1)}{\sqrt 2 \operatorname{erf}^{-1}(0.95)} \simeq 0.0211191$.
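The standard deviation quoted above can be reproduced with the standard library, using the fact that $\sqrt 2 \operatorname{erf}^{-1}(0.95)$ is the 97.5th percentile of the standard normal distribution. The variable names are illustrative.

```python
from math import log10
from statistics import NormalDist

# sqrt(2) * erfinv(0.95) equals the 97.5% quantile of the standard normal.
z = NormalDist().inv_cdf(0.975)
sigma = log10(1.1) / z
print(f"sigma = {sigma:.7f}")

# Check: under N(0, sigma), log10(r_gamma) lies within +/- log10(1.1)
# with the intended probability.
prior = NormalDist(mu=0.0, sigma=sigma)
coverage = prior.cdf(log10(1.1)) - prior.cdf(-log10(1.1))
print(f"P(1/1.1 <= r_gamma <= 1.1) = {coverage:.4f}")
```

In the PyMC model, this would presumably mean replacing the flat prior with a normal prior on $\log_{10}(r_\gamma)$ with this standard deviation, although the exact code used for the plots is not shown.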

Here is what we get if we observe 15 successes out of 300 samples:

Posterior probability distribution for the biased and true proportions with 300 samples

In this case, our estimate of the “true” $\theta$ (left) is roughly as uncertain as our estimate of the biased proportion (right).

If we now observe 250 successes out of 5000 (still 5%):

Posterior probability distribution for the biased and true proportions with 5000 samples

Our estimate of the biased $\theta$ is becoming more and more certain, and our estimate of the true $\theta$ is still only slightly less so. But now, with 500 000 samples:

Posterior probability distribution for the biased and true proportions with 500000 samples

We begin to see that while we can be quite sure of the value of the biased proportion, our “true” estimate is starting to hit a ceiling – our 95% credible interval still leaves about 10% of relative uncertainty about the true value.

It therefore seems that this is how uncertainty about the amount of sampling bias manifests itself: as a floor on the uncertainty about the true $\theta$ that additional samples cannot remove.
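A rough way to see where this ceiling comes from is to take the idealised limit of infinitely many samples, in which the biased proportion is known exactly, and invert the bias formula for $\theta$. The sketch below assumes a biased proportion of exactly 5% and lets $r_\gamma$ range over the 95% prior interval $[1/1.1, 1.1]$; it is a back-of-the-envelope bound, not a replacement for the full posterior.

```python
def true_theta(biased, r_gamma):
    """Invert biased = theta / (theta + (1 - theta) / r_gamma) for theta."""
    return biased / (biased + r_gamma * (1 - biased))

def biased_theta(theta, r_gamma):
    """Forward direction, used here only to check the inversion."""
    return theta / (theta + (1 - theta) / r_gamma)

biased = 0.05  # assumed to be known exactly (infinite-data limit)
low = true_theta(biased, 1.1)       # successes over-represented by a factor of 1.1
high = true_theta(biased, 1 / 1.1)  # successes under-represented by a factor of 1.1

print(f"theta between {low:.4%} and {high:.4%}")
print(f"relative to 5%: {low / biased - 1:+.1%} to {high / biased - 1:+.1%}")
```

Even with the biased proportion pinned down perfectly, the prior uncertainty on $r_\gamma$ leaves $\theta$ uncertain by a little under 10% in relative terms on either side, in line with the plot for 500 000 samples.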

Conclusion

This post was motivated by my impression that many people overestimate the importance of the size of the population in relation to the number of samples. I hope to have contributed to demystifying the question by showing what impact it actually has.