Confidence intervals (CIs) are possibly one of the most misunderstood concepts in statistics, and that’s saying something.

CIs are one of many approaches to interval estimation: that is, having decided on a model that is thought to explain how the data we are seeing are generated, how do we compute an interval that represents our estimate of one of the model’s parameters, along with the uncertainty that we have about it?

Definition and interpretation

As an example, imagine that:

  1. there is an infinite discrete signal that we can observe;
  2. our model posits that each sample from the signal must have been independently generated by an underlying Cauchy distribution with unknown location parameter $\theta$ and scale parameter $\lambda$;
  3. we wish to estimate $\theta$ by drawing a few samples from the signal.

In that case, an $N\%$ confidence interval for $\theta$ is, by definition, an interval produced by a procedure such that, if we repeat step 3 infinitely many times and compute such an interval each time, $N\%$ of those intervals will contain the true value of $\theta$. (This must hold true regardless of what the value of $\theta$ actually happens to be.)
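To make this long-run reading concrete, here is a minimal simulation sketch (Python/NumPy, with made-up parameter values; none of this is prescribed by the setup above). It uses a deliberately simple procedure: draw two samples and report the interval between them, which happens to be a valid 50% confidence procedure for the location parameter (the median) of any continuous distribution, including our Cauchy.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, lam = 3.0, 2.0    # "true" location and scale, unknown to the procedure
n_repeats = 100_000      # how many times we repeat step 3

covered = 0
for _ in range(n_repeats):
    x1, x2 = theta + lam * rng.standard_cauchy(2)  # step 3: draw two samples
    lo, hi = min(x1, x2), max(x1, x2)              # the reported interval
    covered += lo < theta < hi

print(covered / n_repeats)  # ≈ 0.5: about half of the intervals contain theta
```

The 50% figure describes the procedure’s long-run behaviour over many repetitions; by itself, it says nothing about any single interval that comes out of it.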

This may sound like an unnecessarily convoluted way of phrasing it, but it is in fact necessary. As Greenland (2016) puts it:

A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature.

In particular, the popular interpretation that any one such interval has an $N\%$ chance of containing $\theta$ is incorrect:

  • Under the frequentist doctrine from which confidence intervals originate, having produced a specific interval from some data, neither the interval nor $\theta$ is “random”, and therefore whether the interval contains $\theta$ is not a matter of probability, except trivially so (probability 0 or 1): either the interval contains $\theta$, or it doesn’t. The phrase “the probability that this interval contains $\theta$” is not considered meaningful by frequentists.

  • Under Bayesian theory, where probability applies not only to “random” variables but more generally to uncertain ones, there is such a thing as the probability that the interval contains $\theta$, but it is not necessarily $N\%$ – that must not be assumed just because it’s an $N\%$ confidence interval. You would have to compute the probability from the data and from your prior distribution for $\theta$, and the resulting probability might happen to be $N\%$, or it might not.
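To illustrate the second point with the running Cauchy example, the sketch below (again just an illustration, with made-up observations, a flat prior on $\theta$, and $\lambda$ assumed known for simplicity) computes the posterior probability that the interval between the two samples, a 50% confidence interval as used in the earlier sketch, contains $\theta$.

```python
import numpy as np

x1, x2 = 1.3, 7.8   # hypothetical observed samples
lam = 2.0           # scale, assumed known for simplicity

# Posterior over theta on a grid: flat prior times the Cauchy likelihood of both samples.
theta_grid = np.linspace(-500.0, 500.0, 1_000_001)
log_post = (-np.log1p(((x1 - theta_grid) / lam) ** 2)
            - np.log1p(((x2 - theta_grid) / lam) ** 2))
post = np.exp(log_post - log_post.max())
post /= post.sum()

lo, hi = min(x1, x2), max(x1, x2)
print(post[(theta_grid > lo) & (theta_grid < hi)].sum())  # generally not exactly 0.5
```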

Barely a confidence interval

Consider the following 50% confidence procedure, which Morey (2016) calls the “trivial interval”: we look at the first two samples, which we denote as $x_1$ and $x_2$, and construct the interval $(-\infty, +\infty)$ if $x_1 < x_2$; the empty interval otherwise.

This example procedure satisfies the convoluted definition above: since the samples are continuous and independent, $x_1 < x_2$ happens with probability 50%, so 50% of the intervals it produces, when it is repeatedly applied to new data, are $(-\infty, +\infty)$ and therefore contain $\theta$. But it plainly does not support the usual “shortcut” interpretation, according to which each individual interval it produces would be considered to have a 50% probability of containing $\theta$.
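The same simulation harness as before (again with made-up parameter values) confirms the 50% coverage, while also making the absurdity visible: every interval the trivial procedure reports either certainly contains $\theta$ or certainly doesn’t.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, lam = 3.0, 2.0
n_repeats = 100_000

covered = 0
for _ in range(n_repeats):
    x1, x2 = theta + lam * rng.standard_cauchy(2)
    if x1 < x2:
        interval = (-np.inf, np.inf)  # certainly contains theta
    else:
        interval = None               # empty: certainly does not contain theta
    covered += interval is not None and interval[0] < theta < interval[1]

print(covered / n_repeats)  # ≈ 0.5: this really is a 50% confidence procedure
```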

In the words of the late Irving J. Good (1983):

One of the intentions of using confidence intervals and regions is to protect the reputation of the statistician by being right in a certain proportion of cases in the long run. Unfortunately, it sometimes leads to such absurd statements, that if one of them were made there would not be a long run.

Some people object: “well, just don’t use such silly confidence procedures, then”, but this rather misses the point: the long-run coverage property is the only one that is inherent to confidence intervals, and therefore the only one that can be assumed if all you know about an interval is that it’s a CI. Specific ways of constructing confidence intervals may well exhibit more desirable properties, but those properties need to be demonstrated, at which point the fact that they are confidence procedures is not really relevant. The point is that it’s a mistake to attribute such properties to confidence intervals in general.

I can hardly word it better than Morey (2016):

Our main point is this: confidence intervals should not be used as modern proponents suggest because this usage is not justified by confidence interval theory. The benefits that modern proponents see CIs as having are considerations outside of confidence interval theory; hence, if used in the way CI proponents suggest, CIs can provide severely misleading inferences. For many CIs, proponents have not actually explored whether the CI supports reasonable inferences or not. For this reason, we believe that appeal to CI theory is redundant in the best cases, when inferences can be justified outside CI theory, and unwise in the worst cases, when they cannot.

A parallel with medical tests

The mistake of interpreting an $N\%$ confidence interval as having an $N\%$ chance of containing the true parameter is similar to the mistake of interpreting a test’s accuracy as if it were its predictive value, i.e. not taking into account which way the test turned out.

Consider a hypothetical test, meant to detect whether you are Santa. Its performance characteristics are as follows:

  • if you are Santa, it will say “yes” with 100% probability (its sensitivity is 100%);

  • if you are not Santa, there is a 99% probability that it will correctly say “no”, but a 1% chance that it will erroneously say “yes” (its specificity is 99%).

The test’s long-run accuracy is at least 99% regardless of prevalence. That is, at least 99% of those who take the test can be expected to receive a result that matches their Santa-ness.

\begin{align*}
\text{accuracy} & = P(\text{test returns correct result}) \\
& = P(\text{Santa} \text{ and } \text{test says “yes”}) + P(\text{not Santa} \text{ and } \text{test says “no”}) \\
& = P(\text{test says “yes”} \mid \text{Santa}) \cdot P(\text{Santa}) + P(\text{test says “no”} \mid \text{not Santa}) \cdot P(\text{not Santa}) \\
& = \text{sensitivity} \cdot P(\text{Santa}) + \text{specificity} \cdot P(\text{not Santa}) \\
& = 1.00 \cdot P(\text{Santa}) + 0.99 \cdot P(\text{not Santa}) \\
& = 0.99 + 0.01 \cdot P(\text{Santa})
\end{align*}

So, even if no one or almost no one is Santa, the test is still 99% accurate… but that’s mainly because of all the people who receive a “no”. That 99% is not the probability that your Santa-ness matches the result you actually received (in our analogy: the probability that the parameter $\theta$ is contained in the specific interval we have computed); it is the pre-test (pre-data) probability that the test result will match your Santa-ness (that the trial will generate data leading to an interval that contains $\theta$). It does not take into account how the test actually turns out.

If you take the test and receive a “yes”, it would be a mistake to say “well, the test is 99% accurate, therefore I am 99% likely to be Santa”. Taking the actual test result (in our analogy: the data that led to the confidence interval) into account, as a Bayesian would, can lead you to the opposite conclusion, i.e. that you are likely among the 1% of people who get an incorrect result (that the interval is among the $(100 - N)\%$ that don’t contain $\theta$).
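Here is that post-data calculation as a short sketch; the prior (one Santa per billion people) is an arbitrary value chosen purely for illustration.

```python
sensitivity = 1.00  # P(test says "yes" | Santa)
specificity = 0.99  # P(test says "no"  | not Santa)
prior = 1e-9        # hypothetical P(Santa) before taking the test

accuracy = sensitivity * prior + specificity * (1 - prior)  # the pre-data 99%
p_yes = sensitivity * prior + (1 - specificity) * (1 - prior)
posterior = sensitivity * prior / p_yes                     # P(Santa | "yes")

print(f"accuracy       = {accuracy:.4f}")   # ≈ 0.99
print(f"P(Santa | yes) = {posterior:.1e}")  # ≈ 1e-7: almost certainly not Santa
```

Even though the test is 99% accurate, a “yes” leaves you with a posterior probability of being Santa of roughly one in ten million, because the prior was so small.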

Quoting Morey (2016) again:

The disconnect between frequentist theory and Bayesian theory arises from the different goals of the two theories. Frequentist theory is a “pre-data” theory. It looks forward, devising procedures that will have particular average properties in repeated sampling (Jaynes 2003; Mayo 1981; 1982) in the future (see also Neyman, 1937, p. 349). […] Any given inference may — or may not — be reasonable in light of the observed data, but this is not Neyman’s concern; he disclaims any conclusions or beliefs on the basis of data. Bayesian theory, on the other hand, is a post-data theory: a Bayesian analyst uses the information in the data to determine what is reasonable to believe, in light of the model assumptions and prior information.

Conclusion

To be entirely honest, I struggle to think of a use case for confidence intervals per se. How much use is a procedure that only guarantees that $N\%$ of the intervals it produces, when applied to new data, contain the parameter’s true value, if inspection of the data in question can reveal the specific interval that was obtained to be significantly more or less likely to contain the true value?

For less trivial examples where confidence intervals don’t exhibit the properties that are often desired of them, and more discussion of the subject in general, consider the following further readings: