After our recent exploration of the pitfalls of confidence intervals, how about a little foray into the world of P-values?

(Note: the term is variably typeset with or without a hyphen, with the P capitalised or not, and in italic or upright type.)

Introduction: frequentism and Bayesianism

P-values, like confidence intervals, are a frequentist concept. Frequentism is one of the two major schools of thought in statistics, the other being Bayesianism (named after Thomas Bayes). Whereas Bayesians use probability distributions to quantify uncertainty in general (and update them using Bayes’ theorem as new information comes to light), frequentists restrict probability to mean specifically the relative frequency of an outcome in infinitely many repetitions of a “trial”, in which some quantities of interest may nevertheless vary from one repetition to the next. Those quantities are called “random”.

So, for example, while a Bayesian would have no issue with assigning a probability (conditional on the information at hand) of 50% to the possibility that flipping an unfamiliar coin would turn up “heads”, a frequentist would only feel justified in doing so if it could be assumed that in infinitely many flips, the coin would come up “heads” 50% of the time.
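
To make the frequentist reading concrete, here is a minimal simulation sketch in Python (the 50% figure is an assumption baked into the simulation, not something it establishes): as the number of flips grows, the relative frequency of “heads” settles down around that assumed value.

```python
import random

random.seed(0)  # make this illustration reproducible

theta = 0.5  # assumed long-run relative frequency of "heads"
heads = 0
flips = 0

for target in (10, 100, 1_000, 10_000, 100_000):
    while flips < target:
        heads += random.random() < theta  # True counts as 1, False as 0
        flips += 1
    print(f"{flips:>7} flips: relative frequency of heads = {heads / flips:.4f}")
```

This is exactly the kind of statement the frequentist is making: a claim about what would happen over many repetitions, not about any single flip.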

Note that the Bayesian’s assignment of 50% to the outcome of the flip is not a statement about that long-run frequency. The Bayesian could very well discuss this relative frequency but would treat it as an uncertain quantity, subject to its own probability distribution, itself subject to change as new information is obtained (for example by flipping the coin ten times). A probability of 50% for the outcome of a single flip can be the result not only of being sure that the relative frequency is 50%, but also of assigning equal probability to all possible relative frequencies between 0% and 100%, or assigning 50% probability to the possibility that the coin is biased to always come up “heads” and 50% probability to the possibility that it is biased the other way.
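
To make that last point concrete (a quick sketch, with the uniform prior chosen purely for illustration): averaging the per-flip probability $\theta$ over a prior that treats every value between 0% and 100% as equally likely still gives 50% for a single flip:

\begin{align*} P(\text{heads}) & = \int_0^1 P(\text{heads} \mid \theta) \, p(\theta) \, \mathrm{d}\theta = \int_0^1 \theta \cdot 1 \, \mathrm{d}\theta = \frac{1}{2} \end{align*}

The same arithmetic applies to the two-point prior mentioned above: $\frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 0 = \frac{1}{2}$.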

Frequentists’ contention with Bayesian methods is not with Bayes’ theorem itself (as long as it’s applied to frequencies), but with epistemic probability and with the requirement to be able to assign a probability distribution to uncertain quantities before any data is available (a so-called prior probability distribution), which they believe makes such use of probability “subjective”.

Note: there are several subphilosophies among frequentists. Ronald A. Fisher, who popularised the P-value, and Jerzy Neyman, who invented the confidence interval, rarely saw eye to eye, and various disputes between the two of them can be found in the literature.

With that out of the way, let’s focus on P-values specifically.

What is a P-value?

Once we have obtained some data from a random trial, the P-value is the relative frequency with which, if we repeated the trial infinitely many times, we would obtain data at least as “extreme” according to a statistic of our choice, under the assumption that our null hypothesis is true. But what does that actually mean?

Imagine that we have a coin, and that each trial consists of flipping it ten times. Our model is that each flip has a constant probability $\theta$ of being “heads”, our null hypothesis “$H_0$” is that the coin is fair ($\theta = 0.5$), and we happen to get 9 “heads”. What is the P-value?

A result at least as “extreme” as getting 9 heads out of 10 flips would be either:

  • getting 9 or 10 heads if performing a so-called “one-tailed test”, or
  • getting 0, 1, 9 or 10 heads in the case of a two-tailed test.

Since the probability of getting “heads” in a single flip is $\theta$, the number of “heads” in 10 flips follows the binomial distribution $B(10, \theta)$. The probability of getting exactly $k$ “heads” (in any order) in 10 flips is therefore ${10 \choose k} \cdot \theta^k \cdot (1 - \theta)^{10 - k}$. Under our null hypothesis that $\theta = 0.5$, it looks like this:

[Figure: bar plot of the probability mass function of the number of “heads” in 10 fair coin flips – most of the mass is concentrated around 5, with very little near 0 and 10.]
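
If you would like to reproduce this plot, here is a minimal sketch (assuming matplotlib is available; the styling is arbitrary):

```python
from math import comb

import matplotlib.pyplot as plt

n = 10
theta = 0.5  # null hypothesis: the coin is fair

ks = range(n + 1)
pmf = [comb(n, k) * theta**k * (1 - theta)**(n - k) for k in ks]

plt.bar(ks, pmf)
plt.xlabel("Number of heads in 10 flips")
plt.ylabel("Probability under the null hypothesis")
plt.show()
```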

And the probability of getting a result at least as “extreme” as 9 heads in the one-tailed case is thus:

\begin{align*} p & = P(9\text{ heads} | \theta=0.5) + P(10\text{ heads} | \theta=0.5) \\ & \approx 0.0107 \end{align*}

In the two-tailed case:

\begin{align*} p & = P(\text{0, 1, 9 or 10 heads} | \theta=0.5) \\ & \approx 0.0215 \end{align*}

Those are the P-values for our trial if the null hypothesis is that the coin is fair and our statistic of interest is the number of “heads”.
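
These numbers are easy to verify with a few lines of Python (a minimal sketch using only the standard library; the `pmf` helper is just a convenience introduced here):

```python
from math import comb

n = 10
theta = 0.5  # null hypothesis: the coin is fair

def pmf(k):
    """Probability of exactly k heads in n flips under the null hypothesis."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

p_one_tailed = pmf(9) + pmf(10)
p_two_tailed = pmf(0) + pmf(1) + pmf(9) + pmf(10)

print(f"one-tailed: p ≈ {p_one_tailed:.4f}")  # ≈ 0.0107
print(f"two-tailed: p ≈ {p_two_tailed:.4f}")  # ≈ 0.0215
```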

Because those P-values are less than 5%, the results would be called “statistically significant” at the 5% level (it is possible to choose other thresholds). That term is widely known, but seemingly rarely understood.

What does it mean?

Arguably, not all that much. In the words of Ronald A. Fisher (1959), a statistically significant result indicates that “either an exceptionally rare chance has occurred or the theory is not true”. But how are we to distinguish between the two?

As he said, such a result is one that would be infrequent, and therefore surprising, if the null hypothesis were correct (say, if our coin really is fair). But “the null hypothesis is true and an infrequent result occurred” might still be less surprising than “the alternative hypothesis is true and produced this result”, especially since the result may not be all that much more frequent under the alternative if the experiment is underpowered. Thus, in interpreting a P-value, there is no escaping the idea of a priori plausibility that frequentists tend to shy away from. Colquhoun (2014) does a good job of explaining this, albeit with a section dedicated to not sounding Bayesian in doing so.
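
To put rough numbers on this, here is a back-of-the-envelope sketch in the spirit of Colquhoun (2014); the 10% prior plausibility and 80% power are assumptions chosen purely for illustration:

```python
# Imagine 1,000 hypotheses are tested at the 5% significance level.
tests = 1_000
prior_real = 0.10  # assumed fraction of tested effects that are real
alpha = 0.05       # significance threshold
power = 0.80       # assumed probability of detecting a real effect

false_positives = tests * (1 - prior_real) * alpha  # 45 spurious "discoveries"
true_positives = tests * prior_real * power         # 80 genuine ones

fraction_wrong = false_positives / (false_positives + true_positives)
print(f"{fraction_wrong:.0%} of 'significant' results are false positives")
```

Under these (made-up but not implausible) assumptions, about 36% of the “statistically significant” results are false positives, despite the 5% threshold – because how often a significant result reflects a real effect depends on how plausible the tested hypotheses were to begin with, not just on the P-value.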

A few properties of P-values

  • Under the null hypothesis, P-values are uniformly distributed by construction (exactly so for continuous test statistics, and approximately for discrete ones such as our coin example), since they correspond to the quantile of extremeness (so to speak), under the null hypothesis, of the data obtained. As noted by Dienes (2011), this is in contrast to Bayes factors, which are driven towards zero under the null hypothesis.

  • Since the definition of the P-value is based on the probability of getting data “at least as extreme”, it is sensitive to stopping rules, because they affect the set of more extreme data that you could get if things were to play out differently (a small numerical illustration follows at the end of this list). Goodman (1999) provides a good example, and notes:

    This problem also has implications for the design of experiments. Because frequentist inference requires the “long run” to be unambiguous, frequentist designs need to be rigid (for example, requiring fixed sample sizes and pre-specified stopping rules), features that many regard as requirements of science rather than as artifacts of a particular inferential philosophy.

    Kruschke (2013) goes further:

    Importantly, the space of all possible $t_\text{null}$ values that might have been observed is defined by how the data were intended to be sampled. […]

    It is important to recognize that NHST [null hypothesis significance testing] cannot be salvaged by attempting to fix or set the sampling intention explicitly in advance. For example, consider two researchers who are interested in the effects of a smart drug on IQ. They collect data from identical conditions. The first researcher obtains the data shown in Figure 3. The second researcher happens to obtain identical data (or at least data with identical $t_\text{obs}$ and $N_\text{obs}$). Should the conclusions of the researchers be the same? Common sense, and scientific acumen, suggests that the conclusions should indeed be the same because the data are the same. But NHST says no, the conclusions should be different, because, it turns out, the first researcher collected the data with the explicit intention to stop at the end of the week and compare with another group of data to be collected the next week, and the second researcher collected the data with the explicit intention to stop when a threshold sample size of $N_1 = N_2 = 50$ was achieved but had an unexpected interruption.

    […]

    There is no reason to base statistical significance on whether the experimenter intended to stop collecting data when $N = 47$ or when the clock reached 5:00 p.m.

  • “Statistically significant” is not the magic incantation that it’s sometimes made out to be. It means that “under the model being considered, such data would be somewhat infrequent if the null hypothesis were true”. That’s all.

  • Relatedly, a misconception that one occasionally sees is the idea that the P-value is the probability that the data were “due to chance alone”. But that would make it the probability that the null hypothesis is true given the data, i.e. $P(H_0 \mid \text{data})$ – a Bayesian interpretation, which should be a hint that this is not what it means. What the P-value really is is the probability, assuming that the null hypothesis is true, of seeing data at least as extreme as those observed, i.e. $P(\text{data as extreme} \mid H_0)$ (if we were to perform the trial infinitely many times).

  • Likewise, be wary of attempts to explain P-values in terms of error rates. It is sometimes claimed that the P-value is the probability of being wrong if you reject the null hypothesis, but that’s just yet another restatement of the Bayesian interpretation that P-values don’t have (the probability, given that you reject the null hypothesis, of the null hypothesis being true after all). Indeed, if the null hypothesis is true, then 100% of its rejections are wrong. If you throw a 20-faced die that you are quite sure is fair, and you reject the null hypothesis that it is fair whenever you land a 20, are you right 95% of the time you do that?
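
As promised, a small numerical illustration of the stopping-rule point (a sketch; the “flip until the first tail” rule is introduced here purely for contrast and is not taken from the cited papers). Take the same flips as in our earlier example: 9 “heads” followed by a “tail”. With a fixed sample size of 10 flips, the one-tailed P-value is the familiar ≈ 0.0107; but if the experimenter had instead intended to keep flipping until the first “tail” appeared, the very same flips yield a different P-value, because the set of “more extreme” outcomes has changed:

```python
from math import comb

theta = 0.5  # null hypothesis: the coin is fair

# Stopping rule 1: flip exactly 10 times.
# One-tailed P-value: probability of at least 9 heads in 10 flips.
p_fixed_n = sum(comb(10, k) * theta**k * (1 - theta)**(10 - k) for k in (9, 10))

# Stopping rule 2: flip until the first tail appears.
# "At least as extreme" now means at least 9 heads before the first tail:
# P(K >= 9) = sum over k >= 9 of theta**k * (1 - theta), which simplifies to theta**9.
p_until_first_tail = theta**9

print(f"fixed sample size of 10: p ≈ {p_fixed_n:.4f}")           # ≈ 0.0107
print(f"flip until first tail:   p ≈ {p_until_first_tail:.4f}")  # ≈ 0.0020
```

Same flips, different intention, different P-value – which is precisely the sensitivity that Goodman and Kruschke criticise.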

Conclusion and further reading

I don’t wish to presume too much about the understanding you had of P-values before reading this, but in case there was anything unclear to you about them, I hope that this has shed some light on the concept and some of its pitfalls.

If you’d like to explore the subject further, I can recommend the following articles: