Population vs. individual-level differences
Imagine that a correlation is found between a biomarker and the presence or absence of a disease. It is proposed that this be used for diagnosis. Can it always?
(Betteridge’s law of headlines applies.)
An artificial, idealised example
Let’s consider the following two Cauchy distributions:
It’s clearly possible to detect a difference between their respective peaks (each one’s location parameter). If we draw 10 000 samples from each, we find that the likely difference is this:
In other words, we can be quite sure that the difference is approximately one. (And I know this to be correct because this is how the two distributions were constructed.)
Those two distributions could be those of our biomarker of interest, in those with and without the disease respectively. We would therefore be able to detect that the two populations have different central tendencies.
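As a rough illustration of how such a difference can be detected, here is a minimal sketch that draws 10 000 samples from each of two Cauchy distributions and compares their sample medians. The locations (122 and 123) and the scale (5) are assumed values, chosen to be consistent with the figures quoted in the next section. The median-based estimate is a simple stand-in and not necessarily the method behind the estimate quoted above.

```python
import numpy as np

# Assumed parameters: the text only says that the locations differ by about one.
LOC_A, LOC_B, SCALE = 122.0, 123.0, 5.0
N_SAMPLES = 10_000

rng = np.random.default_rng(seed=0)  # fixed seed so that the run is repeatable

# A Cauchy(loc, scale) draw is loc + scale * (standard Cauchy draw).
samples_a = LOC_A + SCALE * rng.standard_cauchy(N_SAMPLES)
samples_b = LOC_B + SCALE * rng.standard_cauchy(N_SAMPLES)

# The mean of a Cauchy distribution is undefined, so compare medians instead.
estimated_difference = np.median(samples_b) - np.median(samples_a)

# Asymptotic standard error of a Cauchy sample median: pi * scale / (2 * sqrt(n)).
se_median = np.pi * SCALE / (2 * np.sqrt(N_SAMPLES))
se_difference = np.sqrt(2) * se_median  # two independent medians

print(f"estimated difference: {estimated_difference:.2f} (s.e. {se_difference:.2f})")
```

With this many samples the standard error of the difference is about 0.11, which is small compared with a true difference of one.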
The problem
In screening and diagnosing, however, we are faced with a different problem: having one sample at our disposal, which population did it come from? Those with the disease, or those without it?
The ingredients we need for this are an a priori probability of which distribution the sample is from (for example based on the disease’s prevalence, the patient’s symptoms, other tests…), as well as the ratio of how much more likely such a sample is under one of the distributions than under the other (the “likelihood ratio”). The recipe, then, is Bayes’ theorem: in odds form, the posterior odds that the sample belongs to the diseased population (that our patient has the disease) are the prior odds multiplied by the likelihood ratio.
You might have noticed the issue: the overlap between the two distributions is such that all samples are about as probable under either of them! The prior odds therefore can’t be nudged by much, or said differently, our measurement can’t give us a lot of information as to whether the patient has the disease or not. In this example, the greatest discriminatory power is if the sample happens to be either 117.5 or 127.5, in which case the likelihood ratio in favour of the distribution on the corresponding side will be about 1.22. Such a likelihood ratio would take a prior probability of 50% to roughly 55%; a prior of 90%, to 91.7%; or a prior of 10%, to 11.9%.
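The numbers in the paragraph above can be checked with a short sketch. It assumes Cauchy distributions with locations 122 and 123 and a scale of 5. Those parameters are not stated in the text but reproduce the quoted likelihood ratio of about 1.22 at 127.5. The function names are illustrative.

```python
import math


def cauchy_pdf(x, loc, scale):
    """Density of a Cauchy(loc, scale) distribution at x."""
    return 1.0 / (math.pi * scale * (1.0 + ((x - loc) / scale) ** 2))


def likelihood_ratio(x, loc_num, loc_den, scale):
    """How much more likely x is under the first location than under the second."""
    return cauchy_pdf(x, loc_num, scale) / cauchy_pdf(x, loc_den, scale)


def update_probability(prior, lr):
    """Bayes' theorem in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1.0 + posterior_odds)


# Assumed parameters (locations 122 and 123, scale 5).
lr_at_127_5 = likelihood_ratio(127.5, 123.0, 122.0, 5.0)
posteriors = {prior: update_probability(prior, lr_at_127_5) for prior in (0.5, 0.9, 0.1)}

print(f"likelihood ratio at 127.5: {lr_at_127_5:.3f}")
for prior, posterior in posteriors.items():
    print(f"prior {prior:.0%} -> posterior {posterior:.1%}")
```

By symmetry, a sample of 117.5 gives the same likelihood ratio in favour of the other distribution.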
Tangent: what if the distributions' parameters are not known exactly?
(This section is not essential to the rest of the post and can be skipped if you don’t care.)
As we have said, given the set $\theta$ of parameters for the two distributions, we have:
\begin{equation*} \overbrace{O(\text{disease} \mid \text{data}, \theta)}^\text{posterior odds} = \overbrace{\frac{p(\text{data} \mid \text{disease}, \theta)}{p(\text{data} \mid \neg \text{disease}, \theta)}}^\text{likelihood ratio} \cdot \overbrace{O(\text{disease} \mid \theta)}^\text{prior odds} \end{equation*}
If $\theta$ is not known exactly (say, we don’t have enough samples from patients of known status), then in the Bayesian approach that we are using, we can treat it as a so-called nuisance parameter (a parameter that plays a role in inference but is not part of what we are directly interested in), and marginalise it away.^{1}^{2} Conceptually, what we would do is:
\begin{align*} p(\text{disease} \mid \text{data}) & = \int p(\text{disease}, \theta \mid \text{data})\,\mathrm d\theta \\ & = \int p(\text{disease} \mid \theta, \text{data}) \cdot p(\theta \mid \text{data})\,\mathrm d\theta \end{align*}
In MCMC sampling, we can obtain the marginal posterior $p(\text{disease} \mid \text{data})$ by sampling from the joint distribution $p(\text{disease}, \theta \mid \text{data})$ and just discarding $\theta$ from the samples.^{3}
Alternatively, one could apply the exact Bayes formula to the samples for $\theta$ and thereby obtain a probability distribution for the result.
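Both ideas can be sketched in a few lines. In the sketch below, `theta_draws` is a made-up stand-in for posterior draws of $\theta$ (what an MCMC run would provide). The Cauchy parameters, the jitter around them, and the choice of which distribution belongs to the diseased population are all assumptions for illustration. Applying Bayes' formula to each draw gives a distribution of probabilities, and averaging them approximates the integral above.

```python
import math
import random


def cauchy_pdf(x, loc, scale):
    """Density of a Cauchy(loc, scale) distribution at x."""
    return 1.0 / (math.pi * scale * (1.0 + ((x - loc) / scale) ** 2))


def posterior_given_theta(x, prior, theta):
    """p(disease | x, theta) for one fixed set of parameters."""
    loc_disease, loc_healthy, scale = theta
    lr = cauchy_pdf(x, loc_disease, scale) / cauchy_pdf(x, loc_healthy, scale)
    odds = prior / (1.0 - prior) * lr
    return odds / (1.0 + odds)


random.seed(0)

# HYPOTHETICAL draws from p(theta | data); a real analysis would take them from MCMC.
theta_draws = [
    (random.gauss(123.0, 0.1), random.gauss(122.0, 0.1), abs(random.gauss(5.0, 0.1)))
    for _ in range(5000)
]

# One probability per draw: a distribution for the result.
per_draw = [posterior_given_theta(127.5, 0.5, theta) for theta in theta_draws]

# Averaging over the draws is a Monte Carlo estimate of the marginal posterior.
marginal = sum(per_draw) / len(per_draw)

print(f"marginal p(disease | data) ~ {marginal:.3f}")
print(f"range over draws: {min(per_draw):.3f} to {max(per_draw):.3f}")
```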
Real example
Let’s say that what we have now is not two Cauchy distributions and 10 000 samples from each, but two unknown distributions and the following samples from people known to have and not to have the disease of interest:^{4}
We also know that 10% of the general population has the disease.
We then take a measurement from a patient of unknown status and get 8.3. What does that mean?
ROC curve
We could approach this again by assuming that the two populations can be described by parametric distributions and run a joint inference of their parameters and of the disease status. But let’s put that aside – do we really expect that to be what will be done in practice? – and turn our attention to what will more likely happen: a threshold will be decided once and for all, above which the test is “negative” (no disease), and below which it is “positive” (the disease is thought to be present). If we are lucky, the outcome of that binary test will be interpreted probabilistically, but I wouldn’t bank on even that.^{5}
The process of deciding what the threshold should be typically starts by constructing a so-called “ROC curve” (Receiver Operating Characteristic), which is a curve in which each point shows the sensitivity and specificity^{6}^{7} of the test if it were to be performed at a given threshold.
For our data, it looks like this:
(Note that the curve itself doesn’t show what threshold corresponds to each point.)
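For reference, here is a minimal sketch of how the points of such a curve can be computed from two sets of measurements, with the test called positive when the measurement is at or below the threshold (the convention used in this post). The function name and the toy measurements are hypothetical; they are not the data from the paper.

```python
def roc_points(diseased, healthy):
    """Return (threshold, sensitivity, specificity) for every candidate threshold.

    The test is considered positive when the measurement is <= threshold.
    """
    thresholds = sorted(set(diseased) | set(healthy))
    points = []
    for threshold in thresholds:
        # Sensitivity: fraction of diseased people who test positive.
        sensitivity = sum(x <= threshold for x in diseased) / len(diseased)
        # Specificity: fraction of healthy people who test negative.
        specificity = sum(x > threshold for x in healthy) / len(healthy)
        points.append((threshold, sensitivity, specificity))
    return points


# HYPOTHETICAL toy measurements, for illustration only.
toy_diseased = [7.8, 8.1, 8.4, 8.9]
toy_healthy = [8.2, 8.6, 8.8, 9.1]

toy_roc = roc_points(toy_diseased, toy_healthy)
for threshold, sensitivity, specificity in toy_roc:
    print(f"threshold {threshold}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

Keeping the threshold alongside each point makes it possible to recover which threshold a given point of the curve corresponds to.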
Deriving a threshold
Having constructed our ROC curve, it becomes apparent that there isn’t one point that maximises all dimensions at once; we will have to balance some of them against the others. There are several ways that we could go about this, and here are a few.
Overall accuracy
We might want to use a threshold that maximises the overall “accuracy”, that is, $P(\text{disease}) \times \text{sensitivity} + P(\neg \text{disease}) \times \text{specificity}$. This will depend on $P(\text{disease})$, i.e. the prior probability of disease (or “prevalence” in frequentist terms) that we are considering, so we might want different thresholds for different populations, for example screening in the general population vs. assessing a patient with symptoms.
If we are applying the test to a population in which 10% can be expected to have the disease, remember that we can achieve 90% accuracy simply by saying “no” all the time. As it turns out, the threshold that maximises accuracy in this situation (7.93) doesn’t perform much better: it achieves an accuracy of about 90.84%, by correctly detecting 100% of the healthy people (90%) and about 8.4% of those with the disease (10%). We can indeed verify that $100\% \times 90\% + 8.4\% \times 10\% \simeq 90.84\%$. We would expect the majority of its positive results to be true positives (and we could thus say that we learn enormously from a positive result, since it greatly increases the probability of disease), and about 90.76% of its negative results to be true negatives (this is not much different from the prior probability of 90% of not having the disease, so we don’t learn much from negative results, which are the majority).
If we wish to maximise accuracy in the situation where 40% of those taking the test are expected to have the disease, we find that it’s achieved by a different threshold: 8.44. The test then has a sensitivity of ~51% and a specificity of 80%, leading to an overall accuracy of about 69%. Approximately 63% of its positives will be correct (37% of the positive results will be false alarms), as will 71% of its negatives.
Maximising the overall accuracy can be seen as minimising overall cost in the special case where false positives and false negatives are considered to have equal cost. It would be possible to carry out the minimisation with different weights as well.
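The figures quoted in this section follow directly from the prevalence, sensitivity and specificity. The sketch below recomputes them. The function names are illustrative, and the `expected_cost` helper with its cost weights is an assumption about how the weighted variant might be written.

```python
def accuracy(prevalence, sensitivity, specificity):
    """Overall probability that the test result is correct."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of disease given a positive result."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(prevalence, sensitivity, specificity):
    """Probability of no disease given a negative result."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)


def expected_cost(prevalence, sensitivity, specificity, cost_fn=1.0, cost_fp=1.0):
    """Hypothetical weighted cost; with equal unit costs it equals 1 - accuracy."""
    false_neg = prevalence * (1 - sensitivity)
    false_pos = (1 - prevalence) * (1 - specificity)
    return cost_fn * false_neg + cost_fp * false_pos


# Threshold 7.93 with a 10% prior: sensitivity ~8.4%, specificity 100%.
print(f"{accuracy(0.10, 0.084, 1.00):.2%}")                   # accuracy
print(f"{negative_predictive_value(0.10, 0.084, 1.00):.2%}")  # negatives that are correct

# Threshold 8.44 with a 40% prior: sensitivity ~51%, specificity 80%.
print(f"{positive_predictive_value(0.40, 0.51, 0.80):.0%}")   # positives that are correct
print(f"{negative_predictive_value(0.40, 0.51, 0.80):.0%}")   # negatives that are correct
```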
Guaranteed amount of evidence
Recall the odds form of Bayes’ theorem, in which the prior odds (right) are multiplied by the likelihood ratio (middle) to give the posterior odds (left):
\begin{equation*} O(\text{disease} \mid \text{data}) = \frac{p(\text{data} \mid \text{disease})}{p(\text{data} \mid \neg \text{disease})} \cdot O(\text{disease}) \end{equation*}
The likelihood ratio in case of a positive result (usually referred to as the “positive likelihood ratio”, or $LR^+$) is: $\frac{\text{sensitivity}}{1 - \text{specificity}}$.
In case of a negative result (“negative likelihood ratio”, $LR^-$), it’s: $\frac{1 - \text{sensitivity}}{\text{specificity}}$.
Jaynes^{8} suggested taking the logarithm in order to express those in decibels, and calling the result “evidence”:
\begin{equation*} e(\text{disease} \mid \text{data}) = 10 \cdot \log_{10}\left(O(\text{disease} \mid \text{data})\right) \end{equation*}
Odds of 1:1 (50% probability) would therefore correspond to 0 dB of evidence; odds of 10:1 (91% probability), to 10 dB of evidence; and odds of 1:10 (9% probability), to -10 dB.
Bayes’ theorem then becomes:
\begin{equation*} e(\text{disease} \mid \text{data}) = e(\text{disease}) + \underbrace{10 \cdot \log_{10}\left(\frac{p(\text{data} \mid \text{disease})}{p(\text{data} \mid \neg \text{disease})}\right)}_{\substack{\text{additional evidence} \\ \text{provided by the data}}} \end{equation*}
We might then wish to maximise the minimum amount of evidence (in absolute value) returned by the test, whether it turns out positive or negative.^{9} (This would be unlike the situation above with the very low threshold making most tests negative and uninformative, and a rare few of them positive and very informative.)
This can be shown to be equivalent to making the amount of evidence equal in either case, which in turn is equivalent to making the sensitivity and specificity equal. With our data, the threshold that achieves this is 8.6, putting both sensitivity and specificity at 65%. Interestingly, because sensitivity and specificity are equal, the accuracy is also equal to them regardless of prevalence. The test then provides 2.7 dB of evidence in favour of the result (positive or negative), which corresponds to taking a 10% prior probability to either 17.1% (if positive) or 5.6% (if negative), or a 50% prior probability to either 65% or 35%.
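The decibel arithmetic above can be reproduced with a small sketch. The function names are illustrative. Note that the likelihood ratios are undefined when the specificity is exactly 0 or 1, which this sketch does not handle.

```python
import math


def evidence_db(probability):
    """Evidence in decibels: 10 * log10 of the odds."""
    return 10 * math.log10(probability / (1 - probability))


def probability_from_db(db):
    """Inverse of evidence_db: convert decibels of evidence back to a probability."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)


def result_evidence_db(sensitivity, specificity):
    """Evidence (in dB) added by a positive and by a negative result."""
    positive = 10 * math.log10(sensitivity / (1 - specificity))
    negative = 10 * math.log10((1 - sensitivity) / specificity)
    return positive, negative


# Threshold 8.6: sensitivity and specificity both at 65%.
db_positive, db_negative = result_evidence_db(0.65, 0.65)
print(f"positive result: {db_positive:+.1f} dB, negative result: {db_negative:+.1f} dB")

for prior in (0.10, 0.50):
    after_positive = probability_from_db(evidence_db(prior) + db_positive)
    after_negative = probability_from_db(evidence_db(prior) + db_negative)
    print(f"prior {prior:.0%}: {after_positive:.1%} if positive, {after_negative:.1%} if negative")
```

Because evidence is additive, updating the prior is a single addition in decibels followed by a conversion back to a probability.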
Youden's J statistic
Also called Youden’s index, it was suggested by W. J. Youden in 1950^{10} as a means to evaluate the performance of binary tests. It is calculated simply as $\text{sensitivity} + \text{specificity} - 1$ and therefore ranges from 0 for a test that is positive exactly as often in people with and without the disease, to 1 for a perfect test that identifies everyone accurately. (A test that’s positive more often in those without the disease would have a negative index but can simply be flipped around.)
In our case, that index is found to be maximised by a threshold of 8.5, yielding a sensitivity of 59% and a specificity of 74% (and therefore an index of 0.33).
The performance is still not great: a positive result (measurement ≤ 8.5) would turn a 10% probability of disease into a 20% probability, and a negative result would take it to 5.8%.
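As a small check, the sketch below computes the index for the four operating points quoted in this post and picks the largest. This is only a comparison among those four points. The actual maximisation runs over every threshold on the ROC curve, which requires the underlying data.

```python
def youden_j(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


# threshold -> (sensitivity, specificity), as quoted (rounded) in the text.
operating_points = {
    7.93: (0.084, 1.00),
    8.44: (0.51, 0.80),
    8.5: (0.59, 0.74),
    8.6: (0.65, 0.65),
}

j_values = {
    threshold: youden_j(sensitivity, specificity)
    for threshold, (sensitivity, specificity) in operating_points.items()
}
best_threshold = max(j_values, key=j_values.get)

for threshold, j in j_values.items():
    print(f"threshold {threshold}: J = {j:.3f}")
print(f"largest J among these points: threshold {best_threshold}")
```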
Applying the thresholds
We have found a few thresholds that we could use for our test. How does each of them fare for our measurement of 8.3 in the screening situation where the prior probability of disease is 10%?

Optimising for accuracy with a 10% prior gave a threshold of 7.93. With this threshold, our 8.3 would be considered negative, and there would be a 90.76% chance that this is correct, or a 9.24% chance that we are missing a case. (Pretty much the same as not doing the test.) It would detect slightly less than one case for each 100 people screened, with few false positives.

Optimising for accuracy with a 40% prior yielded a threshold of 8.44. With such a threshold, 8.3 would be positive, and there would be a 22.2% chance that this is correct (78% chance of false alarm).

Our “guaranteed amount of evidence” approach gave a similar threshold of 8.6, and it’s therefore not very surprising that it performs similarly: 8.3 would be positive and there would be a 17.1% chance that this is correct.

Likewise, the threshold that maximised Youden’s J statistic was 8.5, with which 8.3 would be positive and 20% likely to be a true positive.
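The comparison above can be sketched as a single loop over the thresholds. The sensitivities and specificities are the rounded values quoted earlier in the post, and the labels and function name are illustrative.

```python
PRIOR = 0.10       # prior probability of disease when screening
MEASUREMENT = 8.3  # the patient's measurement

# (label, threshold, sensitivity, specificity), as quoted (rounded) in the text.
operating_points = [
    ("accuracy, 10% prior", 7.93, 0.084, 1.00),
    ("accuracy, 40% prior", 8.44, 0.51, 0.80),
    ("guaranteed evidence", 8.6, 0.65, 0.65),
    ("Youden's J", 8.5, 0.59, 0.74),
]


def probability_of_disease(prior, sensitivity, specificity, positive):
    """Posterior probability of disease after a positive or a negative result."""
    if positive:
        with_disease = prior * sensitivity
        without_disease = (1 - prior) * (1 - specificity)
    else:
        with_disease = prior * (1 - sensitivity)
        without_disease = (1 - prior) * specificity
    return with_disease / (with_disease + without_disease)


results = {}
for label, threshold, sensitivity, specificity in operating_points:
    positive = MEASUREMENT <= threshold  # low values count as positive
    posterior = probability_of_disease(PRIOR, sensitivity, specificity, positive)
    results[threshold] = (positive, posterior)
    outcome = "positive" if positive else "negative"
    print(f"{label} (threshold {threshold}): {outcome}, P(disease) = {posterior:.1%}")
```

With the rounded inputs, the 8.44 threshold comes out at about 22.1% rather than 22.2%, presumably because the sensitivity is quoted only approximately.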
Conclusion
As I hope this post has shown, the task of turning a measure into a test is fraught with difficulties. The mere ability to detect a difference in a given measure between those with and without a disease does not guarantee the ability to turn said measure into a useful test that can reliably distinguish between the two.
One approach not explored here (maybe in a future post?) is that of weighted logistic regression, but it would likely have the same problem as the parametric distribution approach with regard to being used in actual practice, and I don’t expect that it would perform that much better anyway. Still, it could be fun.

“Probability concepts explained: Marginalisation”, by Jonny Brooks-Bartlett (2018) ↩︎

“What is Bayesian statistics?”, by Sean R Eddy (2004). DOI: 10.1038/nbt0904-1177 ↩︎

“emcee: The MCMC Hammer”, by Daniel Foreman-Mackey, David W. Hogg, Dustin Lang and Jonathan Goodman (2012). DOI: 10.48550/arXiv.1202.3665 ↩︎

Data from “Choroidal and retinal thinning in chronic kidney disease independently associate with eGFR decline and are modifiable with treatment”, by Farrah, T.E., Pugh, D., Chapman, F.A. et al. (2023), figure 2B. DOI: 10.1038/s41467-023-43125-1 ↩︎

“Statistical Illiteracy in Residents: What They Do Not Learn Today Will Hurt Their Patients Tomorrow”, by Odette Wegwarth (2013). DOI: 10.4300/JGME-D-13-00084.1 ↩︎

Sensitivity = if truly positive, probability of being detected as such; specificity = if truly negative, probability of being detected as such. ↩︎

The sensitivity of a test is sometimes called its “true positive rate”, but I find that term horrifyingly misleading, as it could be misinterpreted as the “rate of positives that are true” (which would be the positive predictive value), whereas it’s actually the “rate of positives among those who should be”, i.e. the transposed conditional. Likewise, $1 - \text{specificity}$ would be called the “false positive rate” (ugh). It is possible for a test to have a “false positive rate” of 1% and yet for the vast majority of its positives to be false positives, if the test is performed mainly on true negatives, as is often the case with screening. ↩︎

“Probability Theory: The Logic of Science”, by Edwin Thompson Jaynes (2003), pp. 92-96. DOI: 10.1017/CBO9780511790423 ↩︎

I don’t have a good principled justification for this approach but I thought it would be intriguing to explore it anyway. ↩︎

“Index for rating diagnostic tests”, by W. J. Youden (1950). DOI: 10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3 ↩︎