Violating the independence assumption with repeated measures data: why it’s bad to ignore correlation

justinkraaijenbrink · Published in Analytics Vidhya · 9 min read · Aug 10, 2021


One of the first things people teach you in statistics class is the linear model, where a continuous response variable, say cognitive health, is predicted from one or several covariates, such as sex and blood pressure. If cognitive health is measured with some kind of questionnaire, we often find the sum or mean score to be normally distributed. The linear model is an extremely useful framework, which can quite easily be extended to more general cases with, for example, a binary outcome (e.g., whether or not someone has Alzheimer’s Disease, to stay with the cognitive health theme). A wonderful starting point to take a (deep) dive into the wonderful world of statistics! However…

…inherent to the statistics class on linear models are some rigorous assumptions. I won’t go over all the linear model assumptions, because Joos Korstanje already did a great job in explaining them very clearly, but I do want to highlight one assumption in particular: the independence of observations. The meaning of this assumption is actually pretty obvious; translated to our example, we presume there is no relationship between the cognitive health scores of different subjects. During my bachelor’s degree in Educational Sciences, we were taught not to worry too much about the independence assumption, since it is merely a property of the study design. I think this is illustrative of a more general practice where we often assume — for the sake of convenience and motivated by practical considerations — the assumptions to be correct. I always like to make the comparison with physics in high school, where the teacher would often mention that we could disregard air resistance when solving gravity exercises. Now, as the title of this post already suggests, ignoring assumptions can be quite harmful! To see why, we should first dive a bit more into the topic of variability.

Variability and its sources

Variability is actually one of the cornerstones of statistics. We can estimate average effect sizes perfectly well, but without a measure of variability these averages won’t tell us much. When we compare, for example, the cognitive health scores of a treatment and a control group, and both groups vary considerably around their group mean, it would be quite difficult to detect a significant difference between the two groups. However, when the variation around the group means is fairly small, detecting any significant difference would be far easier. The variability between observations comes from three different sources¹: between-subjects variability, within-subjects variability and measurement error.

Between-subjects variability Arises because individuals differ naturally. Some people tend to have higher cognitive health scores, whereas others will be more likely to report lower scores.

Within-subjects variability When subjects are measured multiple times, some inherent biological fluctuation will occur. For example, blood pressure may vary as a result of season, circadian rhythm, time of day, or diet. So, some unobserved underlying processes may cause random variability within subjects.

Measurement error Results from the fact that variables are measured using certain measurement instruments, and these instruments are not perfectly precise. For example, blood pressure may be measured by different nurses at different occasions, which causes random fluctuations. Although this can be considered a separate source of variability, it is often lumped in with the within-subjects variability.
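To make these three sources concrete, here is a small R sketch that builds a score from a stable subject level, a biological fluctuation, and measurement error; the standard deviations are made-up numbers for illustration only:

```r
# Illustrative decomposition into the three sources of variability
# (all standard deviations below are made-up numbers)
set.seed(123)
n_subj <- 1000
b  <- rnorm(n_subj, sd = 3.0)   # between-subjects: stable individual level
w1 <- rnorm(n_subj, sd = 1.5)   # within-subjects fluctuation, occasion 1
w2 <- rnorm(n_subj, sd = 1.5)   # within-subjects fluctuation, occasion 2
e1 <- rnorm(n_subj, sd = 0.5)   # measurement error, occasion 1
e2 <- rnorm(n_subj, sd = 0.5)   # measurement error, occasion 2

y1 <- 50 + b + w1 + e1          # observed score at baseline
y2 <- 50 + b + w2 + e2          # observed score at follow-up

# the shared between-subjects component b makes repeated measures correlated
cor(y1, y2)                     # close to 3^2 / (3^2 + 1.5^2 + 0.5^2) ≈ 0.78
```

Because the stable subject component b enters both occasions, the two measurements of the same subject are correlated — exactly the dependence the rest of this post is about.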

With the distinction between these different sources of variability, we have taken a huge step in explaining why ignoring dependence is such a bad idea. However, to obtain a truly satisfactory answer, we must first explore the concepts of between- and within-subjects effects with the help of our cognitive health example.

Between- vs. within-subject effects

Consider a study where we have two measurements, baseline and follow-up (5 years later), of cognitive health scores (response) and the covariates sex and blood pressure. We can summarize this design as follows: for subject i at occasion j (j = 1: baseline, j = 2: follow-up) we observe the cognitive health score yᵢⱼ, together with sexᵢ, which is fixed over time, and blood pressure bpᵢⱼ, which may differ between occasions.

The effect of sex in this case represents the between-subjects effect, because its value is presumably stable over time. Blood pressure is what we call a within-subjects effect, because within each individual its value may fluctuate between different measurement occasions. In the absence of interaction effects, we can more formally define the between-subjects effect as:

B = ½(Ȳ₁ᵐ + Ȳ₂ᵐ) − ½(Ȳ₁ᶠ + Ȳ₂ᶠ),

where Ȳⱼᵐ and Ȳⱼᶠ denote the mean scores of the male and female subjects at occasion j, and the within-subjects effect as:

W = Ȳ₂ − Ȳ₁,

the difference between the follow-up and baseline means.

In the next section I will show that the variance of the between-subjects effect will be underestimated when we ignore the dependence between observations, leading to p-values that are too optimistic. On the other hand, the p-values of within-subjects effects will be too conservative, because the variance will be overestimated.

Variance of B and W

Before we proceed:

WARNING: compared to the previous sections, this one is quite mathy… It practically just contains derivations to arrive at two insightful equations that illustrate how the between- and within-subjects variability depend on the correlation between responses. Fasten your seatbelt!

That having been said, let’s depart! Since B is ahead of W in the alphabet, we first have a closer look at the variance of B, or in formula:

Var(B) = Var(½(Ȳ₁ᵐ + Ȳ₂ᵐ) − ½(Ȳ₁ᶠ + Ȳ₂ᶠ))

This can be expanded a bit further:

Var(B) = ¼ Var(Ȳ₁ᵐ + Ȳ₂ᵐ) + ¼ Var(Ȳ₁ᶠ + Ȳ₂ᶠ),

where we employed that Var(aX − bY) = a²Var(X) + b²Var(Y) for independent X and Y: the male and female groups consist of different subjects, so their means are independent.

Now, since the measures at baseline and follow-up are very likely to be correlated (due to between-subjects variability, some subjects have higher cognitive scores than others), this expression can be broken down even further:

Var(B) = ¼ [Var(Ȳ₁ᵐ) + Var(Ȳ₂ᵐ) + 2 Cov(Ȳ₁ᵐ, Ȳ₂ᵐ)] + ¼ [Var(Ȳ₁ᶠ) + Var(Ȳ₂ᶠ) + 2 Cov(Ȳ₁ᶠ, Ȳ₂ᶠ)]

It looks like our derivations are becoming increasingly messy, but no worries! We are just one step away from simplifying the entire expression, for which we take a closer look at the covariance between the averages of two random variables:

Cov(X̄, Ȳ) = Cov((1/n) Σᵢ Xᵢ, (1/n) Σⱼ Yⱼ)

Exploiting the general rule for covariances,

Cov(Σᵢ Xᵢ, Σⱼ Yⱼ) = Σᵢ Σⱼ Cov(Xᵢ, Yⱼ),

gives us:

Cov(X̄, Ȳ) = (1/n²) Σᵢ Σⱼ Cov(Xᵢ, Yⱼ)

When we assume all observations to be independent, every covariance term vanishes and this simply boils down to:

Cov(X̄, Ȳ) = 0

But when i = j, observations Xᵢ and Yⱼ are correlated, because they are measurements on the same subject! Then the equation above no longer holds, and, assuming a constant covariance across subjects, we arrive at:

Cov(X̄, Ȳ) = (1/n²) Σᵢ Cov(Xᵢ, Yᵢ) = Cov(X, Y) / n

Writing the covariance in terms of correlation and standard deviations, we obtain:

Cov(X̄, Ȳ) = ρσₓσᵧ / n

Let’s assume the standard deviations to be equal (σₓ = σᵧ = σ) for the sake of convenience, such that we finally have:

Cov(X̄, Ȳ) = ρσ² / n

Well, that’s it! We can now return to our apparently complex expression, plug in the result, and see that it turns out to be pretty understandable after all. We had:

Var(B) = ¼ [Var(Ȳ₁ᵐ) + Var(Ȳ₂ᵐ) + 2 Cov(Ȳ₁ᵐ, Ȳ₂ᵐ)] + ¼ [Var(Ȳ₁ᶠ) + Var(Ȳ₂ᶠ) + 2 Cov(Ȳ₁ᶠ, Ȳ₂ᶠ)],

where for illustrative purposes we may assume that all variances in the expression are equal. Note that we consider the variance of an average here, which requires dividing by the sample size n, so that Var(Ȳ) = σ²/n. We end up with:

Var(B) = ¼ [σ²/n + σ²/n + 2ρσ²/n] + ¼ [σ²/n + σ²/n + 2ρσ²/n]

So, to summarize, our final expression for the variance of the between-subjects effect is:

Var(B) = (σ²/n)(1 + ρ)
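The result Var(B) = (σ²/n)(1 + ρ) can be verified with a quick Monte Carlo sketch in R; the values of σ, ρ and n below are arbitrary illustrative choices:

```r
# Monte Carlo sketch: check Var(B) = (sigma^2 / n) * (1 + rho)
# sigma, rho and n are arbitrary illustrative values
set.seed(42)
n <- 25; sigma <- 2; rho <- 0.6; reps <- 20000

group_mean <- function() {
  # correlated (baseline, follow-up) scores via a shared subject component
  shared <- rnorm(n, sd = sigma * sqrt(rho))
  y1 <- shared + rnorm(n, sd = sigma * sqrt(1 - rho))
  y2 <- shared + rnorm(n, sd = sigma * sqrt(1 - rho))
  (mean(y1) + mean(y2)) / 2     # time-averaged group mean
}
# B = difference in time-averaged means between two independent groups
B <- replicate(reps, group_mean() - group_mean())

var(B)                          # empirical, close to the theoretical value
(sigma^2 / n) * (1 + rho)       # theoretical: 0.256
```

The shared component gives each pair (y1, y2) variance σ² and correlation ρ, so the empirical variance of B lands close to (σ²/n)(1 + ρ).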

Hooray! We have successfully derived a relatively simple expression, which depends on the correlation between repeated measurements. Now we can go through more or less equivalent derivations to construct such an expression for the within-subjects effect as well. Remember how we defined the within-subjects effect:

W = Ȳ₂ − Ȳ₁

We simply proceed as before, noting one important difference: the first and second measurement are correlated! We therefore have the following expression for the variance:

Var(W) = Var(Ȳ₂ − Ȳ₁) = Var(Ȳ₁) + Var(Ȳ₂) − 2 Cov(Ȳ₁, Ȳ₂)

Inspecting the first two terms on the right-hand side of the equation gives us:

Var(Ȳ₁) = σ₁²/n and Var(Ȳ₂) = σ₂²/n

As we have already seen before, for illustrative purposes we may assume that all variances in the expression are equal, whereby the above expression simplifies to:

Var(W) = 2σ²/n − 2 Cov(Ȳ₁, Ȳ₂)

This doesn’t look too complicated anymore, does it? The only thing left to convert to a simpler expression is the covariance between Ȳ₁ and Ȳ₂. Here we go:

Cov(Ȳ₁, Ȳ₂) = ρσ²/n

With these derived simplified expressions, we can write the variance of the within-subjects effect as follows:

Var(W) = 2σ²/n − 2ρσ²/n = (2σ²/n)(1 − ρ)
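The result Var(W) = (2σ²/n)(1 − ρ) can be checked the same way in R; again, σ, ρ and n are arbitrary illustrative values:

```r
# Monte Carlo sketch: check Var(W) = (2 * sigma^2 / n) * (1 - rho)
# sigma, rho and n are arbitrary illustrative values
set.seed(7)
n <- 25; sigma <- 2; rho <- 0.6; reps <- 20000

mean_change <- function() {
  shared <- rnorm(n, sd = sigma * sqrt(rho))        # induces correlation rho
  y1 <- shared + rnorm(n, sd = sigma * sqrt(1 - rho))
  y2 <- shared + rnorm(n, sd = sigma * sqrt(1 - rho))
  mean(y2) - mean(y1)     # the shared subject component cancels in the change
}
W <- replicate(reps, mean_change())

var(W)                          # empirical, close to the theoretical value
(2 * sigma^2 / n) * (1 - rho)   # theoretical: 0.128
```

Note how the stable subject component cancels out of the difference: that is precisely why positive correlation shrinks the variance of a within-subjects contrast.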

Hooray again! We once more scaled down a relatively complex expression to a much simpler one that depends on the correlation between repeated measurements. Now that we have seen that correlation influences the variance estimates of both the between- and within-subjects effects, it becomes quite evident that assuming independence — or in other words, no correlation — is truly bad! To see this in action, let’s leave all those theoretical derivations behind, relieve our minds with an illustrative example, and wrap up this correlation story.

Illustrative example

Remember the expressions for the between- and within-subjects variance:

Var(B) = (σ²/n)(1 + ρ) and Var(W) = (2σ²/n)(1 − ρ)

If we assume observations to be independent (ρ = 0), while in fact ρ > 0, the variance of between-subjects effects will be underestimated, leading to standard errors that are too small and p-values that are too optimistic. The variance of the within-subjects effects, on the other hand, will be overestimated, resulting in p-values that are too conservative. It goes without saying that both situations are highly undesirable! Let’s resort to a simulation example, where we model the cognitive health scores as a function of sex and blood pressure, at baseline and follow-up. I used R to generate the data and analyse the dataset, either assuming independence (lm) or taking the dependence into account (lmer):
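A data-generating script along these lines could look as follows; the seed, sample size and coefficients below are illustrative assumptions, not the original code:

```r
# Sketch of such a simulation; the seed, sample size and coefficients
# are illustrative assumptions, not the author's original script
set.seed(2021)
n <- 100                                           # number of subjects
id   <- factor(rep(1:n, each = 2))                 # two rows per subject
time <- rep(c(0, 1), times = n)                    # baseline / follow-up
sex  <- rep(rbinom(n, 1, 0.5), each = 2)           # between-subjects covariate
bp   <- rnorm(2 * n, mean = 120, sd = 10)          # within-subjects covariate
subj <- rep(rnorm(n, sd = 4), each = 2)            # random intercept -> correlated rows

cognitive <- 60 + 3 * sex - 0.1 * bp + subj + rnorm(2 * n, sd = 2)
dat <- data.frame(id, time, sex, bp, cognitive)

# naive analysis: ordinary linear model, assumes independent observations
fit_lm <- lm(cognitive ~ sex + bp, data = dat)
summary(fit_lm)$coefficients

# mixed model with a random intercept per subject, accounting for dependence
# (requires the lme4 package)
if (requireNamespace("lme4", quietly = TRUE)) {
  fit_lmer <- lme4::lmer(cognitive ~ sex + bp + (1 | id), data = dat)
  print(summary(fit_lmer)$coefficients)
}
```

The random intercept subj plays the role of the between-subjects variability from earlier: it is what makes the two rows of each subject correlated, and what `lm` ignores while `lmer` models explicitly.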

The difference between the two analysis functions is something for another blogpost, but the most important — and striking! — thing for now is the difference in output between these two functions. Assuming independence, the estimates and corresponding standard errors we obtain are:

Estimates and SEs assuming independence

When we do take the correlation between repeated measures into account, we indeed see that the standard error for the between-subjects effect sex had been grossly underestimated, whereas the standard error for the within-subjects effect of blood pressure has almost halved:

Estimates and SEs taking dependence into account

These results certainly support the bold statement at the beginning of this blogpost: ignoring correlation is definitely a bad idea. And although I am not sure whether it extends to physics class in high school, my general advice would be: don’t take assumptions for granted!

References

[1] Fitzmaurice, G.M., Laird, N.M. & Ware, J.H. (2012) Applied Longitudinal Analysis. John Wiley & Sons, Hoboken, NJ.
