
Correlation Between Groups and Predictors

In this, my first post, I discuss Bafumi and Gelman's paper "Fitting Multilevel Models When Predictors and Group Effects Correlate." The paper explains why it is problematic when predictors and group effects are correlated in mixed models (e.g., those fit with lmer), and it also proposes a solution.

When I read the paper, the code the authors used to create their plots wasn't readily available to me (or, at least, I couldn't find it). So, I decided to write my own and post it here, in case others are also interested.

The basic idea is that correlation between the predictor and the group effects inflates the test statistic for the predictor, artificially shrinking its p-value, so the effect looks more significant than it really is. Here are two examples:

  1. First, the baseline: a predictor, a response, and a grouping variable. The predictor is related to the response, but there is no correlation between the grouping and the predictor.
library(lme4)
set.seed(2142018)
no_corr <- replicate(1000, {
  pred <- rnorm(100, 0, 2)           # predictor, independent of the groups
  out <- pred + rnorm(100, 0, 7)     # response depends on the predictor
  # add group effects to the response only
  out <- out + c(rnorm(25, 1, .001), rnorm(25, -1, .001),
                 rnorm(25, 3, .001), rnorm(25, 2, .001))
  df <- data.frame(pred = pred, out = out,
                   units = factor(rep(1:4, each = 25)))
  # keep the t statistic for the slope of pred
  summary(lmer(out ~ (1 | units) + pred, data = df))$coefficients[2, 3]
})
hist(no_corr)

  2. Next, we see what happens when there is a correlation between the groups and the predictor. Everything else remains the same.

yes_corr <- replicate(1000, {
  pred <- rnorm(100, 0, 2)
  out <- pred + rnorm(100, 0, 7)
  # add (nearly identical) group effects to BOTH the predictor and the
  # response, inducing correlation between the groups and the predictor
  pred <- pred + c(rnorm(25, 1, .001), rnorm(25, -1, .001),
                   rnorm(25, 3, .001), rnorm(25, 2, .001))
  out <- out + c(rnorm(25, 1, .001), rnorm(25, -1, .001),
                 rnorm(25, 3, .001), rnorm(25, 2, .001))
  df <- data.frame(pred = pred, out = out,
                   units = factor(rep(1:4, each = 25)))
  # t statistic for the slope of pred
  summary(lmer(out ~ (1 | units) + pred, data = df))$coefficients[2, 3]
})

Now, we compare the histograms of the test statistics for the slope of the predictor in the two scenarios. There is a visible difference between the two histograms: the correlated version produces noticeably larger test statistics overall.

corr <- data.frame(no_corr = no_corr,
                   yes_corr = yes_corr)
library(ggplot2)
library(dplyr)
tidyr::gather(corr, key = "corr", value = "t_stat") %>% 
  ggplot(aes(x = t_stat)) + 
  geom_histogram() + 
  facet_wrap(~corr)

Well, the histograms may not be that convincing, but the means are.

mean(no_corr)
## [1] 2.879333
mean(yes_corr)
## [1] 3.509314
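
Another way to put this, tying back to the p-value claim above, is the share of runs whose t statistic clears a conventional cutoff. As a rough sketch (lmer doesn't report degrees of freedom, so |t| > 2 here is just a stand-in for significance at roughly the 5% level):

mean(abs(no_corr) > 2)
mean(abs(yes_corr) > 2)

And a formal test confirms that the means differ: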
t.test(no_corr, yes_corr)
## 
##  Welch Two Sample t-test
## 
## data:  no_corr and yes_corr
## t = -13.39, df = 1996.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7222483 -0.5377140
## sample estimates:
## mean of x mean of y 
##  2.879333  3.509314
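
Finally, here is a rough sketch of the solution Bafumi and Gelman propose: include the group mean of the predictor as a group-level predictor, so the group effects that remain are no longer correlated with the predictor. (The names grp and pred_mean are mine, and I rebuild a single correlated data set outside replicate(), since df only exists inside it; for simplicity, the same group-effect draw is added to both variables.)

grp <- c(rnorm(25, 1, .001), rnorm(25, -1, .001),
         rnorm(25, 3, .001), rnorm(25, 2, .001))
pred <- rnorm(100, 0, 2)
out <- pred + rnorm(100, 0, 7) + grp   # response gets the group effects
pred <- pred + grp                     # ...and so does the predictor
df <- data.frame(pred = pred, out = out,
                 units = factor(rep(1:4, each = 25)))
# the fix: group mean of the predictor as a group-level predictor
df$pred_mean <- ave(df$pred, df$units)
summary(lmer(out ~ pred + pred_mean + (1 | units), data = df))$coefficients

With pred_mean in the model, the slope on pred is estimated from the within-group variation, which is the part of the predictor that is uncorrelated with the group effects.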