TL;DR: The geepack R package doesn’t do what it doesn’t claim to do

Introduction

Running many generalized estimating equation (GEE) models can be quite time consuming when at least one of the clusters has a lot of subjects, as the time complexity of the GEE model is \(O(n^3)\)¹ due to the full matrix multiplication required for computing the correlation matrix.

¹ More like \(O(m^3)\) where \(m\) is the size of largest cluster when the clusters vary in terms of size.

For relatively simple models – where there are a few categorical predictors, the continuous variables are coarsened, and the response variable is binary – there is a computational trick that can be employed while fitting logistic regression models where the raw data are aggregated and fed into the model. An example of this approach can be found here.

Does this approach work – in terms of providing the same point estimates and standard errors – for GEE models as well? Based on an example dataset as well as a small simulation study, the answer seems to be ‘it depends’.

Real dataset

The first comparison is based on the ohio dataset from the {geepack} R package. This kind of setting where there are very many small clusters (537 clusters with 4 observations per cluster) is where GEEs usually shine.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(furrr)

Loading required package: future

library(geepack)

theme_set(theme_bw())

data("ohio")

ohio <- as_tibble(ohio)

glimpse(ohio)

Rows: 2,148
Columns: 4
$ resp  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ id    <int> 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5…
$ age   <int> -2, -1, 0, 1, -2, -1, 0, 1, -2, -1, 0, 1, -2, -1, 0, 1, -2, -1, …
$ smoke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Fitting a model to the full dataset results in the following output

mod1 <- geeglm(
  formula = resp ~ smoke, data = ohio, family = "binomial",
  id = id, corstr = "exchangeable"
)

summary(mod1)


Call:
geeglm(formula = resp ~ smoke, family = "binomial", data = ohio, 
    id = id, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err    Wald Pr(>|W|)    
(Intercept)  -1.8212  0.1099 274.527   <2e-16 ***
smoke         0.2716  0.1776   2.338    0.126    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)        1  0.1103
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.3513 0.06163
Number of clusters:   537  Maximum cluster size: 4

Fitting the same model to an aggregated version of the same dataset results in identical estimates

agg_df <- ohio %>%
  group_by(id, smoke) %>%
  summarize(
    n = n(),
    resp = sum(resp),
    .groups = "drop"
  ) %>%
  mutate(resp_p = resp / n)

mod2 <- geeglm(
  formula = resp_p ~ smoke,
  weights = n, data = agg_df,
  family = "binomial", id = id, corstr = "exchangeable"
)

summary(mod2)


Call:
geeglm(formula = resp_p ~ smoke, family = "binomial", data = agg_df, 
    weights = n, id = id, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err   Wald Pr(>|W|)    
(Intercept)   -1.821   0.110 274.53   <2e-16 ***
smoke          0.272   0.178   2.34     0.13    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    0.513  0.0532
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha        0       0
Number of clusters:   537  Maximum cluster size: 1

So the estimated scale, correlation, and the maximum cluster size statistics are wrong in the second case, but the parameter estimates for the intercept and slope coefficients as well as the corresponding standard errors are nearly identical, which can be compared more easily in this data frame that combines the estimates from both models

bind_rows(
  broom::tidy(mod1) %>% mutate(data = "raw data"),
  broom::tidy(mod2) %>% mutate(data = "aggregated data")
) %>%
  arrange(term)

# A tibble: 4 × 6
  term        estimate std.error statistic p.value data           
  <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>          
1 (Intercept)   -1.82      0.110    275.     0     raw data       
2 (Intercept)   -1.82      0.110    275.     0     aggregated data
3 smoke          0.272     0.178      2.34   0.126 raw data       
4 smoke          0.272     0.178      2.34   0.126 aggregated data

But does this also work in settings where there are a small number of large clusters, which is where the GEE calculations take a long time to run?

Simulations

Model

The code below simulates data from the following mixed-effects model² with random-intercept terms where person \(j\) is nested within cluster \(i\). Each cluster has a different proportion of treatment uptake \(p_{treated,j}\) varying between 40-60% where \(X_{ij}\) indicates whether person \(j\) in cluster \(i\) received the treatment or not.

² The model syntax is inspired by the statistical rethinking book

\[ \begin{gather} Y_{ij} \sim \text{Bernoulli}(p_{ij}) \\ \text{logit}(p_{ij}) = -1.4 + 0.3 \times X_{ij} + \alpha_i \\ \alpha_i \sim \text{Normal}(0, 1) \\ X_{ij} \sim \text{Bernoulli}(p_{treated,i}) \\ p_{treated,i} \sim \text{Uniform(0.4, 0.6)} \end{gather} \]

The time taken to train the model on the raw as well as the aggregate data, the point estimates, and standard errors for the coefficients are stored for analysis.

Code

simulate_data_and_get_estimates <- function(n, n_clus, seed) {
  set.seed(seed)
  # cluster random intercept
  rand_intercept <- rnorm(n_clus, mean = 0, sd = 1)

  set.seed(seed)
  sim_data <- map_dfr(.x = 1:n_clus, .f = ~ {
    tibble(
      id = .x,
      x = rbinom(round(n / n_clus), size = 1, prob = runif(1, 0.4, 0.6)),
      y = rbinom(round(n / n_clus), size = 1, prob = plogis(-1.4 + 0.3 * x + rand_intercept[[.x]]))
    )
  })

  # fit model to raw data
  t0 <- Sys.time()
  mod1_sim <- geeglm(
    y ~ x,
    data = sim_data,
    family = "binomial", id = id, corstr = "exchangeable"
  )
  t1 <- Sys.time()
  t1 <- as.numeric(difftime(t1, t0, units = "secs"))

  # fit GLM model to raw data
  mod1_sim_glm <- glm(y ~ factor(id) + x - 1,
                      data = sim_data, family = "binomial")

  # fit model to aggregated data
  agg_data <- sim_data %>%
    group_by(id, x) %>%
    summarize(n = n(), y = sum(y), .groups = "drop") %>%
    mutate(y_prob = y / n)

  t2 <- Sys.time()
  mod2_sim <- geeglm(
    y_prob ~ x,
    data = agg_data, weights = n,
    family = "binomial", id = id, corstr = "exchangeable"
  )
  t3 <- Sys.time()
  t3 <- as.numeric(difftime(t3, t2, units = "secs"))

  # fit GLM model to aggregated data
  mod2_sim_glm <- glm(y_prob ~ factor(id) + x - 1, data = agg_data,
                      weights = n, family = "binomial")

  results <- bind_rows(
    # GEE results
    broom::tidy(mod1_sim) %>%
      mutate(data = 'raw data', time_secs = t1, estimator = "GEE"),
    broom::tidy(mod2_sim) %>%
      mutate(data = 'aggregated data', time_secs = t3, estimator = "GEE"),
    # GLM results
    broom::tidy(mod1_sim_glm) %>%
      mutate(data = 'raw data', time_secs = NA_real_, estimator = "GLM"),
    broom::tidy(mod2_sim_glm) %>%
      mutate(data = 'aggregated data', time_secs = NA_real_, estimator = "GLM")
  ) %>%
    mutate(n = n, n_clus = n_clus, seed = seed) %>%
    filter(!stringr::str_detect(term, pattern = "id"))

  return(results)
}

Here’s what the output from this function looks like:

simulate_data_and_get_estimates(n = 100, n_clus = 5, seed = 3)

# A tibble: 6 × 11
  term       estim…¹ std.e…² stati…³ p.value data  time_s…⁴ estim…⁵     n n_clus
  <chr>        <dbl>   <dbl>   <dbl>   <dbl> <chr>    <dbl> <chr>   <dbl>  <dbl>
1 (Intercep…  -1.59    0.409 15.0    1.07e-4 raw …  0.00748 GEE       100      5
2 x           -0.144   0.383  0.141  7.07e-1 raw …  0.00748 GEE       100      5
3 (Intercep…  -1.73    0.395 19.1    1.27e-5 aggr…  0.00326 GEE       100      5
4 x           -0.107   0.402  0.0702 7.91e-1 aggr…  0.00326 GEE       100      5
5 x           -0.266   0.596 -0.446  6.55e-1 raw … NA       GLM       100      5
6 x           -0.266   0.596 -0.446  6.55e-1 aggr… NA       GLM       100      5
# … with 1 more variable: seed <dbl>, and abbreviated variable names ¹estimate,
#   ²std.error, ³statistic, ⁴time_secs, ⁵estimator

simulation_parameters <- expand_grid(
  n = c(100, 500, 1000, 2500, 5000, 7500, 10000),
  n_clus = c(5, 50, 250),
  seed = 1:20
) %>% 
  # remove runs where number of clusters is larger than total sample size
  filter(n_clus < n)

# set up multicore processing
plan(multisession, workers = 15)

# parallelize purrr::pmap_dfr by using the {furrr} package
simulation_results <- future_pmap_dfr(
  .l = simulation_parameters,
  # have to pass args as ..1 or ..2, else it fails
  .f = ~ simulate_data_and_get_estimates(n = ..1, n_clus = ..2, seed = ..3),
  .options = furrr_options(seed = NULL)
)

plan(sequential) # turn off multicore

This chunk can take a while to run for some of the parameter combinations, so pre-computed results are loaded and analyzed further.

simulation_results <- read_csv(file = "gee_glm_sim.csv", show_col_types = FALSE) %>% 
  mutate(
    n_clus_raw = n_clus,
    n_clus = factor(paste0("# clusters: ", n_clus),
                    levels = paste0("# clusters: ", c(5, 50, 250))),
    term = case_when(
      term == "(Intercept)" ~ "Intercept",
      term == "x" ~ "Slope", 
      TRUE ~ term
  ))

glimpse(simulation_results)

Rows: 2,400
Columns: 12
$ term       <chr> "Intercept", "Slope", "Intercept", "Slope", "Slope", "Slope…
$ estimate   <dbl> -7.17e-01, -7.38e-02, -5.35e-01, -1.14e-01, -5.92e-02, -5.9…
$ std.error  <dbl> 4.69e-01, 1.75e-01, 4.09e-01, 1.53e-01, 4.86e-01, 4.86e-01,…
$ statistic  <dbl> 2.3402, 0.1772, 1.7056, 0.5518, -0.1216, -0.1216, 7.1290, 4…
$ p.value    <dbl> 1.26e-01, 6.74e-01, 1.92e-01, 4.58e-01, 9.03e-01, 9.03e-01,…
$ data       <chr> "raw data", "raw data", "aggregated data", "aggregated data…
$ time_secs  <dbl> 0.01389, 0.01389, 0.00420, 0.00420, NA, NA, 0.00928, 0.0092…
$ estimator  <chr> "GEE", "GEE", "GEE", "GEE", "GLM", "GLM", "GEE", "GEE", "GE…
$ n          <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,…
$ n_clus     <fct> # clusters: 5, # clusters: 5, # clusters: 5, # clusters: 5,…
$ seed       <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,…
$ n_clus_raw <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…

Analysis

Computation time

First, the time taken to fit the model to the raw dataset is shown as a function of total sample size \(n\) where there are \(n / n_{clus}\) observations per cluster. Each panel has a different y-axis to allow reading values directly from the plot.

simulation_results %>%
  filter(data == "raw data", estimator == "GEE") %>%
  # distinct here since the data frame contains a row for
  # each of the intercept and slope terms but the runtimes are the same
  # for both these terms
  distinct(seed, time_secs, n, n_clus) %>%
  ggplot(aes(x = n, y = time_secs, group = n_clus)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, degree = 3)) +
  geom_smooth(method = "lm", formula = y ~ I(x ^ 3), color = "forestgreen") +
  stat_summary(aes(group = interaction(n_clus, n)), geom = "point", 
               fun = mean, color = "darkorange", size = 2) + 
  xlab("Total sample size (n)") +
  ylab("Runtime (in seconds)") +
  facet_wrap(~ n_clus, scales = "free")

The blue line fits a cubic polynomial \[time = \beta_0 + \beta_1 n + \beta_2 n^2 + \beta_3 n^3 + \varepsilon\] to the run times as a function of sample size. The green line fits the same model but without the linear and quadratic terms, i.e, \(time = \alpha_0 + \alpha_1 n^3 + \varepsilon\). The yellow points are the mean run times for each level of \(n\) within each panel.

When the number of clusters is very small, an increase in the sample size within the largest cluster leads to a cubic increase in expected run times. The closeness of the blue and green lines in the leftmost panel indicates that the cubic term in the execution time model dominates the lower order terms. In the second and third panels, the \(O(n^3)\) limit hasn’t hit yet, as the green and the blue lines are in disagreement.

For 10000 samples split into 2000 samples in each of the five clusters, it takes approximately 15 minutes per run. The same total sample size split into 250 samples in each of the 40 clusters, each model takes about 7 seconds to run, and even less time when there are 250 clusters. This is a massive difference in time between the different settings.

Convergence issues

However, unfortunately some of the models run into convergence issues, where the point estimates and the standard errors explode

simulation_results %>% 
  filter(estimator == "GEE") %>% 
  arrange(desc(std.error)) %>% 
  head(n = 10)

# A tibble: 10 × 12
   term      estimate std.e…¹ stati…² p.value data  time_…³ estim…⁴     n n_clus
   <chr>        <dbl>   <dbl>   <dbl>   <dbl> <chr>   <dbl> <chr>   <dbl> <fct> 
 1 Intercept  4.24e16 1.22e16  12.1   5.12e-4 aggr… 0.00516 GEE     10000 # clu…
 2 Intercept  1.13e17 8.91e15 161.    0       aggr… 0.00621 GEE      7500 # clu…
 3 Intercept -1.99e17 7.88e15 641.    0       aggr… 0.0175  GEE      1000 # clu…
 4 Intercept  2.32e15 7.25e15   0.102 7.49e-1 aggr… 0.00667 GEE      7500 # clu…
 5 Intercept  5.32e15 6.59e15   0.650 4.20e-1 aggr… 0.00582 GEE       100 # clu…
 6 Slope     -1.69e15 2.25e15   0.562 4.53e-1 aggr… 0.0110  GEE      2500 # clu…
 7 Intercept  9.69e14 2.12e15   0.210 6.47e-1 aggr… 0.00558 GEE     10000 # clu…
 8 Slope     -1.03e15 2.09e15   0.241 6.23e-1 aggr… 0.00763 GEE      7500 # clu…
 9 Slope     -1.94e15 2.05e15   0.896 3.44e-1 aggr… 0.00503 GEE     10000 # clu…
10 Slope      3.82e15 1.88e15   4.13  4.23e-2 aggr… 0.00817 GEE      5000 # clu…
# … with 2 more variables: seed <dbl>, n_clus_raw <dbl>, and abbreviated
#   variable names ¹std.error, ²statistic, ³time_secs, ⁴estimator

This seems to be a problem when training GEE models on aggregated data when the number of clusters is low irrespective of the total sample size, which leads to about 21% of the estimates from the aggregated data being convergence failures. An estimate is flagged as convergence failure when the stdandard error of the coefficients > 5 on the logit scale.

simulation_results <- simulation_results %>%
  mutate(high_se = as.numeric(std.error > 5)) 

simulation_results %>%
  filter(estimator == "GEE") %>% 
  select(-estimate, -std.error, -statistic, -p.value, -time_secs) %>%
  # put intercept and slope estimates in one row
  pivot_wider(everything(), names_from = term, values_from = high_se) %>%
  # if at least one of the intercept or slope terms have a very high se
  mutate(high_se = pmin(Intercept + Slope, 1)) %>%
  group_by(data, n_clus, n) %>%
  summarize(n_total = n(), n_failed = sum(high_se), .groups = "drop_last") %>%
  #print(n = Inf) %>%
  summarize(n_total = sum(n_total), n_failed = sum(n_failed), .groups = "drop") %>%
  mutate(percent_failed = 100 * n_failed / n_total)

# A tibble: 6 × 5
  data            n_clus          n_total n_failed percent_failed
  <chr>           <fct>             <int>    <dbl>          <dbl>
1 aggregated data # clusters: 5       140       29           20.7
2 aggregated data # clusters: 50      140        0            0  
3 aggregated data # clusters: 250     120        0            0  
4 raw data        # clusters: 5       140        0            0  
5 raw data        # clusters: 50      140        0            0  
6 raw data        # clusters: 250     120        0            0

Raw vs aggregated data estimates

These rows with convergence failures can be removed and the agreement between the estimates from the raw vs the aggregate datasets can be assessed visually

slope_plot_data <- simulation_results %>%
  filter(term == "Slope", high_se == 0) %>%
  select(seed, term, data, estimate, n_clus, n, estimator) %>%
  tidyr::pivot_wider(id_cols = c(seed, term, n_clus, n, estimator),
                     names_from = data, values_from = estimate) %>%
  mutate(`Total sample size` = factor(n), 
         id = interaction(estimator, n_clus, sep = ", ", lex.order = TRUE)) 

slope_plot_data %>% 
  filter(estimator == "GEE") %>% 
  ggplot(aes(x = `raw data`, y = `aggregated data`, color = `Total sample size`)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  # geom_point(data = tibble(x = 0.3, y = 0.3), 
  #            aes(x = x, y = y), 
  #            color = "black", size = 3, inherit.aes = FALSE) +
  xlab("Slope coefficient from the full dataset") +
  ylab("Slope coefficient from the aggregated dataset") +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 1)) + 
  facet_wrap(~ id, ncol = 3, scales = "fixed")

Warning: Removed 29 rows containing missing values (geom_point).

This plot compares the estimated slope coefficient (true value 0.3) from the GEE model on aggregated data vs the estimated slope coefficient from the model on the raw data while varying the number of clusters as well as the total sample size. The agreement between estimates increases as the number of clusters increases (going from the top left panel to the top right panel).

On the other hand, the estimates from running GLMs on the same aggregated and raw datasets are identical, as indicated by the points lying precisely on the black line in the following plot

slope_plot_data %>% 
  filter(estimator == "GLM") %>% 
  ggplot(aes(x = `raw data`, y = `aggregated data`, color = `Total sample size`)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  # geom_point(data = tibble(x = 0.3, y = 0.3), 
  #            aes(x = x, y = y), 
  #            color = "black", size = 3, inherit.aes = FALSE) +
  xlab("Slope coefficient from the full dataset") +
  ylab("Slope coefficient from the aggregated dataset") +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 1)) + 
  facet_wrap(~ id, ncol = 3, scales = "fixed")

As expected, the estimates for \(n = 10,000\) are tightly clustered around the true value compared to \(n = 100\) in both sets of plots.

Ok so the estimates aren’t identical for the GEE models unfortunately, but are they approximately similar across the raw and aggregated datasets? This is assessed visually via box plots below for the case of 5 clusters in the data

dist_plot_data <- simulation_results %>% 
  mutate(
    ci_lower = estimate - (qnorm(0.975) * std.error),
    ci_upper = estimate + (qnorm(0.975) * std.error), 
    ci_width = ci_upper - ci_lower, 
    ci_upper_half_width = ci_upper - estimate
  ) %>% 
  filter(term == "Slope", high_se == 0, estimator == "GEE") %>% 
  select(
    Estimate = estimate, 
    `Std. Error` = std.error, 
    `95% CI LL` = ci_lower, 
    `95% CI UL` = ci_upper,
    `95% CI Width` = ci_width,
    `95% CI Upper HW` = ci_upper_half_width,
    n, n_clus_raw, n_clus, data, seed) %>% 
  tidyr::pivot_longer(cols = Estimate:`95% CI Upper HW`, 
                      names_to = "statistic", 
                      values_to = "values") %>% 
  mutate(
    statistic = factor(statistic,
                       levels = c("Estimate", "Std. Error", 
                                  "95% CI LL", "95% CI UL", 
                                  "95% CI Width", "95% CI Upper HW")), 
    n = factor(n, ordered = TRUE), 
    data = stringr::str_to_title(data)
  )

glimpse(dist_plot_data)

Rows: 4,626
Columns: 7
$ n          <ord> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,…
$ n_clus_raw <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
$ n_clus     <fct> # clusters: 5, # clusters: 5, # clusters: 5, # clusters: 5,…
$ data       <chr> "Raw Data", "Raw Data", "Raw Data", "Raw Data", "Raw Data",…
$ seed       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3,…
$ statistic  <fct> Estimate, Std. Error, 95% CI LL, 95% CI UL, 95% CI Width, 9…
$ values     <dbl> -0.0738, 0.1753, -0.4174, 0.2698, 0.6871, 0.3436, -0.1137, …

dist_plot_data %>% 
  filter(n_clus_raw == 5) %>% 
  ggplot(aes(x = n, y = values, color = data)) + 
  #geom_point(position = position_dodge(width = 0.2)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  xlab("Total sample size (n)") + 
  ylab("") +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  facet_wrap(~ statistic, ncol = 2, scales = "free")

So based on these box plots, it seems like most of the sampling distributions for these statistics – point estimate, standard errors, CI width, etc. – are pretty comparable but not identical.

Conclusion

So to conclude, this strategy isn’t really going to be useful when fitting a single GEE model to a real dataset using the R package {geepack}, but it can still be pretty decent when fitting many GEE models to simulated datasets to calculate some statistics. Additionally, all (?) matrix operations performed by GEE eventually boil down to summing some random variables, so it should be possible to have an implementation that works on aggregated data.