One reason why a (g)lm coefficient is NA

R
GLM
Multicollinearity
Published

March 31, 2023

TL;DR: Use the alias function in R to check if you have nested factors (predictors) in your data

Recently while fitting a logistic regression model, some of the coefficients estimated by the model were NA. Initially I thought it was due to separation1, as that’s the most common issue I usually face when fitting unregularized models on data.

1 see this Wikipedia article, or this stats.stackexchange.com thread (and the associated links in the sidebar)

2 in this day and age of ChatGPT, I know

However, googling2 threw up many threads on multicollinearity and anyway, separation usually leads to nonsensical estimates like \(1.5 \times 10^8\) instead of NA.

After combing through many stackexchange threads, I discovered the alias function in R from this thread, which was pretty handy at identifying the problematic column(s).

It’s interesting that the alias documentation doesn’t mention anything about GLMs (glm()) but this does work on glm(..., family = "binomial") model objects3.

3 possibly since the class of a glm object is c("glm", "lm")

The rest of this post explores this issue and its resolution using aggregated test data, where the city variable is intentionally nested within the country variable.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
simulated_data <- tribble(
  ~age, ~city, ~country, ~y, ~N,
  "< 30", "Paris", "France", 30, 100,
  "< 30", "Nice", "France", 20, 100,
  "< 30", "Berlin", "Germany", 23, 100,
  "30+", "Paris", "France", 12, 100,
  "30+", "Nice", "France", 11, 100,
  "30+", "Berlin", "Germany", 27, 100
) %>% 
  mutate(y = y / N)

model <- glm(y ~ age + city + country, weights = N, data = simulated_data, family = "binomial")

summary(model)

Call:
glm(formula = y ~ age + city + country, family = "binomial", 
    data = simulated_data, weights = N)

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -0.8733     0.1876  -4.656 3.23e-06 ***
age30+          -0.4794     0.2062  -2.325   0.0201 *  
cityNice        -0.6027     0.2558  -2.356   0.0185 *  
cityParis       -0.2286     0.2395  -0.954   0.3399    
countryGermany       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19.2591  on 5  degrees of freedom
Residual deviance:  8.0953  on 2  degrees of freedom
AIC: 43.492

Number of Fisher Scoring iterations: 4

The estimate for Germany is NA. Calling the alias function on this GLM model shows that the dummy variable of Germany is linearly dependent on (a subset of) the other columns.

alias(model)
Model :
y ~ age + city + country

Complete :
               (Intercept) age30+ cityNice cityParis
countryGermany  1           0     -1       -1       

This means that the column for Germany is redundant in this design matrix (or the model matrix), as the values of Germany (the pattern of 0s and 1s) can be perfectly predicted / recreated by combining the Intercept, Nice, and Paris columns using the coefficients from the output of alias(). This is why the perfect multicollinearity in this case leads to an NA coefficient.

# countryGermany and Germany are identical
model.matrix(~ ., data = simulated_data) %>% 
  as_tibble() %>% 
  select(-y, -N, -`age30+`) %>% 
  mutate(Germany = `(Intercept)` - cityNice - cityParis)
# A tibble: 6 × 5
  `(Intercept)` cityNice cityParis countryGermany Germany
          <dbl>    <dbl>     <dbl>          <dbl>   <dbl>
1             1        0         1              0       0
2             1        1         0              0       0
3             1        0         0              1       1
4             1        0         1              0       0
5             1        1         0              0       0
6             1        0         0              1       1

Another interesting observation is that changing the order of the variables

# here country comes before city
glm(y ~ age + country + city, weights = N, data = simulated_data, family = "binomial") %>% 
  alias()
Model :
y ~ age + country + city

Complete :
          (Intercept) age30+ countryGermany cityNice
cityParis  1           0     -1             -1      

leads to different estimates being NA, i.e., the estimate for Paris is now NA and is linearly dependent on the Intercept, country (Germany), and another city (Nice). This depends on which term enters the model first.

The simplest solution here is to drop one of the city or country columns, or to build two separate models – one without country, and one without city.

If the goal is to estimate coefficients for both the city and country variables, then a mixed model with nested effects might be the right rabbit hole to go down, assuming they both have more than 10 levels or so. See the following links: