TL;DR: Use the alias function in R to check if you have nested factors (predictors) in your data
Recently while fitting a logistic regression model, some of the coefficients estimated by the model were NA. Initially I thought it was due to separation1, as that’s the most common issue I usually face when fitting unregularized models on data.
1 see this Wikipedia article, or this stats.stackexchange.com thread (and the associated links in the sidebar)
2 in this day and age of ChatGPT, I know
However, googling2 threw up many threads on multicollinearity and anyway, separation usually leads to nonsensical estimates like \(1.5 \times 10^8\) instead of NA.
After combing through many stackexchange threads, I discovered the alias function in R from this thread, which was pretty handy at identifying the problematic column(s).
It’s interesting that the alias documentation doesn’t mention anything about GLMs (glm()) but this does work on glm(..., family = "binomial") model objects3.
3 possibly since the class of a glm object is c("glm", "lm")
The rest of this post explores this issue and its resolution using aggregated test data, where the city variable is intentionally nested within the country variable.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
simulated_data <-tribble(~age, ~city, ~country, ~y, ~N,"< 30", "Paris", "France", 30, 100,"< 30", "Nice", "France", 20, 100,"< 30", "Berlin", "Germany", 23, 100,"30+", "Paris", "France", 12, 100,"30+", "Nice", "France", 11, 100,"30+", "Berlin", "Germany", 27, 100) %>%mutate(y = y / N)model <-glm(y ~ age + city + country, weights = N, data = simulated_data, family ="binomial")summary(model)
Call:
glm(formula = y ~ age + city + country, family = "binomial",
data = simulated_data, weights = N)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8733 0.1876 -4.656 3.23e-06 ***
age30+ -0.4794 0.2062 -2.325 0.0201 *
cityNice -0.6027 0.2558 -2.356 0.0185 *
cityParis -0.2286 0.2395 -0.954 0.3399
countryGermany NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 19.2591 on 5 degrees of freedom
Residual deviance: 8.0953 on 2 degrees of freedom
AIC: 43.492
Number of Fisher Scoring iterations: 4
The estimate for Germany is NA. Calling the alias function on this GLM model shows that the dummy variable of Germany is linearly dependent on (a subset of) the other columns.
alias(model)
Model :
y ~ age + city + country
Complete :
(Intercept) age30+ cityNice cityParis
countryGermany 1 0 -1 -1
This means that the column for Germany is redundant in this design matrix (or the model matrix), as the values of Germany (the pattern of 0s and 1s) can be perfectly predicted / recreated by combining the Intercept, Nice, and Paris columns using the coefficients from the output of alias(). This is why the perfect multicollinearity in this case leads to an NA coefficient.
# countryGermany and Germany are identicalmodel.matrix(~ ., data = simulated_data) %>%as_tibble() %>%select(-y, -N, -`age30+`) %>%mutate(Germany =`(Intercept)`- cityNice - cityParis)
Another interesting observation is that changing the order of the variables
# here country comes before cityglm(y ~ age + country + city, weights = N, data = simulated_data, family ="binomial") %>%alias()
Model :
y ~ age + country + city
Complete :
(Intercept) age30+ countryGermany cityNice
cityParis 1 0 -1 -1
leads to different estimates being NA, i.e., the estimate for Paris is now NA and is linearly dependent on the Intercept, country (Germany), and another city (Nice). This depends on which term enters the model first.
The simplest solution here is to drop one of the city or country columns, or to build two separate models – one without country, and one without city.
If the goal is to estimate coefficients for both the city and country variables, then a mixed model with nested effects might be the right rabbit hole to go down, assuming they both have more than 10 levels or so. See the following links:
---title: "One reason why a (g)lm coefficient is NA"date: "2023-03-31"categories: [R, GLM, Multicollinearity]code-fold: falsereference-location: marginimage: "aliased-effect.png"---TL;DR: Use the *alias* function in R to check if you have *nested factors* (predictors) in your dataRecently while fitting a logistic regression model, some of the coefficients estimated by the model were `NA`. Initially I thought it was due to *separation*[^1], as that's the most common issue I usually face when fitting unregularized models on data.[^1]: see [this](https://en.wikipedia.org/wiki/Separation_(statistics)) Wikipedia article, or [this](https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression) stats.stackexchange.com thread (and the associated links in the sidebar)However, googling[^2] threw up many threads on multicollinearity and anyway, separation usually leads to nonsensical estimates like $1.5 \times 10^8$ instead of `NA`.[^2]: in this day and age of *ChatGPT*, I knowAfter combing through many stackexchange threads, I discovered the [*alias*](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/alias.html) function in R from [this](https://stats.stackexchange.com/questions/13465/how-to-deal-with-an-error-such-as-coefficients-14-not-defined-because-of-singu) thread, which was pretty handy at identifying the problematic column(s).It's interesting that the alias documentation doesn't mention anything about GLMs (`glm()`) but this does work on `glm(..., family = "binomial")` model objects[^3].[^3]: possibly since the `class` of a `glm` object is `c("glm", "lm")`The rest of this post explores this issue and its resolution using aggregated test data, where the city variable is intentionally nested within the country variable.```{r}library(dplyr)simulated_data <-tribble(~age, ~city, ~country, ~y, ~N,"< 30", "Paris", "France", 30, 100,"< 30", "Nice", "France", 20, 100,"< 30", "Berlin", "Germany", 23, 100,"30+", "Paris", "France", 12, 100,"30+", "Nice", "France", 11, 100,"30+", "Berlin", "Germany", 27, 100) %>%mutate(y = y / N)model <-glm(y ~ age + city + country, weights = N, data = simulated_data, family ="binomial")summary(model)```The estimate for Germany is `NA`. Calling the alias function on this GLM model shows that the dummy variable of Germany is *linearly dependent* on (a subset of) the other columns.```{r}alias(model)```This means that the column for Germany is redundant in this *design matrix* (or the *model matrix*), as the values of Germany (the pattern of 0s and 1s) can be perfectly predicted / recreated by combining the Intercept, Nice, and Paris columns using the coefficients from the output of `alias()`. This is why the perfect [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity) in this case leads to an `NA` coefficient.```{r}# countryGermany and Germany are identicalmodel.matrix(~ ., data = simulated_data) %>%as_tibble() %>%select(-y, -N, -`age30+`) %>%mutate(Germany =`(Intercept)`- cityNice - cityParis)```Another interesting observation is that changing the order of the variables```{r}# here country comes before cityglm(y ~ age + country + city, weights = N, data = simulated_data, family ="binomial") %>%alias()```leads to different estimates being `NA`, i.e., the estimate for Paris is now `NA` and is linearly dependent on the Intercept, country (Germany), and another city (Nice). This depends on which term enters the model first.The simplest solution here is to drop one of the city or country columns, or to build two separate models -- one without country, and one without city.If the goal is to estimate coefficients for both the city and country variables, then a mixed model with nested effects might be the right rabbit hole to go down, assuming they both have more than 10 levels or so. See the following links:- <https://stats.stackexchange.com/questions/197977/analyzing-nested-categorical-data>- <https://stats.stackexchange.com/questions/79360/mixed-effects-model-with-nesting>- <https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model>- Second bullet point from the answer: <https://stats.stackexchange.com/questions/243811/how-to-model-nested-fixed-factor-with-glmm>- <https://stackoverflow.com/questions/70537291/lmer-model-failed-to-converge-with-1-negative-eigenvalue>- <https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html>- <https://stackoverflow.com/questions/40723196/why-do-i-get-na-coefficients-and-how-does-lm-drop-reference-level-for-interact>