Akshat Dwivedi - Specifying an offset in a Tweedie model with identity link

This post explores several things:

differences between two ways of calculating the mean
specifying the response variable and offset in a Tweedie regression model depending on the quantity of interest
log-likelihood and the corresponding score function for a Tweedie model with weights
why one way of specifying offsets in poisson GLMs with the identity link does not extend to Tweedie GLMs

In this article, the words duration, offset, exposure, and weight are used interchangeably.

Introduction

Data coming from fields like ecology, insurance, epidemiology – such as number of new cases of a disease in multiple cities with very different population sizes, number of claims and total claim amounts from insurance policies with different durations, etc. – need to have the varying exposures accounted for as offsets while building models. These might look something like the following simulated dataset with an exposure variable (w) and a response variable (y):

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

n <- 1e6

# tweedie distribution parameters
p <- 1.3 # variance power
mu <- 200 # mean
phi <- 350 # dispersion, so variance = phi * mu ^ p

set.seed(45)
sim_data <- tibble(
  w = runif(n, min = 0.5, max = 3),
  y = tweedie::rtweedie(n = n, mu = mu, phi = phi / w, power = p)
)

# summary(sim_data)

sim_data %>% slice_head(n = 3)

# A tibble: 3 × 2
      w     y
  <dbl> <dbl>
1  2.08     0
2  1.29     0
3  1.10     0

For readability, let’s pretend that this response variable is the individual claim amounts for an insurance company with a client base of 1 million customers. Additionally, for each individual the duration of how long they’ve been insured is also recorded. So the first customer has been insured for about 2 years and 1 month, and has not filed a claim at all. The third customer has been insured for a little over a year and has also not filed any claims.

Ratio of means or mean of ratios?

The quantity of interest from such datasets may be the mean per unit exposure – expected claim amount per individual per year, expected number of new cases of a disease per year per 100,000 people, etc. There are two separate quantities that can be calculated as expected values – the ratio of the means of amount and duration variables, or the mean of the individual ratios. The differences between these two quantities is explained pretty well in this stackexchange thread.

The ratio estimator estimates the first quantity \(R = \sum_i{y_i} / \sum_i{w_i}\). This answers the following question: given a population (or a group of individuals) who were followed-up / observed for a given amount of time, and generated a total amount of claims, how much of the total claim amount can be attributed per person per unit exposure?

The weighted sample mean of the individual ratios estimates the second quantity \(\mathbb{E} = \sum_i{w_i y_i} / \sum_i{w_i}\). This answers the question: given each individual’s claim amount and exposure, what was the average individual’s claim amount per unit exposure?

In the latter case, the link between claim amount and duration at the individual level is preserved, whereas for the former, the numerator and denominators are totals at the population level.

If everyone had exactly the same weight \(w_i\) of 1 year, the denominator would sum to the sample size of 1 million for the data above, and both the estimates would be the same. If the weights were different, then the two results would differ.

In epidemiology, the ratio estimator is used for incidence rate calculations. On the other hand, the insurance papers on Tweedie models mentioned at the bottom of this post all seem to use the weighted mean approach.

For the simulated data above, the two estimates are quite different:

sum(sim_data$y) / sum(sim_data$w)

[1] 114.5185

weighted.mean(sim_data$y, w = sim_data$w)

[1] 200.4661

In this case, the second estimate matches the true mean value chosen for the simulation, which makes sense given that the data are simulated from a Tweedie model with varying exposures (\(Y_i \sim \text{Tweedie}(\mu_i, \phi / w_i, p)\)) but with the mean-variance relationship preserved under unit exposure. This is described in part A of the supplement to the Yang et al 2016 paper.

However, both the numbers are right, as they answer different questions.

Building a regression model

This formula is easy enough to apply if there are no predictors, but a regression model is needed to estimate these quantities when there are predictors in the data.

There are two ways of specifying the model. The first method corresponds to passing the claim amounts unmodified, and passing the exposures as weights:

# weighted mean estimator
summary(glm(y ~ 1, weights = w, data = sim_data,
            family = statmod::tweedie(var.power = 1.3, link.power = 1)))


Call:
glm(formula = y ~ 1, family = statmod::tweedie(var.power = 1.3, 
    link.power = 1), data = sim_data, weights = w)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 200.4661     0.4431   452.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Tweedie family taken to be 349.4572)

    Null deviance: 240499666  on 999999  degrees of freedom
Residual deviance: 240499666  on 999999  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3

The estimated intercept and dispersion correspond to the same parameters (\(\mu = 200\), \(\phi = 350\)) picked for the simulation. This estimate also coincides with the weighted mean estimate. link.power = 1 indicates that the identity link function is used, and var.power = 1.3 or \(p = 1.3\) is assumed to be known here. Usually the profile likelihood approach is used to pick the best value on the interval \(p \in (1, 2)\) for data with exact zeroes.

Once \(p\) is chosen, \(\mu\) can be estimated independently of \(\phi\), and finally dispersion \(\phi\) can be estimated using an optimization algorithm, or one of the estimators described in section 6.8 of the Dunn and Smyth book. The glm function in R uses the Pearson estimator:

\[ \phi = \frac{1}{N-p'} \sum_{i=1}^{N} \frac{w_i (y_i - \hat\mu_i) ^ 2}{\hat\mu_i^p} \]

where \(p'\) is the number of parameters in the model and \(\text{Var}(\mu_i) = \mu_i^p\) is the variance function.

# pearson estimate of dispersion
(sum((sim_data$w * ((sim_data$y - 200.4661)^2)) / (200.4661^1.3))) / (n - 1)

[1] 349.4573

The second method rescales the response variable with the weights, and passes exposures as weights to the model:

# ratio estimator
summary(glm(y / w ~ 1, weights = w, data = sim_data,
            family = statmod::tweedie(var.power = 1.3, link.power = 1)))


Call:
glm(formula = y/w ~ 1, family = statmod::tweedie(var.power = 1.3, 
    link.power = 1), data = sim_data, weights = w)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 114.5185     0.3655   313.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Tweedie family taken to be 492.3754)

    Null deviance: 177557174  on 999999  degrees of freedom
Residual deviance: 177557174  on 999999  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3

This estimate corresponds to the same output as the ratio estimator from the previous section. The estimate of dispersion \(\phi\) is different from the true value of 350, which makes sense as dispersion is a function of the estimated value of the mean \(\mu\).

What does the math look like behind these two models?

To fit a Tweedie model to the data with weights to calculate the two different values in the previous section, the following log-likelihood function – taken from the Yang et al paper – can be maximized.

\[ l(\mu, \phi, p | \{y, x, w\}_{i=1}^n) = \sum_{i = 1}^{n}{ \frac{w_i}{\phi}} \Bigg(y_i \frac{\mu_i^{1-p}}{1-p} - \frac{\mu_i^{2-p}}{2-p} \Bigg) + log\ a(y_i, \phi / w_i, p) \]

For this simulated dataset, \(w_i\) and \(y_i\) vary for each individual, but there are no predictors so an identity link function can be used ¹, the overall mean is of interest so the \(\mu_i\) term can be collapsed into a single \(\mu\) parameter to be optimized, and the log a(...) term can be ignored as it’s not a function of \(\mu\). Since the regression equation for an intercept only term is \(\mu_i = \beta_0\), we can replace the \(\mu\) with \(\beta_0\).

¹ which should be avoided if the model will be used for out-of-sample prediction to avoid predicting values below 0

This makes it easy to compute the closed form solution, which is calculated by differentiating this function to produce the score function, and setting it to 0.

\[ \frac{\partial l}{\partial \beta} = \sum_{i = 1}^{n}{ \frac{w_i}{\phi}} (y_i \beta_0^{-p} - \beta_0^{1-p}) = 0 \]

\(\phi\) and \(\beta_0^{-p}\) are non-zero constants, so can be pulled out of the summation and absorbed into the zero on the right-hand side to give

\[ \sum_{i = 1}^{n}{ w_i (y_i - \beta_0) } = 0 \]

For glm(y ~ 1, weights = w, ...), this equals

\[ \sum_{i = 1}^{n} w_i y_i - \beta_0 \sum_{i = 1}^{n} w_i = 0 \\ \Rightarrow \beta_0 = \frac{\sum_{i = 1}^{n} {w_i y_i}}{\sum_{i = 1}^{n} w_i} \]

and for glm(y / w ~ 1, weights = w, ...), this equals

\[ \sum_{i = 1}^{n} w_i \frac{y_i}{w_i} - \beta_0 \sum_{i = 1}^{n} w_i = 0 \\ \Rightarrow \beta_0 = \frac{\sum_{i = 1}^{n} {y_i}}{\sum_{i = 1}^{n} w_i} \]

A method that works for Poisson but not for Tweedie

What initially led me down this rabbit hole was finding this stackexchange post for specifying the offset with identity link in a poisson model, and naively (and incorrectly) fitting the Tweedie glm like this

summary(glm(y ~ w - 1, data = sim_data,
            family = statmod::tweedie(var.power = 1.3, link.power = 1)))


Call:
glm(formula = y ~ w - 1, family = statmod::tweedie(var.power = 1.3, 
    link.power = 1), data = sim_data)

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
w 121.7328     0.4064   299.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Tweedie family taken to be 465.8885)

    Null deviance:       NaN  on 1000000  degrees of freedom
Residual deviance: 158130565  on  999999  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3

and trying to figure out why this didn’t coincide with the ratio estimate of 114.5185, which is the case for the poisson models

# ratio estimator
sum(sim_data$y) / sum(sim_data$w)

[1] 114.5185

# generates warnings because we're passing a non-discrete response
# point estimates are the same though
unname(coef(suppressWarnings(glm(y ~ w - 1, family = poisson(link = "identity"), data = sim_data))))

[1] 114.5185

unname(coef(suppressWarnings(glm(y / w ~ 1, weights = w, family = poisson(link = "identity"), data = sim_data))))

[1] 114.5185

This happens because substituting \(\mu_i = w_i \beta_0\) instead of \(\mu_i = \beta_0\) in the log-likelihood function

\[ l(\beta_0, \phi, p | \{y, x, w\}_{i=1}^n) = \sum_{i = 1}^{n}{ \frac{w_i}{\phi}} \Bigg(y_i \frac{(w_i \beta_0)^{1-p}}{1-p} - \frac{(w_i\beta_0)^{2-p}}{2-p} \Bigg) + log\ a(y_i, \phi / w_i, p) \]

leads to the following score function

\[ \frac{\partial l}{\partial \beta} = \sum_{i = 1}^{n} \frac{w_i}{\phi} (y_i w_i^{1-p} \beta_0^{-p} - w_i^{2-p} \beta_0^{1-p}) = 0 \]

The constant \(\phi\) and \(\beta_0^{-p}\) terms can be dropped from the equation, and the \(w_i\) outside the bracket are all 1 because no weights are passed to the glm call, so the following equation is solved

\[ \sum_{i = 1}^{n} y_i w_i^{1-p} - w_i^{2-p} \beta_0 = 0 \]

which can be simplified by pulling \(w_i^{1-p}\) as a common term

\[ \sum_{i = 1}^{n} w_i^{1-p} (y_i - w_i \beta_0) = 0 \]

This shows where the logical error happens, as well as how the correct estimate can be obtained, i.e., by passing weights = w ^ (p - 1) to the glm call, so that the \(w_i^{1-p}\) term cancels out

summary(glm(y ~ w - 1, weights = I(w^(1.3 - 1)), data = sim_data,
            family = statmod::tweedie(var.power = 1.3, link.power = 1)))


Call:
glm(formula = y ~ w - 1, family = statmod::tweedie(var.power = 1.3, 
    link.power = 1), data = sim_data, weights = I(w^(1.3 - 1)))

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
w 114.5185     0.3655   313.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Tweedie family taken to be 492.3754)

    Null deviance:       NaN  on 1000000  degrees of freedom
Residual deviance: 177557174  on  999999  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 3

Now the point estimates, standard errors, and dispersion parameters correspond to the model where the ratio estimator is correctly specified.

Here’s the code and the plot of the score equations for the different models:

Code

parameter_grid <- seq(50, 300, 5)

estimating_equation_curves <- tibble(
  parameter = parameter_grid,
   `Mean of ratios` = map_dbl(.x = parameter_grid, .f = ~ {
    sim_data %>%
      mutate(s = w * (y - .x)) %>%
      pull(s) %>%
      sum()
  }),
  `Ratio of means` = map_dbl(.x = parameter_grid, .f = ~ {
    sim_data %>%
      summarize(across(.cols = c(w, y), .fns = sum)) %>%
      mutate(s = y - (.x * w)) %>%
      pull(s)
  }),
  Incorrect = map_dbl(.x = parameter_grid, .f = ~ {
    sim_data %>%
      mutate(s = (w ^ (1 - 1.3)) * (y - w * .x)) %>%
      pull(s) %>%
      sum()
  })
)

estimating_equation_curves %>%
  pivot_longer(-parameter) %>%
  ggplot(aes(x = parameter, y = value, linetype = name)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dotdash") +
  # geom_vline(xintercept = 114.5185, linetype = 2) +
  # geom_vline(xintercept = 121.7328, linetype = 1) +
  # geom_vline(xintercept = 200, linetype = 11) +
  theme_bw() +
  ylab("Score function") +
  xlab("Parameter") +
  theme(legend.position = "bottom", legend.title = element_blank())

References

Yang et al 2016 paper and supplement (link)
Delong et al 2021 paper (link)
Chapter 12 on Tweedie models from the Dunn and Smyth book on GLMs

Others

Zhang 2013 (link)
Stan code for Tweedie
- https://discourse.mc-stan.org/t/tweedie-likelihood-compound-poisson-gamma-in-stan/14636
- this gist