Simulating data from the cause-specific hazard approach to competing risks

R
Simulation
Time-to-event
Published

August 9, 2025

In a previous post, I used the cumulative hazard inversion method to simulate time-to-event (TTE) data from a log-logistic accelerated-failure time (AFT) model with statistical cure. The estimand of interest was the cumulative incidence function (CIF), which is the probability of experiencing the event of interest before time \(t\) (i.e., \(\text{Pr}[T \leq t]\), also known as the cumulative distribution function (CDF) in the absence of censoring).

I came across two things while reading up on material for that post that I’d mentally filed away to look into later. The first was simulating TTE data using numerical methods. The second was estimating the CIF in the competing risk (CR) setting, where it isn’t correct to invert the survivor function (\(S(t) = \text{Pr}[T > t] = 1 - \text{Pr}[T \le t]\)) to get to the CIF.

So this post is on simulating data from the cause-specific hazards (CSH) approach to CR based on this paper (Beyersmann et al. 2009); other key papers I use material from are listed in the references below. To check whether my implementation (and understanding) is correct, I run a mini simulation to see if I can recover the true parameters used for generating the data. I also compare the estimated cause-specific CIFs with the true CIFs.

I initially tried to use the survsim package to simulate CSH CR data, but I found it quite slow. What I was doing suboptimally only became clear after digging deeply into it and reimplementing the code myself, studying the implementation of survsim::crisk.ncens.sim and the Beyersmann paper. My implementation has the advantage that it runs noticeably faster, mostly because it uses the cumulative hazard functions programmed in flexsurv instead of numerical integration.

I mean I’d most likely still use survsim, unless I need to simulate a lot of not very small datasets, or want the additional flexibility around simulation-model specification.

I think adjustedCurves::sim_confounded_crisk() can simulate CR data from the CSH approach as well, but I haven’t tried it out.

I’m using the following packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(survival)
library(flexsurv)
# functions called via namespace::fun() to avoid conflicts
# library(glue)  # string interpolation
# library(mirai)  # for setting daemons() for parallelization
# library(survsim)
# library(ggh4x)  # facet_grid2 function
# library(bench)  # for benchmarking

theme_set(theme_bw())

Data generating process

In the usual TTE setting, interest is on a single event. An example from marketing is that of customer churn, where a customer cancels their subscription. The competing risk approach to TTE analysis extends this to cases where more than one event can occur.

Sticking with the churn example, this could be (say) three possible event types (using the event indicator \(D\) with values \(k = 1, 2, 3 (= K)\)) — churn due to customer dissatisfaction with some aspect of the product / service which is the main event of interest that a company would like to reduce (\(D = 1\)), churn due to financial reasons (\(D = 2\)), and churn due to unavoidable reasons (\(D = 3\)). \(D = 0\) indicates not having churned (yet). I’m using an example just to make things a bit more concrete; these hazard functions are unlikely to be representative of any actual churn process.

This single vs multiple TTE analysis is analogous to the distinction between binary (binomial) vs categorical (multinomial) outcomes, with the former taking on values 0 / 1, and the latter taking on more than 2 values e.g., 0 / 1 / 2 / 3.

Distributional parameters

The three churn event types are assumed to have the hazard functions corresponding to these distributions

\[ \begin{align*} T^1 | x_1 = 0 &\sim \text{Weibull}(\text{shape} = 1.6, \text{scale} = 5) \\ T^2 | x_1 = 0 &\sim \text{Log-logistic}(\text{shape} = 5, \text{scale} = 5) \\ T^3 | x_1 = 0 &\sim \text{Exponential}(\text{rate} = 1/5) \end{align*} \]

for the control arm (\(x_1 = 0\)), assuming some experiment is carried out to assess the effectiveness of an action intended to reduce churn. The treatment arm is coded as \(x_1 = 1\), with treatment assumed to reduce the rate of churn through a time acceleration factor of 1.4, i.e., a 40% stretching of event times. So we have

\[ \begin{align*} T^1 | x_1 = 0 &\sim \text{Weibull}(\text{shape} = 1.6, \text{scale} = 5) \\ T^1 | x_1 = 1 &\sim \text{Weibull}(\text{shape} = 1.6, \text{scale} = 7) \end{align*} \]

and \(T^2 |x_1\) and \(T^3 | x_1\) are the same for \(x_1 = 0\) and \(x_1 = 1\).

Choosing \(x_1\) to have a non-zero effect only on \(T^1\) means I’m avoiding some of the messiness that comes with interpreting treatment effects on the CIFs.

I’m also assuming a 50-50 (%) split of \(x_1\) into the two groups.

These values for the shape / scale / rate parameters were chosen to have decent balance across the different event types and to not have some categories with very low event counts.

Latent failure time approach

The use of \(T^{\bullet}\) makes it seem like I’m invoking the latent failure time approach to CR, but I’m only using it as a convenient way of indicating the chosen CSH functions. As pretty much all papers on CR warn, the latent variable approach — which retains \(T_i = \min(T^1_i, \dots, T^K_i)\) for each individual \(i\) — relies on dependence structures between the latent event time variables \(T^k\) that are generally not identifiable (and hence not verifiable) from observed data.

This is sad because the latent variable approach seems a lot more intuitive and is very simple to generate data from when the latent event times are assumed to be independent

n <- 7
set.seed(25)
tibble(
  # generate group membership
  x1 = rbinom(n, 1, prob = 0.5),
  # generate counterfactual for x1 = 0
  t1_0 = rweibull(n, shape = 1.6, scale = 5),
  # and counterfactual for x1 = 1 (using scale = 7)
  t1_1 = rweibull(n, shape = 1.6, scale = exp(log(5) + log(1.4))),
  # use the consistency equation to realize the latent t1
  t1 =  (x1 * t1_1) + ((1 - x1) * t1_0),
  t2 = rllogis(n, shape = 5, scale = 5),
  # exponential distribution, same as rexp(n, 1 / 5)
  t3 = rweibull(n, shape = 1, scale = 5),
  # administrative censoring
  cens_time = 4,
  # take the minimum of the latent event and censoring times
  time = pmin(t1, t2, t3, cens_time), 
  # record which event / censor time is the smallest
  event = case_when(
    cens_time < t1 & cens_time < t2 & cens_time < t3 ~ 0,
    t1 < t2 & t1 < t3 ~ 1,
    t2 < t1 & t2 < t3 ~ 2,
    t3 < t1 & t3 < t2 ~ 3
  )
) %>% 
  mutate(across(.cols = everything(), .fns = ~ round(.x, 2)))
# A tibble: 7 × 9
     x1  t1_0  t1_1    t1    t2    t3 cens_time  time event
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
1     0  5.26  3.72  5.26  4.81  3.15         4  3.15     3
2     1  9.32 10.5  10.5   3.17  7.18         4  3.17     2
3     0  5.79  5.18  5.79  3.56  0.97         4  0.97     3
4     1  5.35  3.42  3.42  4.19  5.26         4  3.42     1
5     0  5.04  5.8   5.04  4.22  7.2          4  4        0
6     1  0.69  3.4   3.4   3    11.4          4  3        2
7     1  3.36 14.5  14.5   5.39  0.92         4  0.92     3

I’m quite sure that if I used this for my simulation I’d still end up with unbiased parameter estimates, but the assumption of independence may not be very realistic depending on the application.

A future rabbit hole to go down would be to look into the literature on using copulas for modelling dependence between the latent event times.

The event time for each individual will be referred to using a single random variable \(T\) since no latent variables are assumed.

Cause-specific hazards (CSH)

We can plot the CSH functions defined as

\[ h_k(t | x_1) = \lim_{\Delta t \rightarrow 0} \frac{\text{Pr}[t \le T < t + \Delta t, D = k | T \ge t, x_1]}{\Delta t} \]

for the \(k^\text{th}\) event showing the instantaneous rate of occurrence of that event

Code
time_seq <- seq(0.01, 50, 0.01)
hazards <- bind_rows(
  tibble(
    Distribution = "Event 1: Weibull(shape = 1.6, scale = 5)",
    time = time_seq,
    Hazard = hweibull(time_seq, shape = 1.6, scale = 5)
  ),
  tibble(
    Distribution = "Event 1: Weibull(shape = 1.6, scale = 7)",
    time = time_seq,
    Hazard = hweibull(time_seq, shape = 1.6, scale = 7)
  ),
  tibble(
    Distribution = "Event 2: Log-logistic(shape = 5, scale = 5)",
    time = time_seq,
    Hazard = hllogis(time_seq, shape = 5, scale = 5)
  ),
  tibble(
    Distribution = "Event 3: Exponential(rate = 5)",
    time = time_seq,
    Hazard = hweibull(time_seq, shape = 1, scale = 5)
  )
)

hazards %>%
  ggplot(aes(x = time, y = Hazard, group = Distribution, 
             linetype = Distribution)) +
  geom_line() +
  xlab("Time (t)") +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom"
  ) +
  guides(linetype = guide_legend(nrow = 2))

I’m using the hazard functions implemented within flexsurv (e.g., flexsurv::hweibull()). The exponential hazard is constant over time. The log-logistic hazard first rises and then falls with time. The Weibull hazards for the chosen parameters are monotonically increasing functions of time. The hazard for each event is independent of the hazard for all other events.

Relative contribution to the overall hazard

As mentioned in the Hinchliffe and Lambert (2013) paper, it can be useful to look at the relative contribution of the hazard of each event to the overall hazard within each group at each \(t\)

\[ \text{Pr}[D = k | t \le T < t + \Delta t, T \ge t, x_1] = \frac{h_k(t | x_1)}{\sum_{k = 1}^K h_k(t | x_1)} \]

This can be interpreted as the probability of an event of type \(k\) among those who experience an event at time \(t\) and haven’t experienced any of the events before \(t\).

Code
# prob(cause | event = 1)
cause_specific_hazards_by_group <- bind_rows(
  hazards %>%
    # for x = 0, drop the cause 1 for x = 1
    filter(Distribution != "Event 1: Weibull(shape = 1.6, scale = 7)") %>%
    mutate(Group = "x1 = 0"),
  hazards %>%
    # for x = 1, drop the cause 1 for x = 0
    filter(Distribution != "Event 1: Weibull(shape = 1.6, scale = 5)") %>%
    mutate(Group = "x1 = 1")
)

relative_contribution_to_overall_hazard <- cause_specific_hazards_by_group %>%
  arrange(Group, time, Distribution) %>%
  group_by(Group, time) %>%
  mutate(overall_hazard = sum(Hazard), p = Hazard / overall_hazard) %>%
  ungroup()

relative_contribution_to_overall_hazard %>%
  ggplot(aes(x = time, y = p, 
             group = interaction(Group, Distribution), 
             linetype = Distribution)) +
  geom_line() +
  xlab("Time (t)") +
  ylab("Pr [event type k | any event]") +
  facet_wrap(~ Group) +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom"
  ) +
  guides(linetype = guide_legend(nrow = 2))

Individuals that churn very soon after \(t = 0\) are more likely to have had \(D = 3\), followed by \(D = 1\), then \(D = 2\). Events that occur around \(t = 5\) are most likely to be \(D = 2\), as shown by the long-dash line peaking after that time.

Cause-specific cumulative incidence function (CIF)

A key quantity is the cumulative incidence function (CIF) for each of the \(K = 3\) events, which is the probability that an individual experiences event \(k\) by time \(t\) without having experienced any of the other events first

\[ \text{CIF}_k(t|x_1) = \text{Pr}[T \le t, D = k | x_1] = \int_{u=0}^{u=t} h_k(u | x_1) S(u | x_1) du \]

\(h_k(t|x_1)\) is the CSH function, and \(S(t|x_1)\) is the overall survivor function in each group. The latter is defined as the probability of being free of any event up to time \(t\)

\[ S(t | x_1) = \text{Pr}[T > t | x_1] = \text{exp}\Bigg(- \sum_{k = 1}^K H_k(t | x_1)\Bigg) \]

\(H_k(t | x_1)\) is the cumulative hazard function for the \(k^{\text{th}}\) event in each group

\[ H_k(t | x_1) = \int_{u = 0}^{u = t} h_k(u | x_1) du \]

The cumulative hazard functions are used as implemented in flexsurv (e.g., flexsurv::Hweibull()). In discrete time, the integral for the CIF is replaced by a summation, and \(S(u|\cdot)\) in the integrand / summand is replaced with \(S(u^- | \cdot)\), the survival probability at the time point right before \(u\). The CIFs and \(S(t)\) are probabilities of being in the different states by time \(t\)
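As a minimal sketch of that discrete-time sum for cause 1 in the control arm (the vector names here are mine; the chunk below does the same for all causes and both arms using \(S(u)\), which on this fine grid is practically indistinguishable from \(S(u^- )\)):

# discrete-time approximation of CIF_1(t | x1 = 0) on a fine grid
dt <- 0.01
time_grid <- seq(0.01, 50, by = dt)
haz1 <- hweibull(time_grid, shape = 1.6, scale = 5)
surv <- exp(-(Hweibull(time_grid, shape = 1.6, scale = 5) +
                Hllogis(time_grid, shape = 5, scale = 5) +
                Hweibull(time_grid, shape = 1, scale = 5)))
# lag(surv) shifts the survival one grid point back, i.e. S(u^-), with S(0^-) = 1
cif1 <- cumsum(haz1 * dplyr::lag(surv, default = 1) * dt)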

Code
# calculate the overall survival from the overall hazard
overall_survival <- time_seq %>% 
  expand_grid(
    x1 = c(0, 1), time = .
  ) %>%
  mutate(
    surv = exp(-(
      Hweibull(x = time, shape = 1.6, scale = exp(log(5) + log(1.4) * x1), 
               log = FALSE) +
      Hllogis(x = time, shape = 5, scale = 5, log = FALSE) +
      Hweibull(x = time, shape = 1, scale = 5, log = FALSE)
    )),
    Group = paste0("x1 = ", x1)
  ) %>%
  select(-x1)

time_step <- time_seq[2] - time_seq[1]

# calculate CIFs
cumulative_incidence_curves <- cause_specific_hazards_by_group %>%
  inner_join(y = overall_survival, by = c("Group", "time")) %>%
  group_by(Group, Distribution) %>%
  mutate(CIF = order_by(time, cumsum(Hazard * surv * time_step))) %>%
  ungroup()

# plot the CIFs
# P[T <= t, cause = k | Event]
cumulative_incidence_curves_plot_data <- cumulative_incidence_curves %>%
  select(-Hazard) %>%
  pivot_wider(id_cols = c(time, Group, surv), 
              names_from = Distribution, values_from = CIF) %>%
  rename(`Event-free` = surv) %>%
  pivot_longer(cols = -c(time, Group), 
               names_to = "Event", values_to = "CIF") %>%
  filter(complete.cases(.))

cumulative_incidence_curves_plot <- cumulative_incidence_curves_plot_data %>%
  ggplot(aes(x = time, y = CIF, linetype = Event)) +
  geom_line() +
  xlab("Time (t)") +
  ylab("Probability of being in state") +
  facet_wrap(~ Group) +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom"
  ) +
  guides(linetype = guide_legend(nrow = 3))

cumulative_incidence_curves_plot

The CIFs are called subdistribution functions because they never reach 1 (at \(t = \infty\)) due to the presence of competing events, and so are not proper probability distributions. In the single-event case, the CIF eventually reaches 1 given enough time.

Within each panel, the curves sum to 1 at every time point, since each individual is in one state at each time point

\[ \sum_{k = 1}^K\text{CIF}_k(t | x_1) + S(t | x_1) = 1 = \text{Pr}[T \le t, D = 1 | x_1] + \text{Pr}[T \le t, D = 2 | x_1] + \text{Pr}[T \le t, D = 3 | x_1] + \text{Pr}[T > t | x_1] \]

All individuals start out event-free at \(t = 0\), but as time progresses they move into one of the \(K\) event states, so the probability of being event-free, i.e., staying in state \(D = 0\), decreases with time and eventually goes to 0 (here around \(t = 7\)).
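As a quick numerical check of this identity (a sketch; the overall_surv() and cif_k() helpers below are mine, built from the flexsurv hazard and cumulative hazard functions), the three CIFs plus the overall survival evaluate to 1 at arbitrary time points in the control arm:

# overall survival and a generic cause-specific CIF via numerical integration (x1 = 0)
overall_surv <- function(u) {
  exp(-(Hweibull(u, shape = 1.6, scale = 5) +
          Hllogis(u, shape = 5, scale = 5) +
          Hweibull(u, shape = 1, scale = 5)))
}
cif_k <- function(t, cs_hazard) {
  integrate(function(u) cs_hazard(u) * overall_surv(u), lower = 0, upper = t)$value
}
# CIF_1 + CIF_2 + CIF_3 + S should be ~1 at any t (up to numerical integration error)
sapply(c(2, 5, 10), function(t) {
  cif_k(t, function(u) hweibull(u, shape = 1.6, scale = 5)) +
    cif_k(t, function(u) hllogis(u, shape = 5, scale = 5)) +
    cif_k(t, function(u) hweibull(u, shape = 1, scale = 5)) +
    overall_surv(t)
})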

One difference between this plot and the unnormalized hazard plot is that here the hazard of each event contributes to the CIF of all the other events through its dependence on the overall survival. So \(\text{CIF}_2(t | x_1 = 0) \neq \text{CIF}_2(t | x_1 = 1)\) and \(\text{CIF}_3(t | x_1 = 0) \neq \text{CIF}_3(t | x_1 = 1)\) because \(H_1(t | x_1 = 0) \ne H_1(t | x_1 = 1)\), even though \(H_2(t | x_1 = 0) = H_2(t | x_1 = 1)\) and \(H_3(t | x_1 = 0) = H_3(t | x_1 = 1)\).

Relative contribution to overall churn

Similar to the relative contribution of the CSHs to the overall hazard, Hinchliffe and Lambert (2013) also mention the rescaled cause-specific CIFs (disregarding \(S(t | x_1)\))

\[ \text{Pr}[D = k | T \le t, x_1] = \frac{\text{CIF}_k(t | x_1)}{\sum_{k = 1}^K \text{CIF}_k(t | x_1)} \]

which is the probability of having had event \(k\) among those who have churned by time \(t\) within levels of \(x_1\)

Code
# plot the rescaled CIFs conditional on event
# so ignores the no-event state
# P[T <= t, cause = k | Event]
cumulative_incidence_curves %>%
  mutate(rescaled_CIF = CIF / sum(CIF), .by = c(Group, time)) %>%
  # arrange(Group, time) %>%
  # print()
  ggplot(aes(x = time, y = rescaled_CIF, linetype = Distribution)) +
  geom_line() +
  xlab("Time (t)") +
  ylab("Pr [event type k by t | any event]") +
  facet_wrap(~ Group) +
  theme(
    legend.title = element_blank(),
    legend.position = "bottom"
  ) +
  guides(linetype = guide_legend(nrow = 2))

Implement the simulation steps

The key idea behind simulating CR data from the CSH approach is summarized in the following steps (here within each level of covariate \(x_1\)):

  • specify the cause-specific hazards \(h_k(t | x_1)\)

  • simulate survival time \(T_i\) for the \(i^\text{th}\) individual with the all-cause cumulative hazard \(\sum_{k=1}^K H_k(t | x_1)\)

  • for a simulated \(T_i\), simulate the event type \(D_i\) from a multinomial distribution with \(k\)-dimensional probability vector from the scaled hazards \(h_k(t | x_1) / \sum_{k = 1}^K h_k(t | x_1)\). This gives the pair \((T_i, D_i)\) for the \(i^\text{th}\) individual

  • simulate censoring time \(C_i\) and set \(T_i = C_i\) if \(C_i < T_i\) (and set \(D_i = 0\)).

The event time \(T_i\) is generated by inverting the survivor function. By the probability integral transform, \(F(T) \sim \text{Uniform}[0, 1]\), so we can draw \(u_i \sim U[0, 1]\) and solve \(F(t_i) = u_i\). Using \(F(t) = 1 - S(t)\), this is equivalent to \(S(t_i) = 1 - u_i\), which gives

\[ 1 - u = \text{exp}\Bigg[- \sum_{k = 1}^K H_k(t) \Bigg] \]

Taking logs on both sides and rearranging gives the objective function

\[ \text{log}(1 - u) + \sum_{k = 1}^K H_k(t) = 0 \]

For a randomly drawn \(u_i \sim U[0,1]\) (and given \(x_{i, 1}\)), numerical root finding methods can be employed to estimate the value of \(t_i\) which satisfies this equation. This process of using \(u_i\) to estimate \(t_i\) is known as inverse transform sampling.
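As a small sanity check of the mechanics, with a single exponential cause (rate \(\lambda\)) the root has the familiar closed form \(t = -\text{log}(1 - u) / \lambda\), and uniroot recovers it; a sketch with \(\lambda = 0.2\):

# single exponential cause with rate 0.2: closed-form inversion vs numerical root
u <- 0.8
-log(1 - u) / 0.2
# [1] 8.04719
uniroot(function(t) log(1 - u) + 0.2 * t, interval = c(1e-10, 100))$root
# ~8.047 as well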

Overall hazard

This function implements the sum of the cumulative hazards \(H_k(t | x_1)\) which is 0 at \(t = 0\) but increases without bound beyond that

all_cause_cumulative_hazards <- function(t, x1) {
  sum(
    Hweibull(x = t, shape = 1.6, scale = exp(log(5) + log(1.4) * x1), log = FALSE),
    Hllogis(x = t, shape = 5, scale = 5, log = FALSE),
    Hweibull(x = t, shape = 1, scale = 5, log = FALSE)
  )
}

all_cause_cumulative_hazards(t = 0, x1 = 0)
[1] 0
all_cause_cumulative_hazards(t = 10, x1 = 0)
[1] 8.527941
all_cause_cumulative_hazards(t = 10, x1 = 1)
[1] 7.265977

Event type probabilities

This next one calculates the event probabilities (conditional on an event at \(t\)) \(\text{Pr}[D = k | t \le T < t + \Delta t, T \ge t, x_1]\) given \(t\) and \(x_1\)

cause_specific_probabilities <- function(t, x1) {
  hazards <- c(
    hweibull(x = t, shape = 1.6, 
             scale = exp(log(5) + log(1.4) * x1), log = FALSE),
    hllogis(x = t, shape = 5, scale = 5, log = FALSE),
    hweibull(x = t, shape = 1, scale = 5, log = FALSE)
  )

  hazards / sum(hazards)
}

cause_specific_probabilities(t = 0, x1 = 0)
[1] 0 0 1
cause_specific_probabilities(t = 10, x1 = 0)
[1] 0.4145983 0.4144437 0.1709580
cause_specific_probabilities(t = 10, x1 = 1)
[1] 0.2924853 0.5008953 0.2066193

The function returns a \(K\)-dimensional vector which sums to 1.

Objective function

Implement the objective function

obj_fun <- function(t, u, x1) {
  log(1 - u) + all_cause_cumulative_hazards(t, x1)
}

obj_fun(t = 3.771, u = 0.8, x1 = 0)
[1] -0.0001226652
tibble(
  time = time_seq[time_seq < 10.0], 
  obj_val = map_dbl(
    .x = time, 
    .f = ~ obj_fun(.x, u = 0.8, x1 = 0)
    )
  ) %>% 
  ggplot(aes(x = time, y = obj_val)) + 
  geom_line() + 
  geom_hline(yintercept = 0, linetype = "dotted") + 
  geom_vline(xintercept = 3.771, linetype = "dashed") + 
  xlab("Time (t)") + 
  ylab("Objective function")

This plot corresponding to a grid search on \(t \in (0, 10)\) shows that for \(u_i = 0.8\), \(t_i \approx 3.771\) (dashed vertical line) is very close to the root since the objective function evaluates to approximately -0.00012.

Doing grid search is kinda tedious, but we can use stats::uniroot to find the root \(t_i\) for a given \(u_i\) (and \(x_{i, 1}\)) and return the pair \((T_i, D_i)\).

Simulate single outcome

simulate_single_outcome <- function(u, x1, admin_censor_time = 5, lower_bound_time = 1e-10) {

  # check whether simulated time would lie outside the time interval we want to solve for
  # in this case the function would have the same sign at the interval endpoints
  # if the function has different signs at the endpoints then a root is within the interval
  same_sign <- (obj_fun(u = u, t = 0.00001, x1 = x1) * obj_fun(u = u, t = admin_censor_time, x1 = x1)) > 0

  if (same_sign) {
    time <- admin_censor_time
    cause <- 0
    iter <- NA_real_
    fn_value_at_root <- NA_real_
  } else {
    # for a handful of cases (out of 100,000s) we can end up with t = 0 from a distribution that doesn't support it
    # https://stats.stackexchange.com/questions/176376/invalid-survival-times-for-this-distribution
    # so we can set a lower bound on time > 0
    uniroot_obj <- uniroot(
      f = obj_fun, 
      interval = c(lower_bound_time, admin_censor_time), 
      # pass arguments to the objective function
      u = u, x1 = x1
    )
    time <- uniroot_obj$root
    cause <- which.max(
      rmultinom(1, 1, cause_specific_probabilities(t = time, x1 = x1))
    )
    iter <- uniroot_obj$iter
    fn_value_at_root <- uniroot_obj$f.root
  }

  tibble(u, x1, admin_censor_time, iter, fn_value_at_root, time, cause)
}

bind_rows(
  # non-censored
  simulate_single_outcome(u = 0.2, x1 = 1, admin_censor_time = 5),
  # censored
  simulate_single_outcome(u = 0.95, x1 = 1, admin_censor_time = 5),
  # same u but larger admin censored time means not censored
  simulate_single_outcome(u = 0.95, x1 = 1, admin_censor_time = 12)
)
# A tibble: 3 × 7
      u    x1 admin_censor_time  iter fn_value_at_root  time cause
  <dbl> <dbl>             <dbl> <dbl>            <dbl> <dbl> <dbl>
1  0.2      1                 5     6     -0.00000118  0.920     3
2  0.95     1                 5    NA     NA           5         0
3  0.95     1                12     7     -0.000000103 5.76      2

A lower bound on \(t\) close to 0 is applied to avoid ending up with simulated times of exactly 0, which can cause problems when fitting some parametric distributions. If the drawn \(u_i\) is large enough that the objective function has the same sign at both endpoints of the interval, then we skip the root finding, as that observation would be treated as censored anyway.
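For example, the censored row in the table above (\(u = 0.95\), \(x_1 = 1\), censoring at 5) corresponds to the objective function being negative at both interval endpoints (values below are approximate):

# u = 0.95 is large enough that the root lies beyond the censoring time of 5
obj_fun(t = 1e-5, u = 0.95, x1 = 1)  # roughly -3.0
obj_fun(t = 5, u = 0.95, x1 = 1)     # roughly -0.72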

Administrative censoring at \(t = 5\) is assumed, which should lead to about 7% censored individuals in \(x_1 = 0\) and 10% in \(x_1 = 1\) (and about 8-9% overall) as seen in this zoomed-in version of the CIF plot (from a few sections above)

Code
cumulative_incidence_curves_plot + 
  geom_vline(xintercept = 5.0, color = "forestgreen") +
  geom_hline(
    data = tibble(Group = c("x1 = 0", "x1 = 1"), y = c(0.068, 0.102)),
    aes(yintercept = y), color = "purple"
  ) +
  coord_cartesian(xlim = c(4.5, 6), ylim = c(0.05, 0.12))
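Those percentages are just the overall survival at \(t = 5\) in each arm, which can be checked directly with the all_cause_cumulative_hazards() function defined earlier:

# expected censoring proportions are Pr[T > 5 | x1] = exp(-sum of cumulative hazards at 5)
exp(-all_cause_cumulative_hazards(t = 5, x1 = 0))  # ~0.068
exp(-all_cause_cumulative_hazards(t = 5, x1 = 1))  # ~0.103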

Simulate single dataset

Finally, the next function stitches together the steps for simulating a full dataset with \(n = 1,000\) individuals and their group indicator \(x_{i, 1} \in \{0, 1\}\), event time \(T_i\), and event indicator \(D_i \in \{0, 1, 2, 3\}\).

simulate_single_dataset <- function(n_rows = 1000, prop_x1 = 0.5, seed = 12345) {
  if(!is.null(seed)) set.seed(seed)
  x1 <- rbinom(n = n_rows, size = 1, prob = prop_x1)
  u <- runif(n_rows)
  1:n_rows %>% 
    map(.f = ~ simulate_single_outcome(u = u[.x], x1 = x1[.x])) %>%
    list_rbind()
}

simulate_single_dataset(n_rows = 5)
# A tibble: 5 × 7
      u    x1 admin_censor_time  iter fn_value_at_root  time cause
  <dbl> <int>             <dbl> <dbl>            <dbl> <dbl> <dbl>
1 0.166     1                 5     6     -0.00000122  0.765     3
2 0.325     1                 5     5     -0.000000279 1.52      1
3 0.509     1                 5     6     -0.00000489  2.47      3
4 0.728     1                 5     6     -0.000000505 3.70      2
5 0.990     0                 5    NA     NA           5         0

Test the implementation

Parameter estimates

To check that the simulation function works, a bunch of functions are defined in the next code chunk that take a single simulated dataset, convert it into cause-specific datasets (for each event type \(k\), update the event indicator by censoring competing events), fit a (correctly specified) parametric model separately to each cause-specific dataset, extract the coefficients from each model, and return them as a data frame.

create_cause_specific_datasets <- function(data) {
  map(
    .x = 1:3,
    .f = ~ {
      data %>%
        select(x1, time, cause) %>%
        mutate(event = as.numeric(cause == .x))
    }
  )
}

fit_cause_specific_models <- function(cause_specific_datasets, distributions = c("weibull", "llogis", "weibull")) {
  map2(
    .x = cause_specific_datasets,
    .y = distributions,
    .f = ~ flexsurvreg(Surv(time = time, event = event) ~ x1, data = .x, dist = .y)
  )
}

extract_coefs <- function(list_models) {
  map2_dfr(.x = list_models, .y = 1:length(list_models), .f = ~ {
    .x %>%
      # calls flexsurv:::tidy.flexsurvreg()
      flexsurv::tidy() %>%
      select(term, estimate, se = std.error) %>%
      mutate(cause = .y, .before = term)
  })
}

run_pipeline <- function(seed) {
  seed %>%
    simulate_single_dataset(seed = .) %>%
    create_cause_specific_datasets() %>%
    fit_cause_specific_models() %>%
    extract_coefs() %>%
    mutate(rep = seed, .before = cause)
}

# wrap it up into purrr::safely() since it fails 
# for some seeds (e.g. 125)
safely_run_pipeline <- safely(run_pipeline)

safely_run_pipeline(seed = 3)
$result
# A tibble: 9 × 5
    rep cause term  estimate     se
  <dbl> <int> <chr>    <dbl>  <dbl>
1     3     1 shape   1.66   0.0801
2     3     1 scale   4.93   0.246 
3     3     1 x1      0.297  0.0724
4     3     2 shape   5.50   0.359 
5     3     2 scale   4.88   0.153 
6     3     2 x1      0.0281 0.0391
7     3     3 shape   1.07   0.0418
8     3     3 scale   4.99   0.333 
9     3     3 x1     -0.0972 0.0851

$error
NULL

The safely_run_pipeline() function returns the point estimates and standard errors for the shape, scale, and coefficient for \(x_1\) from each cause-specific model.

This process is repeated 5,000 times. A single run takes about 1.6 seconds on my machine, so to save time, the runs are parallelized. Using the recently added purrr::in_parallel function cuts down the total execution time from a little over 2 hours to about 20 minutes (it still took 3x longer than I expected, so I’d like to look into that later — overhead is expected from parallelization but not 3x. Maybe it would have been better to use fewer cores.).

Using the version of purrr at the time of writing, in_parallel requires that all these functions defined above are nested into one big function which can then be shipped off to the executors. All the package functions should be explicitly namespaced via package::fun() as well. The full script for running this pipeline is here.
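For reference, a rough sketch of the wiring (not the actual script; run_pipeline_nested() below is a stand-in for the big nested function, and the exact in_parallel() interface is worth double-checking against the purrr documentation):

library(purrr)
library(mirai)

# stand-in for the nested pipeline function that would wrap simulate_single_dataset(),
# create_cause_specific_datasets(), fit_cause_specific_models(), and extract_coefs()
run_pipeline_nested <- function(seed) {
  set.seed(seed)
  data.frame(rep = seed, estimate = rnorm(1))
}

daemons(6)  # start 6 background workers

results <- map(
  1:20,
  in_parallel(
    \(s) run_pipeline_nested(s),
    # objects used inside the worker have to be shipped explicitly
    run_pipeline_nested = run_pipeline_nested
  )
) |>
  list_rbind()

daemons(0)  # shut the workers down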

Saved results can be read back in. 44 (about 0.9%) of the simulations failed at some step; their seeds are listed below.

Code
n_reps <- 5000
failed_parallel_runs <- read_rds("saved_objects/failed_parallel_runs.rds")
failed_parallel_runs
 [1]    5   39  124  125  315  319  445  510  662  672  765 1058 1105 1238 1494
[16] 1673 1832 1844 2152 2176 2277 2399 2521 2526 2888 2944 2995 3065 3574 3657
[31] 3754 3818 4222 4234 4246 4259 4326 4414 4465 4481 4482 4742 4873 4978
Code
length(failed_parallel_runs) / n_reps
[1] 0.0088

For the seeds that ran without error, the estimates can be plotted with the true parameter values overlaid for comparison. No transformation is needed when using flexsurv:::tidy.flexsurvreg() (except for the \(x_1\) coefficient which is reported on the log scale)

Code
true_params <- tribble(
  ~ cause, ~ term, ~ estimate,
  1L, "shape", 1.6,
  1L, "scale", 5,
  1L, "x1", log(1.4),
  2L, "shape", 5,
  2L, "scale", 5,
  2L, "x1", 0,
  3L, "shape", 1,
  3L, "scale", 5,
  3L, "x1", 0
)

parameter_estimates_parallel <- read_rds("saved_objects/parameter_estimates_parallel.rds")

parameter_estimates_parallel %>% 
  ggplot(aes(x = estimate, group = interaction(cause, term))) +
  geom_density() +
  geom_vline(data = true_params, aes(xintercept = estimate)) +
  xlab(glue::glue("Sampling distribution of simulation parameters ", 
                  "(5,000 datasets with n = 1,000)")) +
  ggh4x::facet_grid2(rows = vars(cause), cols = vars(term), 
                     scales = "free", independent = TRUE) +
  theme_classic() + 
  theme(legend.position = "bottom")

It’s reassuring but not surprising, given the ubiquity of this approach to CSH CR analysis, to see the density estimates centered at the true values indicating unbiasedness of the CSH approach.

CIF estimates

Since the parameters are estimated correctly, any downstream quantities like the cause-specific CIFs and overall survival should be unbiased too. In the next figure, I’m visualizing these for the four states within \(x_1 = 0\). The true curves and the average of 100 curves from randomly simulated datasets (restricted to \(t \in (0, 10]\)) are pretty close

Code
sampling_distribution_CIFs <- read_rds("saved_objects/sampling_distribution_CIFs.rds")

sampling_distribution_CIFs_mean <- sampling_distribution_CIFs %>%
  summarise(CIF = mean(CIF), .by = c(Event, time))

sampling_distribution_CIFs %>%
  ggplot(aes(x = time, y = CIF, group = rep)) +
  geom_line(color = "gray80") +
  geom_line(
    data = cumulative_incidence_curves_plot_data %>%
    filter(Group == "x1 = 0", time <= 10.0) %>% 
    mutate(Event = fct_relevel(factor(Event), "Event-free", after = 0)),
    aes(x = time, y = CIF),
    color = "gray20", linewidth = 1.5,
    inherit.aes = FALSE
  ) +
  geom_line(
    data = sampling_distribution_CIFs_mean,
    aes(x = time, y = CIF),
    color = "gray20", linewidth = 1.5, linetype = "dotted",
    inherit.aes = FALSE
  ) +
  geom_vline(xintercept = 5, linetype = "dashed", color = "gray20") +
  xlab(glue::glue("Time (t)\nSampling distribution of curves ", 
                  "estimated from 100 datasets with ", 
                  "n = 1,000 (light gray); \ntrue curve ", 
                  "(gray, solid); averaged curve (gray, dotted);",
                  "\nmaximum observed time (gray, dashed)")) +
  ylab("Probability of being in state") +
  facet_wrap(~ Event, scales = "free_y")

Intuitively, the variability (or lack thereof) boils down to the number of events of each type \(k\). Events that are more likely to occur first (in this case \(k = 3\)) will have more observed events and hence more precise estimates. The estimate for \(S(t|x_1 = 0)\) is the most precise here (all the light gray lines are very close to the dark solid line) because it uses information from all the events.

Each individual curve is produced using this function

estimate_CIF <- function(seed, time_seq, covariate_level = data.frame(x1 = 0)) {
  list_models <- seed %>%
    simulate_single_dataset(seed = .) %>%
    create_cause_specific_datasets() %>%
    fit_cause_specific_models()

  transition_matrix <- matrix(data = c(c(NA, 1, 2, 3), rep(NA, 12)), 
                              nrow = 4, ncol = 4, byrow = TRUE)

  multi_state_object <- fmsm(list_models[[1]], 
                             list_models[[2]], 
                             list_models[[3]], 
                             trans = transition_matrix)

  multi_state_object %>% 
    pmatrix.fs(
      trans = transition_matrix, t = time_seq, 
      newdata = covariate_level, tidy = TRUE
    ) %>%
    as_tibble() %>%
    filter(start == 1) %>%
    select(-start) %>%
    pivot_longer(cols = -time, names_to = "Distribution", values_to = "CIF")
}

safely_estimate_CIF <- safely(estimate_CIF)

and the next code runs the pipeline for producing these 100 CIF estimates

Code
time_seq_subset <- time_seq %>% 
  keep(.p = ~ .x <= 10.0)

sampling_distribution_CIFs <- 1:100 %>%
  map(
    .f = ~ {
      print(glue::glue("rep: {.x}"))
      safely_estimate_CIF(
        seed = .x, 
        time_seq = time_seq_subset, 
        covariate_level = data.frame(x1 = 0)
      )
    }
  ) %>%
  # remove the seeds where the estimation failed
  imap(
    .f = ~ {
      res <- pluck(.x, "result")
      if (!is.null(res)) {
        res %>%
          mutate(rep = .y, .before = time)
      } else {
        res
      }
    }
  ) %>%
  compact() %>%
  list_rbind() %>%
  mutate(
    Event = case_when(
      Distribution == "V1" ~ "Event-free",
      Distribution == "V2" ~ "Weibull(shape = 1.6, scale = 5)",
      Distribution == "V3" ~ "Log-logistic(shape = 5, scale = 5)",
      Distribution == "V4" ~ "Exponential(rate = 5)"
    )
  )

# write_rds(sampling_distribution_CIFs, "saved_objects/sampling_distribution_CIFs.rds")

Comparison with survsim::crisk.ncens.sim

I initially started on this project by attempting to use the function from survsim, but I was getting very different results from fitting the models using flexsurv. It was also taking quite long (~6x longer) to produce simulated datasets mostly because of me picking simulation parameters without giving it much thought1. Thinking through the process in detail is what led to this post being much longer than anticipated. This section compares the parameterizations for the hazard functions and simulated draws from survsim and flexsurv.

1 I picked a large value for administrative censoring, which meant that uniroot would take much longer to run due to increased number of iterations to converge to the root, and increased time due to multiple integrate() calls.

Converting flexsurv parameterization to survsim

The hazard function code for Weibull and log-logistic is taken from survsim:::crisk.ncens.sim() here

# code from survsim::crisk.ncens.sim()
#
if (dist.ev[k] == "llogistic") {
  a.ev[k] <- 1/exp(beta0.ev[k] + suma[k])
  b.ev[k] <- anc.ev[k]
  cshaz[[k]] <- function(t, r) {
    par1 <- eval(parse(text="a.ev[r]"))
    par2 <- eval(parse(text="b.ev[r]"))
    z    <- eval(parse(text="az1[r]"))
    return(z*(par1*par2*(t^(par2-1)))/(1+par1*(t^par2)))}
}

if (dist.ev[k] == "weibull") {
  a.ev[k] <- beta0.ev[k] + suma[k]
  b.ev[k] <- anc.ev[k]
  cshaz[[k]] <- function(t, r) {
    par1 <- eval(parse(text="a.ev[r]"))
    par2 <- eval(parse(text="b.ev[r]"))
    z    <- eval(parse(text="az1[r]"))
    return(z * ((1/par2)/((exp(par1))^(1/par2)))*
           t^((1/par2) - 1))}
}

It’s a bit confusing since the user function survsim::crisk.sim() takes anc.ev and beta0.ev as inputs, but these get transformed to a.ev and b.ev and then to par1 and par2 (ok that last transformation is straightforward).

Here’s the Weibull AFT programmed in flexsurv

# Weibull AFT parameterization hazard
flexsurv::hweibull
function (x, shape, scale = 1, log = FALSE)
{
    h <- dbase("weibull", log=log, x=x, shape=shape, scale=scale)
    for (i in seq_along(h)) assign(names(h)[i], h[[i]])
    if (log)
        ret[ind] <- log(shape) + (shape-1)*log(x/scale) - log(scale)
    else
        ret[ind] <- shape * (x/scale)^(shape - 1)/scale
    ret
}
<bytecode: 0x62fd9b6e0820>
<environment: namespace:flexsurv>

If the shape parameter is \(a\) and scale parameter is \(\mu\), the hazard function is

\[ h_{\text{wei}}^{\text{flexsurv}}(t; a, \mu) = a\Bigg(\frac{t}{\mu}\Bigg)^{a - 1} \frac{1}{\mu} = a \mu^{-a}t^{a - 1} \]

and for Weibull AFT from survsim

\[ h_{\text{wei}}^{\text{survsim}}(t; p_1, p_2) = \frac{1}{p_2} \Bigg(\Big(e^{p_1}\Big)^{1 / p_2}\Bigg)^{-1} t^{\Big(\frac{1}{p_2} - 1\Big)} \]

with par1 = \(p_1\) (beta0.ev plus any covariate effects, so just beta0.ev at baseline) and par2 = \(p_2\) = anc.ev. Matching the two hazards gives the corresponding transformations for the Weibull in survsim: anc.ev \(= 1 / a\) (the inverse of the flexsurv shape), and beta0.ev \(= \text{log}(\mu)\) (the log of the flexsurv scale).
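A quick numerical check of this Weibull mapping at a few arbitrary time points (the all.equal() comparison is my own check, not taken from either package):

# flexsurv Weibull hazard vs the survsim formula with anc.ev = 1/shape, beta0.ev = log(scale)
t <- c(0.5, 1, 2, 5)
a <- 1.6; mu <- 5              # flexsurv shape and scale
p2 <- 1 / a; p1 <- log(mu)     # survsim anc.ev and beta0.ev
all.equal(
  hweibull(t, shape = a, scale = mu),
  (1 / p2) * (exp(p1))^(-1 / p2) * t^(1 / p2 - 1)
)
# TRUE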

Log-logistic implemented in flexsurv

# log-logistic hazard
flexsurv::hllogis
function(x, shape=1, scale=1, log = FALSE) 
{
    h <- dbase("llogis", log=log, x=x, shape=shape, scale=scale)
    for (i in seq_along(h)) assign(names(h)[i], h[[i]])
    if (log) ret[ind] <- log(shape) - log(scale) + (shape-1)*(log(x) - log(scale)) - log(1 + (x/scale)^shape)
    else ret[ind] <- (shape/scale) * (x/scale)^{shape-1} / (1 + (x/scale)^shape)
    ret
}
<bytecode: 0x62fd99288f80>
<environment: namespace:flexsurv>

if shape is \(a\) and scale is \(b\), the hazard function is

\[ h_{\text{ll}}^{\text{flexsurv}}(t; a, b) = \frac{(a / b) (t / b)^{a - 1}}{1 + (t / b)^a} \]

compared with the one from survsim (with par1 = \(p_1\) and par2 = \(p_2\))

\[ h_{\text{ll}}^{\text{survsim}}(t; p_1, p_2) = \frac{p_1 p_2 t^{p_2 - 1}}{1 + p_1 t^{p_2}} \]

Matching the two gives \(p_2 = \text{anc.ev} = a\) (the flexsurv shape) and \(p_1 = 1 / \text{exp}(\text{beta0.ev}) = b^{-a}\), so \(\text{beta0.ev} = a \, \text{log}(b)\) (the flexsurv shape times the log of the flexsurv scale).
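and the analogous check for the log-logistic mapping:

# flexsurv log-logistic hazard vs the survsim formula with
# anc.ev = shape and beta0.ev = shape * log(scale)
t <- c(0.5, 1, 2, 5)
a <- 5; b <- 5                        # flexsurv shape and scale
p2 <- a; p1 <- 1 / exp(a * log(b))    # i.e. p1 = b^(-a)
all.equal(
  hllogis(t, shape = a, scale = b),
  (p1 * p2 * t^(p2 - 1)) / (1 + p1 * t^p2)
)
# TRUE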

survsim function

survsim_sim_fn <- function(n = 1000) {
  survsim::crisk.sim(
    n = n,
    foltime = 5,
    # TTE distr
    nsit = 3,
    dist.ev = c("weibull", "llogistic", "weibull"),
    # anc.ev is 1 / shape, exp(beta0) is scale in flexsurv weibull AFT
    # anc.ev is shape, beta0.ev is shape * log(scale) in flexsurv llogis
    anc.ev = c(1 / 1.6, 5, 1),
    beta0.ev = c(log(5), 5 * log(5), log(5)),
    # covariate process
    x = list(c("bern", 0.5)),
    beta = list(c(1.4), c(0), c(0)),
    # censoring process, assume constant here to only apply administrative censoring
    dist.cens = "unif", beta0.cens = 500, anc.cens = 500
  ) %>%
    as_tibble()
}

survsim_sim_fn(n = 10)
# A tibble: 10 × 8
     nid cause   time status start stop      z     x
   <int> <dbl>  <dbl>  <dbl> <lgl> <lgl> <dbl> <dbl>
 1     1    NA 5           0 NA    NA        1     0
 2     2     1 3.79        1 NA    NA        1     0
 3     3     3 0.321       1 NA    NA        1     0
 4     4     3 1.02        1 NA    NA        1     1
 5     5     1 2.38        1 NA    NA        1     1
 6     6     3 2.45        1 NA    NA        1     1
 7     7     3 0.0557      1 NA    NA        1     0
 8     8     3 0.0681      1 NA    NA        1     0
 9     9     3 0.0741      1 NA    NA        1     1
10    10     1 2.34        1 NA    NA        1     0

The \(z\) column is for individual frailties.

Compare run times

The next figure shows a dot plot of the execution time distributions from running each function 10 times (again, previously run results are read back in)

Code
# survsim_bench <- bench::mark(
#   survsim_sim_fn(), iterations = 10, 
#   check = FALSE, time_unit = "s", memory = FALSE
# )
# write_rds(survsim_bench, file = "saved_objects/survsim_bench.rds")

survsim_bench <- read_rds("saved_objects/survsim_bench.rds")

# my_implementation_bench <- bench::mark(
#   run_pipeline(sample(1:100, size = 1, replace = TRUE)),
#   iterations = 10, check = FALSE, time_unit = "s", memory = FALSE
# )
# write_rds(my_implementation_bench, file = "saved_objects/my_implementation_bench.rds")

my_implementation_bench <- read_rds("saved_objects/my_implementation_bench.rds")

# dot plot of the run times
tibble(
  time = c(survsim_bench %>% pluck("time", 1), 
           my_implementation_bench %>% pluck("time", 1)),
  group = rep(c("survsim", "handrolled"), each = 10)
) %>%
  ggplot(aes(x = time, y = group, fill = group, color = group)) +
  geom_dotplot(binwidth = 0.05,  binaxis = "x", 
               method = "histodot", binpositions = "bygroup", 
               stackdir = "centerwhole") +
  #geom_point(size = 3, position = position_jitter(height = 0.1, seed = 4)) +
  #theme(legend.position = "bottom", legend.title = element_blank()) +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(1, 4, 1), 
                     labels = seq(1, 4, 1), 
                     limits = c(1, 4)) +
  xlab("Time (in seconds)") +
  ylab("") + 
  scale_color_manual(values = c("orange", "gray30")) + 
  scale_fill_manual(values = c("orange", "gray30"))

This shows about a 2x speedup; I’m assuming most of it comes from using the cumulative hazard functions programmed within flexsurv rather than numerical integration.

Compare samples

The next plot overlays the empirical distribution functions of 100 random subsamples (of roughly 100 observations each, drawn from a single sample of 10,000 per implementation) from the two implementations

Code
# eCDFs for a large draw overlaid to compare the draws - they should be overlapping
# set.seed(23)
# survsim_sample <- survsim_sim_fn(n = 10000)
# set.seed(23)
# handrolled_sample <- simulate_single_dataset(n_rows = 10000, seed = NULL)
#
# write_rds(survsim_sample, file = "saved_objects/survsim_sample.rds")
# write_rds(handrolled_sample, file = "saved_objects/handrolled_sample.rds")

survsim_sample <- read_rds("saved_objects/survsim_sample.rds")
handrolled_sample <- read_rds("saved_objects/handrolled_sample.rds")

bind_rows(
  survsim_sample %>%
    select(time) %>%
    mutate(group = "survsim"),
  handrolled_sample %>%
    select(time) %>%
    mutate(group = "handrolled")
) %>%
  mutate(rep = sample(1:100, size = 20000, replace = TRUE)) %>%
  ggplot(aes(x = time, group = interaction(group, rep), color = group)) +
  stat_ecdf(pad = FALSE, alpha = 0.8) +
  xlab("Simulated Times") +
  ylab("Empirical CDF") +
  theme_classic() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_color_manual(values = c("orange", "gray30"))

It seems that the gray curves (survsim) are a bit higher than the ones in orange (my implementation). I haven’t looked into why, but it seems likely to be due to sampling variation.

References

The main refs for the material in this post

  • Beyersmann, J., Latouche, A., Buchholz, A., & Schumacher, M. (2009). Simulating competing risks data in survival analysis. Statistics in medicine, 28(6), 956-971.

  • Allignol, A., Schumacher, M., Wanner, C. et al. Understanding competing risks: a simulation point of view. BMC Med Res Methodol 11, 86 (2011). https://doi.org/10.1186/1471-2288-11-86 - what I found cool is that this paper uses the non-parametric estimate of the baseline hazard function to simulate proportional hazard data from the cox model

  • Crowther, M. J., & Lambert, P. C. (2012). Simulating Complex Survival Data. The Stata Journal, 12(4), 674-687. https://doi.org/10.1177/1536867X1201200407 (Original work published 2012)

  • Putter, H., Fiocco, M. and Geskus, R.B. (2007), Tutorial in biostatistics: competing risks and multi-state models. Statist. Med., 26: 2389-2430. https://doi.org/10.1002/sim.2712 - key paper on competing risk analyses

  • Hinchliffe, S.R., Lambert, P.C. Flexible parametric modelling of cause-specific hazards to estimate cumulative incidence functions. BMC Med Res Methodol 13, 13 (2013). https://doi.org/10.1186/1471-2288-13-13

  • flexsurv package vignette on multi-state modelling

  • flexsurv package introduction vignette

  • flexsurv vignette on distributions

  • survsim package docs

Some other papers / package vignettes I came across and read (parts of if not entirely)

  • survival package multi-state vignette

  • survival package vignette reproducing Putter et al analysis

  • Andersen PK, Geskus RB, de Witte T, Putter H. Competing risks in epidemiology: possibilities and pitfalls. Int J Epidemiol. 2012 Jun;41(3):861-70. doi: 10.1093/ije/dyr213. Epub 2012 Jan 9. PMID: 22253319; PMCID: PMC3396320.

  • Austin PC, Lee DS, Fine JP. Introduction to the Analysis of Survival Data in the Presence of Competing Risks. Circulation. 2016 Feb 9;133(6):601-9. doi: 10.1161/CIRCULATIONAHA.115.017719. PMID: 26858290; PMCID: PMC4741409.

  • Fine, J. P., & Gray, R. J. (1999). A Proportional Hazards Model for the Subdistribution of a Competing Risk. Journal of the American Statistical Association, 94(446), 496–509. https://doi.org/10.2307/2670170

  • Latouche A, Allignol A, Beyersmann J, Labopin M, Fine JP. A competing risks analysis should report results on all cause-specific hazards and cumulative incidence functions. J Clin Epidemiol. 2013 Jun;66(6):648-53. doi: 10.1016/j.jclinepi.2012.09.017. Epub 2013 Feb 14. PMID: 23415868.

  • Scheike, T. H., & Zhang, M.-J. (2011). Analyzing Competing Risk Data Using the R timereg Package. Journal of Statistical Software, 38(2), 1–15. https://doi.org/10.18637/jss.v038.i02

  • Thomas H. Scheike, Mei-Jie Zhang, Thomas A. Gerds, Predicting cumulative incidence probability by direct binomial regression, Biometrika, Volume 95, Issue 1, March 2008, Pages 205–220, https://doi.org/10.1093/biomet/asm096

  • Edouard F Bonneville, Liesbeth C de Wreede, Hein Putter, Why you should avoid using multiple Fine–Gray models: insights from (attempts at) simulating proportional subdistribution hazards data, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 187, Issue 3, August 2024, Pages 580–593, https://doi.org/10.1093/jrsssa/qnae056