load(here("data/DOT-perRule-perStage.Rdata"))
dotStage %<>% filter(!is.na(initiated)) # select rule-stage observations with a dot report

With event data, we are often interested in two kinds of outcomes: (1) what happens? and (2) how long does it take?

Classic Cox regression (a.k.a. duration model, survival model, hazard model, proportional hazad model) is designed to awnswer the second question with respect to one outcomes. A competing risks models extend this model to multiple ourcomes, thereby helping to address the first question. Competing risks are assessed by extimating multiple cox models with a restriction that the probability of all events must sum to one.

Often, however, the processes we study have multipld stages (or “states”).

For example, in this paper with Susan Yackee, we are interested in which policy initiatives become law and which are withdrawn. But on the way to each final outcome, policies pass through important stages, like having an official draft published. A policy initiative can be withdrawn either before a draft is published or after, so we want to model policies first as potentially going from initiated to either official draft or withdrawn and second as potentially going from official draft to law or withdrawn. We want to know both what happens at each stage (the probability of transtioning from one state (e.g. official draft) to another (e.g. withdrawn) and how long it takes.

One way to model this is in a multi-state Cox model. Each stage is a competing risk model (i.e. a restricted set of Cox models) common frailities are estimated across stages. This allows us to simultaneously model how different factors may affect a policy at different stages (i.e. states), while also modeling a common frailty.

In our paper, we use monthly status reports.

The two questions above demands a disctinct statistics.

What happens? First, we estimate “transition probabilities” (i.e. the probability that a policy will advance to each possible future state at a given time) using the Aalen-Johansen estimator and the cumulative incidence function (the function estimating the number of observations that have transitioned to and from each state over time).

How long does it take? Second, we estimate the probability that a given transition will occur in a given time window using the Kaplan-Meier estimator. Covariates may increase or decrease the likelihood that a given transition will occur, thus increasing or decreasing the estimated rate of advancing from a given stage to a given stage in the policymaking process, and thus the Kaplan-Meier estimate that a given transition will occur after a given date.

To capture common elements of each rule, we use a penalized likelihood a random effects—a frailty model—with frailties for each rulemaking project (each Regulatory Identification Number). These frailities are common across models. This approach is similar to “Surviving Phases: Introducing Multi-state Survival Models” by Metzger and Jones, who provide useful illustrations in their appendix.

We use mstate, an R package has been developed for multi-state duration modeling. While the vignette does not use tidy data, the JSS article introducing mstate and the multi-state and competing risk vignette offer a nice introduction to the method. Rpubs also has a useful review of survival analysis. The TPmsm package also estimates transition probabilities.

One important gap in the mstate documentation is its discription of the newdata argument in the transprob function. In short, in order to calculate transition probabilities, we need a data frame of values at which to fix covariates–think predicted probabilities from a logit regression. (More on this below.)

There are several ggplot wrappers for plotting survival objects, including ggsurvplot and survminer, and ggfortify allows survfit to be used by autoplot(), but I opt to skip the wrappers and use tidy data. The tidy() function from the broom package converts survfit objects into tidy data frames (see tidy.survfit documentation). Unfortunately, tidy() does not work for mstate and probtrans objects, so tidying those objects takes more work.

Using mstate

First, we model the most common stages that rules go through with no covariates, only the dates of transition in and out of the pre-NPRM and NPRM stages. We use the DOT data with one observation per rule per stage.

Tidy transition event data

Formatting data for multi-state duration modeling requires one to identify the dates of transitions to and from each stage. To do this we must define and index a set of transitions of interest.

The data begin as one observation per rule per transition. The key variable is the days between transitions (i.e. the days the policy spent in a state).

# PREP DATA ###########################
# Here we take event data with rows indexed by ID and columns for relevant dates and covariates.
# The method takes date variables, melts the data to identify the order in which various transitions occur, and, finally, models transitions in and out of each stage as a multi-state survival model. 
# This is a more generic approach than the mstate package vignettes as it does not require one to specify which transitions are possible or actually present in the data. Instead, I derive the transition matrix from the data.
# It also attempts to keep data in a tidy format (rather than special class objects). 

# First we identify the order in which events occurred in order to identify different types of transitions. 

# STEP 1: IDENTIFY TRANSITIONS AND DATES 

# These data start as one observation per ID (in this case, "RIN"). I will be extending this approach to data frames of monthly observations for each RIN
load(here("data/DOT-perRule.Rdata"))

# First, identify an ID variable so that we can split out transitions observed for each ID 
d <- original <- rename(dotRIN, id = RIN)

# Select a set of events to model (We will merge back in with covariates later.)
d %<>% select(id, Initiated, NPRMpublished, FinalRulePublished, WithdrawalPublished)

# Melt data to be one observation per event
d %<>% 
  melt(id = "id", na.rm = T, value.name = "transition_date") %>% 
  distinct()

# Convert dates to strings for now to make things easier (specifically, the super-useful mutate(ifelse()) method). 
d$transition_date %<>% as.character()

# Group by RIN and identify sequence
d %<>% 
  group_by(id) %>% 
  arrange(transition_date) %>% 
  rename(event = variable) %>%
  # clean up text for this example
  mutate(event = str_replace(event, " *$|published|Published", "")) %>%
  mutate(event = str_replace(event, "FinalRule", "Final Rule")) %>%
  # make a variable describing the path each took
  mutate(transitions = paste(event, collapse = " to ")) 
  
# Parse transitions from sequence (how do I make this more general?)
d %<>% separate(transitions, into= c("t1", "t2", "t3", "t4", "t5"), sep=" to ") %>% 
  #mutate(Initiated = t1) %>% 
  mutate(t1 = paste(t1, "to", t2)) %>% 
  mutate(t2 = paste(t2, "to", t3)) %>% 
  mutate(t3 = paste(t3, "to", t4)) %>% 
  mutate(t4 = paste(t4, "to", t5))

# melt back down to one observation per id per transition event
d %<>% melt(id = c("id", "event", "transition_date"), na.rm = T, value.name = "transition") %>% 
  rename(transition_count = variable) %>% 
  filter( !grepl(" to NA", transition) )

# identify entry and exit dates for each state 
d %<>% 
  group_by(id, transition) %>% 
  mutate(exit_date = ifelse(event == gsub(".* to ", "", transition),
                            transition_date, 
                            NA)) %>% 
  mutate(exit_date = max(na.omit(exit_date))) %>%
  mutate(entry_date = ifelse(event == gsub(" to .*", "", transition),
                            transition_date, 
                            NA)) %>% 
  mutate(entry_date = max(na.omit(entry_date) ) )

# Now that we have identified transitions and the entry and exit dates, we no longer need the `event` and `transition_date` variables 
# However, we will need the event names for the tranition matrix, so we save them
events <- as.character(unique(d$event))
d %<>% select(-event, -transition_date) %>% distinct()

###############################
# Merge covariats back into event data
# STEP 2: MERGE BACK WITH DATA
d %<>% left_join(original) %>% ungroup() %>% arrange(id) 

# Count days until event with `difftime()`
d %<>% 
  mutate(exit = as.numeric(difftime(exit_date, Initiated, units="days") ) )%>% 
  mutate(entry = as.numeric(difftime(entry_date, Initiated, units="days") ) ) 

# Days until initiated is obviously 0, but difftime has odd rounding, so let us fix that:
d$entry[grepl("^Init", d$transition)] <- 0

# NOTE THERE ARE SOME ERRORS IN THESE DATA (at least one left) THAT MAY NEED TO BE INVESTIGATED IF THEY PERSIST AFTER PROPER CODING OF TRANSITIONS 
d %<>% filter(exit > entry) # FIXME

# Finally, we define the time until the event
d %<>% mutate(time = exit - entry)


  
##########################################################
# INDEX TRANSITIONS 
d$transition %<>% as.factor()
trans <- tlevels <- levels(d$transition)

# Possible transitions from state i
for(i in 1:length(events)){
  trans <- gsub(events[i], i, trans)
}

# Make a trasition matrix (indexing transitions to and from each state)
trans <- str_split(trans, " to ", simplify = TRUE)

# Use the matrix to populate a list for the transMat()
from <- list()
for(i in 1:length(events)){
  from[[i]] <- as.numeric(trans[which(trans[,1]==i),2])
}

# Transition Matrix
tmat <- transMat(from, names = events)



# ALL OBS ARE TRANSITIONS, HOWEVER, WE CAN INCLUDE NON TRANSITION OBS (i.e. months where no transition happened) -- DOES IT HELP? Perhaps only with covariates that have values for those non-transition months?
d$status <- 1

save(d, file = here("data/2-stage.Rdata"))

In estimation, the multi-stage duration model is stratified by transition AND fitted per transition.

To model this, we use the multi-state method. When Type = "mstate" the left-hand side is broken out by transition–i.e. instead of predicting overall rate we are predicting the rate per transition, given that the observation is going to see that transition (strata(transition) on the right side).


We use msfit() to extract transition probabilities and fitted values (hazards).

Statistic 1: Transition probabilities (i.e. which transition will we see?)

Question: What is the relationship between transition probabilities and cumulative incidence?

NOTE: this plot is backwards because msfit() and probtrans mess up the factor levels of the transition variable and I need to retain those new levels to match the to and from states.

# MODEL WITH NO COVARIATES 
d$trans <-d$transition
mod.1 <- coxph(Surv(entry, exit, status) ~ strata(trans), data = d, method = "breslow")

# FIXME
# msfit from a coxph does not allow multinomial left hand side? wtf?
# # Extract fitted values
fit <- msfit(mod.1, trans=tmat)

# Extract transition probabilities from each state
# pred = # A positive number indicating the prediction time--the time at which the prediction is made
pt0 <- probtrans(fit,  predt = 1) 

# tidy the mstate object
trans <- as.data.frame(pt0[[5]]) # transition matrix
p <- pt0[[1]] # estimate matrix
p$from <- paste(1, names(trans)[1]) # names from transition matrix
for(i in 2:length(events)){
  pt <- pt0[[i]]
  pt$from <- paste(i, names(trans)[i])
  p <- rbind(p, pt)
}

# p$pstate2[which(p$from == "1 Initiated")] <- p$pstate2[which(p$from == "1 Initiated")] + p$pstate1[which(p$from == "2 NPRM")] + 
#   p$pstate3[which(p$from == "2 NPRM")] + 
#   p$pstate4[which(p$from == "2 NPRM")]

# melt into one observation per transition per time
p %<>% melt(id = c("time", "from"), value.name = "Probability")

# separate state from variable (estimate, error) name
p %<>% separate(variable, into = c("variable", "to"), sep = "e") %>%
  # spread estimates and errors into two variables
  spread(key = "variable", value = "Probability") %>%
  # name estimate and error variables
  rename(estimate = pstat, std.error = s)

# name transitions
p$to %<>% as.factor()
p$from %<>% as.factor()
levels(p$to) <- levels(p$from)


p %<>% 
  mutate(Transition = paste0("Pr(",to,"|",from,")")) %>%
  mutate(from = factor(paste("At state", from))) %>% 
  mutate(to = factor(paste("Pr ", to))) 


# plot
p %>% filter(time <3650, !estimate %in% c(0,1) ) %>% 
  ggplot(aes(x = time/365, color = Transition, fill = Transition) ) +
  geom_step(aes(y = estimate)) +
  geom_ribbon(aes(ymax = estimate + 1.96*std.error, 
                  ymin = estimate - 1.96*std.error),
              alpha = .2, color = NA) + 
  #geom_text(aes(y = max, x = max(time/365), label = paste0(round(max*100), "%")), hjust = 1, vjust = -.2, color = "black") +
  labs(x = "Years" , y = "Transition probability
from Aalen-Johansen estimator") +
  #facet_grid(. ~ from) + # for comparing states without covariates 
  facet_grid(to ~ from) + # for comparing covariates within states
  theme_bw()

Question: Why does msfit probtrans() with covariates require a newdata argument?

The model is nearly identical to the one above:

mod.1 <- coxph(Surv(time, status) ~ MAJOR + strata(trans), data = d, method = "breslow")

But extracting fitted values requires “newdata”—why? The documentation does not say.

fit <- msfit(mod.1, trans=tmat)


Statistic 2: Rate/Cumulative Incidence (How long does a given transition take?)

fit <- survfit(Surv(entry, exit, status * as.numeric(transition), type = "mstate") ~ strata(transition),
                 data = d) %>% tidy()
# Status is now multinomial by trasition. This will help add non-event time observations (i.e. status !=1)?

fit <- survfit(Surv(entry, exit, as.numeric(transition), type = "mstate") ~ strata(transition),
                 data = d) %>% tidy()
# FIXME
fit %<>%   filter(estimate != 0) 
# why does this fit return 0's for values of x where there are not observations? 
# maybe these are what is returned when the transitions on the left and right dont match? 

# replace numeric transition index with transition names (note: levels with too few observations may be dropped)
fit$state %<>% as.factor()
# levels(fit$state)
levels(fit$state) <- levels(d$transition)[as.numeric(levels(fit$state))]
fit$Transition <- fit$state

# make faceting variables for each current and future state
fit %<>% 
  mutate(to = gsub(".*=|.* to ", "to ", Transition) ) %>%
  mutate(from = factor(paste("from", gsub(".*=| to .*","", Transition)) ) ) # fit$to
# fit$from

# plot
fit %>% 
  ggplot(aes(x = time/365, color = Transition, fill = Transition) ) +
  geom_step(aes(y = estimate)) +
    geom_point(aes(y = estimate), shape = "+") +
  geom_ribbon(aes(ymax = estimate + 1.96*std.error, 
                  ymin = estimate - 1.96*std.error),
              alpha = .2, color = NA) + 
  ylim(0, 1) + 
  labs(x = "Years" , y = "Cumulative incidence of each transition") +
  # note without strata trans, we can get this function parsed by transition, but I am not sure it is right 
  # labs(x = "Years" , y = "Cumulative incidence of all transitions") +
  facet_grid(. ~ from) + # for comparing states without covariates 
  #facet_grid(to ~ from) + # for comparing covariates within states
  theme_bw()

How long does a given transition take? (by major vs. non-major rules)

Now the mstate model with a covariate:

fit <- survfit(Surv(time, status * as.numeric(transition), type = "mstate") ~ MAJOR + strata(transition),
               data = d) %>% tidy()
# FIXME
fit %<>%   filter(estimate != 0) 

# replace numeric transition index with transition names (note: levels with too few observations may be dropped)
fit$state %<>% as.factor()
levels(fit$state) <- levels(d$transition)[as.numeric(levels(fit$state))]
fit$Transition <- fit$state

# make faceting variables for each current and future state
fit %<>% 
  mutate(to = gsub(".*=|.* to ", "to ", Transition) ) %>%
  mutate(from = factor(paste("from", gsub(".*=| to .*","", Transition)) ) ) 
# trim srata 
fit$strata %<>% str_replace(",.*", "") 

# plot
fit %>% filter(time < 3650) %>%
  ggplot(aes(x = time/365, color = strata, fill = strata) ) +
  geom_step(aes(y = estimate)) +
    geom_point(aes(y = estimate), shape = "+") +
  geom_ribbon(aes(ymax = estimate + 1.96*std.error, 
                  ymin = estimate - 1.96*std.error),
              alpha = .2, color = NA) + 
  # geom_text(aes(y = max, x = max(time/365), label = paste0(round(max*100), "%")), hjust = 1, vjust = -.2, color = "black") +
  labs(x = "Years" , y = "Cumulative incidence of all transitions") +
  #facet_grid(. ~ from) + # for comparing states without covariates 
  facet_grid(to ~ from)# for comparing covariates within states


Instead of cumulative incidence, we can also look at the hazard rate directly:

Hazard rates stratified by transition from the mstate package:

# NOTE THIS IS CHUNK IS OLD CODE WRITTEN SPECIFICALLY FOR THIS CASE
# Thus, it has a different variable name than above.  

# select vars for minimal model 
d <- select(dotStage, DOTdate, RIN, STAGE, initiated, DaysUntilNPRM, DaysUntilFinalRule, DaysUntilWithdrawal) %>%
  # select minimal stages
  filter(!is.na(initiated), !is.na(DOTdate), STAGE %in% c("Proposed Rule", "Final Rule", "Withdrawal")) %>% 
  distinct() %>% 
  arrange(DOTdate) %>% arrange(RIN)

# Days until event 
# FIXME
# note, bellow subs last observed date as published date where missing; not always correct
d %<>% mutate(DaysUntilNPRM = ifelse(is.na(DaysUntilNPRM) & STAGE == "Proposed Rule", DOTdate - initiated, DaysUntilNPRM))
d %<>% mutate(DaysUntilFinalRule = ifelse(is.na(DaysUntilFinalRule) & STAGE == "Final Rule", DOTdate - initiated, DaysUntilFinalRule))
d %<>% mutate(DaysUntilWithdrawal = ifelse(is.na(DaysUntilWithdrawal) & STAGE == "Withdrawal", DOTdate - initiated, DaysUntilWithdrawal))


# identify previous stages in the data (need to use previous stage var to supplement this)
d %<>% group_by(RIN) %>%
  mutate(stages = paste(STAGE, collapse = ";")) %>% ungroup() 

d$trans <- NA
# to proposed rule = trans 1 
d$trans[grepl("^Proposed Rule", d$stages) & d$STAGE == "Proposed Rule"] <- 1

# direct to final = trans 2 
d$trans[d$stages =="Final Rule"] <- 2

# withdrawal = trans 3
d$trans[d$stages == "Proposed Rule;Withdrawal" & d$STAGE == "Withdrawal"] <- 3

# proposed to final = trans 4 
d$trans[grepl("Proposed Rule;Final Rule", d$stages) & d$STAGE == "Final Rule"] <- 4

# Transition Matrix 
tmat <- transMat(list(c(2, 4), #transitions from prerule
                      c(3,4),                    #transitions from nprm
                      c(),                    #transitions from withdrawal
                      c()),        # trans from final 
                 names = c("preNPRM", "NPRM", "Withdrawal", "Final"))


# USE INDEX TO ID ENTRY AND EXIT DATES  
# prerule stage: entry = initiated, exit = NPRM published, status = 1 if DOT report date = NPRM published date

# obs entering at date initiated
d %<>% mutate(entry = ifelse(trans %in% c(1,2), 0, NA))

# obs entering at NPRM date 
d %<>% mutate(entry = ifelse(trans %in% c(3,4), DaysUntilNPRM, entry))

# obs exiting at Final Rule published date
d %<>% mutate(exit = ifelse(trans %in% c(2,4), DaysUntilFinalRule, NA))

# obs exiting at Withdrawal published date
d %<>% mutate(exit = ifelse(trans %in% c(3), DaysUntilWithdrawal, exit))

# obs exiting at NPRM published date
d %<>% mutate(exit = ifelse(trans %in% c(1), DaysUntilNPRM, exit))

# NOTE THERE ARE SOME ERRORS THAT MAY NEED TO BE INVESTIGATED IF THEY PERSIST AFTER PROPER CODING OF TRANSITIONS 
d %<>% filter(exit > entry) # FIXME

# ALL OBS ARE TRANSITIONS, HOWEVER, WE CAN INCLUDE NON TRANSITION OBS (i.e. months where no transition happened) -- DOES IT HELP? Perhaps only with covariates that have values for those non-transition months?
d$status <- 1
# MODEL WITH NO COVARIATES 
mod.1 <- coxph(Surv(entry, exit, status) ~ strata(trans), data = d, method = "breslow")

# # INSPECT 
# summary(mod.1)
# class(mod.1$xlevels)

# # Extract fitted values
fit <- msfit(mod.1, trans=tmat, variance=TRUE)

# CAUTION NOTE: in one run, Withdrawals got dropped, making 4 (NPRM to Final) the 3rd factor. Not cool.
f <- fit$Haz

# Name stage for trans 1 and 2
f %<>% mutate(Stage = ifelse(trans %in% c(1,2), " Pre NPRM Stage", "Post NPRM Stage") )

# Name transitions
f$trans %<>% {gsub(1, "Inititiated to NPRM", .)}
f$trans %<>% {gsub(2, "Direct to Final", .)}
f$trans %<>% {gsub(3, "NPRM to Withdrawal", .)} 
f$trans %<>% {gsub(4, "NPRM to Final", .)} 

# Name variables
names(f) <- c("Days", "Hazard", "Transition", "Stage")
# plot
ggplot(f) +
  geom_line(aes(x = Days, y = Hazard, color = Transition)) + 
  facet_grid(. ~ Stage) + 
  theme_bw()

Most common transition patterns

Without “Undetermined,” “Completed,” or “Other”

Uncommon patterns

Reporting results

The values at which we fix covariates matter for the translation of coefficients to hazards; it is not just an intercept shift like linear regression. Consider this simple example:

\(\lambda_t = \lambda_0*e^{(x_1*\beta_1 + x_2*beta_2)}.\)

Let \(\beta_1\) and \(beta_2 = 1\) and \(\lambda_0 = 1\)

If we fix \(x1\) at 1: \(\lambda_t = e^{(1 + x_2)} = e*e^{(x_2)}\)

If we fix x1 at 2: \(\lambda_t = e^{(2 + x_2)} = e^2*e^{(x_2)}\)

So, we need to think carefully about the covariate values at which we want to report effects.