Log-Normal Regression for Duration Dependent Variables
The log-normal model describes an event’s duration, the dependent variable, as a function of a set of explanatory variables. The log-normal model may take time censored dependent variables, and allows the hazard rate to increase and decrease.
With reference classes:
z5 <- zlognorm$new()
z5$zelig(Surv(Y, C) ~ X, weights = w, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Surv(Y, C) ~ X, model = "lognorm", weights = w, data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
Log-normal models require that the dependent variable be in the form Surv(Y, C), where Y and C are vectors of length . For each observation in 1, …, , the value is the duration (lifetime, for example) of each subject, and the associated is a binary variable such that if the duration is not censored (e.g., the subject dies during the study) or if the duration is censored (e.g., the subject is still alive at the end of the study). If is omitted, all Y are assumed to be completed; that is, time defaults to 1 for all observations.
In addition to the standard inputs, zelig() takes the following additional options for lognormal regression:
z.out <- zelig(y ~ x1 + x2, robust = TRUE, cluster = "x3", model = "exp", data = mydata)
means that the observations can be correlated within the strata defined by the variable x3, and that robust standard errors should be calculated according to those clusters. If robust = TRUE but cluster is not specified, zelig() assumes that each observation falls into its own cluster.
Attach the sample data:
data(coalition)
Estimate the model:
z.out <- zelig(Surv(duration, ciep12) ~ fract + numst2, model ="lognorm", data = coalition)
## Warning in readLines(zeligmixedmodels): incomplete final line found on
## '/usr/lib64/R/library/ZeligMultilevel/JSON/zelig5mixedmodels.json'
## How to cite this model in Zelig:
## Terry M Therneau, and Thomas Lumley. 2007.
## lognorm: Log-Normal Regression for Duration Dependent Variables
## in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
## "Zelig: Everyone's Statistical Software," http://zeligproject.org/
View the regression output:
summary(z.out)
## Model:
##
## Call:
## z5$zelig(formula = Surv(duration, ciep12) ~ fract + numst2, data = coalition)
## Value Std. Error z p
## (Intercept) 5.36667 0.597517 8.98 2.67e-19
## fract -0.00444 0.000806 -5.51 3.65e-08
## numst2 0.55983 0.142351 3.93 8.40e-05
## Log(scale) 0.18239 0.044214 4.13 3.70e-05
##
## Scale= 1.2
##
## Log Normal distribution
## Loglik(model)= -1078 Loglik(intercept only)= -1101
## Chisq= 46.6 on 2 degrees of freedom, p= 7.7e-11
## Number of Newton-Raphson Iterations: 3
## n= 314
##
## Next step: Use 'setx' method
Set the baseline values (with the ruling coalition in the minority) and the alternative values (with the ruling coalition in the majority) for X:
x.low <- setx(z.out, numst2 = 0)
x.high <- setx(z.out, numst2= 1)
Simulate expected values (qi$ev) and first differences (qi$fd):
s.out <- sim(z.out, x = x.low, x1 = x.high)
summary(s.out)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 18.3 2.42 18.2 14.1 23.4
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 16.4 24 8.2 0.869 83.7
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 31.9 3.57 31.7 25.3 39.4
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 33.1 64.1 15.7 1.74 163
## fd
## mean sd 50% 2.5% 97.5%
## 1 13.6 3.57 13.5 7.02 20.7
plot(s.out)
Let be the survival time for observation with the density function and the corresponding distribution function . This variable might be censored for some observations at a fixed time such that the fully observed dependent variable, , is defined as
y_c \\ \end{array} \right."/>
The stochastic component is described by the distribution of the partially observed variable, . For the lognormal model, there are two equivalent representations:
where the parameters and are the mean and variance of the Normal distribution. (Note that the output from zelig() parameterizes scale:math:` = sigma`.)
In addition, survival models like the lognormal have three additional properties. The hazard function measures the probability of not surviving past time given survival up to . In general, the hazard function is equal to where the survival function represents the fraction still surviving at time . The cumulative hazard function describes the probability of dying before time . In general, . In the case of the lognormal model,
where is the cumulative density function for the Normal distribution.
The systematic component is described as:
The expected values (qi$ev) for the lognormal model are simulations of the expected duration:
given draws of and from their sampling distributions.
The predicted value is a draw from the log-normal distribution given simulations of the parameters .
The first difference (qi$fd) is
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is
where is a binary explanatory variable defining the treatment () and control () groups. When is censored rather than observed, we replace it with a simulation from the model given available knowledge of the censoring process. Variation in the simulations is due to two factors: uncertainty in the imputation process for censored and uncertainty in simulating , the counterfactual expected value of for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to .
In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is
where is a binary explanatory variable defining the treatment () and control () groups. When is censored rather than observed, we replace it with a simulation from the model given available knowledge of the censoring process. Variation in the simulations are due to two factors: uncertainty in the imputation process for censored and uncertainty in simulating , the counterfactual predicted value of for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to .
The Zelig object stores fields containing everything needed to rerun the Zelig output, and all the results and simulations as they are generated. In addition to the summary commands demonstrated above, some simply utility functions (known as getters) provide easy access to the raw fields most commonly of use for further investigation.
In the example above z.out$getcoef() returns the estimated coefficients, z.out$getvcov() returns the estimated covariance matrix, and z.out$getpredict() provides predicted values for all observations in the dataset from the analysis.
The exponential function is part of the survival library by by Terry Therneau, ported to R by Thomas Lumley. Advanced users may wish to refer to help(survfit) in the survival library.
Therneau T (2015). A Package for Survival Analysis in S. version 2.38, <URL: https://CRAN.R-project.org/package=survival>.
Terry M. Therneau and Patricia M. Grambsch (2000). Modeling Survival Data: Extending the Cox Model. Springer, New York. ISBN 0-387-98784-3.