This section provides technical information on the models currently implemented in Zelig (5.0-1): a reference list of supported models, and individual model vignettes with detailed information on each model, its quantities of interest, and its syntax.
The following models are currently supported in Zelig 5.0-1:
Exponential Regression for Duration Dependent Variables
Use the exponential duration regression model if you have a dependent variable representing a duration (time until an event). The model assumes a constant hazard rate for all events. The dependent variable may be censored (for observations that had not yet completed when the data were collected).
With reference classes:
z5 <- zexp$new()
z5$zelig(Surv(Y, C) ~ X, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Surv(Y, C) ~ X, model = "exp", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
Exponential models require that the dependent variable be in the form Surv(Y, C), where Y and C are vectors of length $n$. For each observation $i$ in $1, \ldots, n$, the value $y_i$ is the duration (lifetime, for example), and the associated $c_i$ is a binary variable such that $c_i = 1$ if the duration is not censored (e.g., the subject dies during the study) or $c_i = 0$ if the duration is censored (e.g., the subject is still alive at the end of the study and is known to live at least as long as $y_i$). If C is omitted, all Y are assumed to be completed; that is, $c_i$ defaults to 1 for all observations.
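For instance, a minimal sketch of constructing a censored outcome with the survival package (values hypothetical):

library(survival)

Y <- c(5, 12, 8, 20)   # durations
C <- c(1, 0, 1, 1)     # 1 = fully observed, 0 = censored
Surv(Y, C)             # prints 5 12+ 8 20; the "+" marks the censored time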
In addition to the standard inputs, zelig() takes the following additional options for exponential regression:

robust: defaults to FALSE. If TRUE, zelig() computes robust standard errors based on sandwich estimators and the options selected in cluster.

cluster: if robust = TRUE, you may select a variable to define groups of correlated observations. Let x3 be a variable that consists of either discrete numeric values, character strings, or factors that define strata. Then

z.out <- zelig(y ~ x1 + x2, robust = TRUE, cluster = "x3",
               model = "exp", data = mydata)

means that the observations can be correlated within the strata defined by the variable x3, and that robust standard errors should be calculated according to those clusters. If robust = TRUE but cluster is not specified, zelig() assumes that each observation falls into its own cluster.
Attach the sample data:
data(coalition)
Estimate the model:
z.out <- zelig(Surv(duration, ciep12) ~ fract + numst2, model = "exp", data = coalition)
## How to cite this model in Zelig:
## Olivia Lau, Kosuke Imai, Gary King. 2011.
## exp: Exponential Regression for Duration Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
View the regression output:
summary(z.out)
## Model:
## $by
## [1] 1
##
## Call:
## survival::survreg(formula = Surv(duration, ciep12) ~ fract +
## numst2, data = ., dist = "exponential", model = FALSE)
##
## Coefficients:
## (Intercept) fract numst2
## 5.535873 -0.003909 0.461179
##
## Scale fixed at 1
##
## Loglik(model)= -1077 Loglik(intercept only)= -1101
## Chisq= 46.66 on 2 degrees of freedom, p= 7.4e-11
## n= 314
## Next step: Use 'setx' method
Set the baseline values (with the ruling coalition in the minority) and the alternative values (with the ruling coalition in the majority) for X:
x.low <- setx(z.out, numst2 = 0)
x.high <- setx(z.out, numst2 = 1)
Simulate expected values and first differences:
s.out <- sim(z.out, x = x.low, x1 = x.high)
Summarize quantities of interest and produce some plots:
summary(s.out)
plot(s.out)
Let $Y_i^*$ be the survival time for observation $i$. This variable might be censored for some observations at a fixed time $y_c$ such that the fully observed dependent variable, $Y_i$, is defined as

$$Y_i = \begin{cases} Y_i^* & \text{if } Y_i^* \le y_c \\ y_c & \text{if } Y_i^* > y_c \end{cases}$$
The stochastic component is described by the distribution of the partially observed variable $Y^*$. We assume $Y_i^*$ follows the exponential distribution, whose density function is given by

$$f(y_i^* \mid \lambda_i) = \frac{1}{\lambda_i} \exp\left( -\frac{y_i^*}{\lambda_i} \right)$$

for $y_i^* \ge 0$ and $\lambda_i > 0$. The mean of this distribution is $\lambda_i$.
In addition, survival models like the exponential have three additional properties. The hazard function $h(t)$ measures the probability of not surviving past time $t$ given survival up to $t$. In general, the hazard function is equal to $f(t)/S(t)$, where the survival function $S(t)$ represents the fraction still surviving at time $t$. The cumulative hazard function $H(t)$ describes the probability of dying before time $t$. In general, $H(t) = \int_0^t h(s)\,ds = -\log S(t)$. In the case of the exponential model,

$$h(t) = \frac{1}{\lambda_i}, \qquad S(t) = \exp\left(-\frac{t}{\lambda_i}\right), \qquad H(t) = \frac{t}{\lambda_i}.$$
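These identities are easy to verify numerically; the sketch below uses an arbitrary value of $\lambda_i$:

lambda <- 20                     # an arbitrary expected duration
t <- c(1, 5, 10)
h <- rep(1 / lambda, length(t))  # hazard, constant over time
S <- exp(-t / lambda)            # survival function
H <- t / lambda                  # cumulative hazard
all.equal(H, -log(S))            # TRUE: H(t) = -log S(t)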
For the exponential model, the hazard function is constant over time. The Weibull and log-normal models allow the hazard function to vary as a function of elapsed time (see their respective vignettes).
The systematic component is modeled as

$$\lambda_i = \exp(x_i \beta),$$

where $x_i$ is the vector of explanatory variables and $\beta$ is the vector of coefficients.
The expected values (qi$ev) for the exponential model are simulations of the expected duration given $x_i$ and draws of $\beta$ from its posterior:

$$E(Y) = \lambda_i = \exp(x_i \beta).$$
The predicted values (qi$pr) are draws from the exponential distribution with mean equal to the simulated expected value $\lambda_i$.
The first difference (or difference in expected values, qi$ev.diff) is

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x),$$

where $x$ and $x_1$ are different vectors of values for the explanatory variables.
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. When $Y_i(t_i = 1)$ is censored rather than observed, we replace it with a simulation from the model given available knowledge of the censoring process. Variation in the simulations is due to two factors: uncertainty in the imputation process for censored $y_i^*$ and uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. When $Y_i(t_i = 1)$ is censored rather than observed, we replace it with a simulation from the model given available knowledge of the censoring process. Variation in the simulations is due to two factors: uncertainty in the imputation process for censored $y_i^*$ and uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(Surv(Y, C) ~ X, model = "exp", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
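For instance, reusing the coalition example from above (a sketch; the exact components stored in z.out can vary across Zelig versions):

names(z.out)         # list the components of the output object
z.out$coefficients   # the estimated coefficients
summary(z.out)       # a default summary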
The exponential function is part of the survival library by Terry Therneau, ported to R by Thomas Lumley. Advanced users may wish to refer to help(survfit) in the survival library.
Gamma Regression for Continuous, Positive Dependent Variables
Use the gamma regression model if you have a positive-valued dependent variable such as the number of years a parliamentary cabinet endures, or the seconds you can stay airborne while jumping. The gamma distribution assumes that all waiting times are complete by the end of the study (censoring is not allowed).
With reference classes:
z5 <- zgamma$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "gamma", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out, x1 = NULL)
Attach the sample data:
data(coalition)
Estimate the model:
z.out <- zelig(duration ~ fract + numst2, model = "gamma", data = coalition)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2007.
## gamma: Gamma Regression for Continuous, Positive Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
View the regression output:
summary(z.out)
## Model:
## $by
## [1] 1
##
##
## Call: stats::glm(formula = duration ~ fract + numst2, family = Gamma("inverse"),
## data = .)
##
## Coefficients:
## (Intercept) fract numst2
## -0.012960 0.000115 -0.017387
##
## Degrees of Freedom: 313 Total (i.e. Null); 311 Residual
## Null Deviance: 301
## Residual Deviance: 272 AIC: 2430
## Next step: Use 'setx' method
Set the baseline values (with the ruling coalition in the minority) and the alternative values (with the ruling coalition in the majority) for X:
x.low <- setx(z.out, numst2 = 0)
x.high <- setx(z.out, numst2 = 1)
Simulate expected values (qi$ev) and first differences (qi$fd):
s.out <- sim(z.out, x = x.low, x1 = x.high)
summary(s.out)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 14.46 1.071 14.38 12.57 16.68
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 14.87 13.61 10.9 0.7095 49.69
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 19.16 1.102 19.11 17.19 21.34
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 18.85 17.62 14.12 0.9335 68.06
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 4.693 1.516 4.723 1.713 7.475
plot(s.out)
The Gamma distribution with scale parameter $\alpha$ has a stochastic component:

$$Y_i \sim \text{Gamma}(y_i \mid \lambda_i, \alpha), \qquad f(y_i) = \frac{1}{\alpha^{\lambda_i}\, \Gamma(\lambda_i)}\, y_i^{\lambda_i - 1} \exp\left( -\frac{y_i}{\alpha} \right)$$

for $\alpha, \lambda_i, y_i > 0$.

The systematic component is given by

$$\lambda_i = \frac{1}{x_i \beta}.$$

The expected values (qi$ev) are simulations of the mean of the stochastic component given draws of $\alpha$ and $\beta$ from their posteriors:

$$E(Y) = \alpha \lambda_i.$$

The predicted values (qi$pr) are draws from the gamma distribution for each given set of parameters $(\alpha, \lambda_i)$.

If x1 is specified, sim() also returns the differences in the expected values (qi$fd),

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x).$$
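To make the simulation step concrete, the sketch below reproduces one draw by hand; the parameter values and covariate profile are hypothetical, not Zelig output:

beta  <- c(-0.013, 0.00012, -0.017)  # hypothetical draw: intercept, fract, numst2
alpha <- 1.5                         # hypothetical shape parameter
x     <- c(1, 719, 1)                # a covariate profile

mu <- 1 / sum(x * beta)              # inverse link, as in Gamma("inverse")
pv <- rgamma(1000, shape = alpha, scale = mu / alpha)  # predicted values, mean mu
c(ev = mu, mean.pv = mean(pv))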
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "gamma", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
The gamma model is part of the stats package. Advanced users may wish to refer to help(glm) and help(family).
Logistic Regression for Dichotomous Dependent Variables
Logistic regression specifies a dichotomous dependent variable as a function of a set of explanatory variables.
With reference classes:
z5 <- zlogit$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "logit", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out, x1 = NULL)
Attaching the sample turnout dataset:
data(turnout)
Estimating parameter values for the logistic regression:
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2007.
## logit: Logistic Regression for Dichotomous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Setting values for the explanatory variables:
x.out1 <- setx(z.out1, age = 36, race = "white")
Simulating quantities of interest from the posterior distribution.
s.out1 <- sim(z.out1, x = x.out1)
summary(s.out1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.7486 0.01185 0.7486 0.7243 0.771
## pv
## 0 1
## [1,] 0.255 0.745
plot(s.out1)
Estimating the risk difference (and risk ratio) between low education (25th percentile) and high education (75th percentile), with all other variables held at their default values:
z.out2 <- zelig(vote ~ race + educate, model = "logit", data = turnout)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2007.
## logit: Logistic Regression for Dichotomous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
x.high <- setx(z.out2, educate = quantile(turnout$educate, prob = 0.75))
x.low <- setx(z.out2, educate = quantile(turnout$educate, prob = 0.25))
s.out2 <- sim(z.out2, x = x.high, x1 = x.low)
summary(s.out2)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.8229 0.0102 0.8232 0.8028 0.8413
## pv
## 0 1
## [1,] 0.195 0.805
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.7093 0.01318 0.7091 0.6834 0.7337
## pv
## 0 1
## [1,] 0.291 0.709
## fd
## mean sd 50% 2.5% 97.5%
## [1,] -0.1136 0.01203 -0.1135 -0.1379 -0.09082
plot(s.out2)
Let $Y_i$ be the binary dependent variable for observation $i$, which takes the value of either 0 or 1.

The stochastic component is given by

$$Y_i \sim \text{Bernoulli}(y_i \mid \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i},$$

where $\pi_i = \Pr(Y_i = 1)$.

The systematic component is given by:

$$\pi_i = \frac{1}{1 + \exp(-x_i \beta)},$$

where $x_i$ is the vector of explanatory variables for observation $i$ and $\beta$ is the vector of coefficients.

The expected values (qi$ev) for the logit model are simulations of the predicted probability of a success:

$$E(Y) = \pi_i = \frac{1}{1 + \exp(-x_i \beta)},$$

given draws of $\beta$ from its sampling distribution.

The predicted values (qi$pr) are draws from the Binomial distribution with mean equal to the simulated expected value $\pi_i$.

The first difference (qi$fd) for the logit model is defined as

$$\text{FD} = \Pr(Y = 1 \mid x_1) - \Pr(Y = 1 \mid x).$$

The risk ratio (qi$rr) is defined as

$$\text{RR} = \frac{\Pr(Y = 1 \mid x_1)}{\Pr(Y = 1 \mid x)}.$$
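The sketch below computes each quantity for a single hypothetical coefficient draw (values illustrative, not Zelig output):

beta <- c(-1.0, 0.03, 0.3)              # hypothetical draw: intercept, age, racewhite
x    <- c(1, 36, 1)                     # baseline profile
x1   <- c(1, 46, 1)                     # alternative profile

p  <- plogis(sum(x * beta))             # expected value: Pr(Y = 1 | x)
p1 <- plogis(sum(x1 * beta))            # Pr(Y = 1 | x1)
pv <- rbinom(1000, size = 1, prob = p)  # predicted values
fd <- p1 - p                            # first difference
rr <- p1 / p                            # risk ratio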
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "logit", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
The logit model is part of the stats package. Advanced users may wish to refer to help(glm) and help(family).
Log-Normal Regression for Duration Dependent Variables
The log-normal model describes an event's duration, the dependent variable, as a function of a set of explanatory variables. The log-normal model may take time-censored dependent variables, and allows the hazard rate to increase and decrease over time.
With reference classes:
z5 <- zlognorm$new()
z5$zelig(Surv(Y, C) ~ X, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Surv(Y, C) ~ X, model = "lognorm", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
Log-normal models require that the dependent variable be in the form Surv(Y, C), where Y and C are vectors of length $n$. For each observation $i$ in $1, \ldots, n$, the value $y_i$ is the duration (lifetime, for example) of each subject, and the associated $c_i$ is a binary variable such that $c_i = 1$ if the duration is not censored (e.g., the subject dies during the study) or $c_i = 0$ if the duration is censored (e.g., the subject is still alive at the end of the study). If C is omitted, all Y are assumed to be completed; that is, $c_i$ defaults to 1 for all observations.
In addition to the standard inputs, zelig() takes the following additional options for log-normal regression:

robust: defaults to FALSE. If TRUE, zelig() computes robust standard errors based on sandwich estimators and the options selected in cluster.

cluster: if robust = TRUE, you may select a variable to define groups of correlated observations. Let x3 be a variable that consists of either discrete numeric values, character strings, or factors that define strata. Then

z.out <- zelig(y ~ x1 + x2, robust = TRUE, cluster = "x3", model = "lognorm", data = mydata)

means that the observations can be correlated within the strata defined by the variable x3, and that robust standard errors should be calculated according to those clusters. If robust = TRUE but cluster is not specified, zelig() assumes that each observation falls into its own cluster.
Attach the sample data:
data(coalition)
Estimate the model:
z.out <- zelig(Surv(duration, ciep12) ~ fract + numst2, model = "lognorm", data = coalition)
## How to cite this model in Zelig:
## Matthew Owen, Olivia Lau, Kosuke Imai, Gary King. 2007.
## lognorm: Log-Normal Regression for Duration Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
View the regression output:
summary(z.out)
## Model:
## $by
## [1] 1
##
## Call:
## survival::survreg(formula = Surv(duration, ciep12) ~ fract +
## numst2, data = ., dist = "lognormal", model = FALSE)
##
## Coefficients:
## (Intercept) fract numst2
## 5.366670 -0.004438 0.559833
##
## Scale= 1.2
##
## Loglik(model)= -1078 Loglik(intercept only)= -1101
## Chisq= 46.58 on 2 degrees of freedom, p= 7.7e-11
## n= 314
## Next step: Use 'setx' method
Set the baseline values (with the ruling coalition in the minority) and the alternative values (with the ruling coalition in the majority) for X:
x.low <- setx(z.out, numst2 = 0)
x.high <- setx(z.out, numst2 = 1)
Simulate expected values (qi$ev) and first differences (qi$fd):
s.out <- sim(z.out, x = x.low, x1 = x.high)
summary(s.out)
plot(s.out)
Let $Y_i^*$ be the survival time for observation $i$, with density function $f(y)$ and corresponding distribution function $F(t) = \int_0^t f(y)\,dy$. This variable might be censored for some observations at a fixed time $y_c$ such that the fully observed dependent variable, $Y_i$, is defined as

$$Y_i = \begin{cases} Y_i^* & \text{if } Y_i^* \le y_c \\ y_c & \text{if } Y_i^* > y_c \end{cases}$$
The stochastic component is described by the distribution of the partially observed variable, $Y^*$. For the log-normal model, there are two equivalent representations:

$$Y_i^* \sim \text{LogNormal}(\mu_i, \sigma^2) \quad \text{or} \quad \log(Y_i^*) \sim \text{Normal}(\mu_i, \sigma^2),$$

where the parameters $\mu_i$ and $\sigma^2$ are the mean and variance of the Normal distribution. (Note that the output from zelig() parameterizes $\text{scale} = \sigma$.)
In addition, survival models like the log-normal have three additional properties. The hazard function $h(t)$ measures the probability of not surviving past time $t$ given survival up to $t$. In general, the hazard function is equal to $f(t)/S(t)$, where the survival function $S(t)$ represents the fraction still surviving at time $t$. The cumulative hazard function $H(t)$ describes the probability of dying before time $t$; in general, $H(t) = \int_0^t h(s)\,ds = -\log S(t)$. In the case of the log-normal model,

$$S(t) = 1 - \Phi\left( \frac{\log t - \mu_i}{\sigma} \right), \qquad h(t) = \frac{f(t)}{S(t)},$$

where $\Phi$ is the cumulative distribution function of the Normal distribution.
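Numerically, these functions can be evaluated with R's built-in log-normal routines (parameter values arbitrary):

mu    <- 5.4   # location parameter
sigma <- 1.2   # scale parameter
t     <- c(5, 20, 50)

S <- 1 - plnorm(t, meanlog = mu, sdlog = sigma)  # survival function
h <- dlnorm(t, meanlog = mu, sdlog = sigma) / S  # hazard: f(t) / S(t)
H <- -log(S)                                     # cumulative hazard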
The systematic component is described as:

$$\mu_i = x_i \beta.$$

The expected values (qi$ev) for the log-normal model are simulations of the expected duration:

$$E(Y_i) = \exp\left( \mu_i + \frac{\sigma^2}{2} \right),$$

given draws of $\beta$ and $\sigma$ from their sampling distributions.

The predicted value (qi$pr) is a draw from the log-normal distribution given simulations of the parameters $(\mu_i, \sigma)$.

The first difference (qi$fd) is

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x).$$
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. When $Y_i(t_i = 1)$ is censored rather than observed, we replace it with a simulation from the model given available knowledge of the censoring process. Variation in the simulations is due to two factors: uncertainty in the imputation process for censored $y_i^*$ and uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. When $Y_i(t_i = 1)$ is censored rather than observed, we replace it with a simulation from the model given available knowledge of the censoring process. Variation in the simulations is due to two factors: uncertainty in the imputation process for censored $y_i^*$ and uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(Surv(Y, C) ~ X, model = "lognorm", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
The log-normal model is part of the survival library by Terry Therneau, ported to R by Thomas Lumley. Advanced users may wish to refer to help(survfit) in the survival library.
Least Squares Regression for Continuous Dependent Variables
Use least squares regression analysis to estimate the best linear predictor for the specified dependent variables.
With reference classes:
z5 <- zls$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "ls", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
Attach sample data:
data(macro)
Estimate model:
z.out1 <- zelig(unem ~ gdp + capmob + trade, model = "ls", data = macro)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, and Olivia Lau. 2007.
## ls: Least Squares Regression for Continuous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Summarize regression coefficients:
summary(z.out1)
## Model:
## $by
## [1] 1
##
##
## Call:
## stats::lm(formula = unem ~ gdp + capmob + trade, data = .)
##
## Coefficients:
## (Intercept) gdp capmob trade
## 6.1813 -0.3236 1.4219 0.0199
##
## Next step: Use 'setx' method
Set explanatory variables to their default (mean/mode) values, with high (80th percentile) and low (20th percentile) values for the trade variable:
x.high <- setx(z.out1, trade = quantile(macro$trade, 0.8))
x.low <- setx(z.out1, trade = quantile(macro$trade, 0.2))
Generate first differences for the effect of high versus low trade on unemployment:
s.out1 <- sim(z.out1, x = x.high, x1 = x.low)
summary(s.out1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 5.434 0.1854 5.433 5.065 5.794
## pv
## mean sd 50% 2.5% 97.5%
## 1 5.434 0.1854 5.433 5.065 5.794
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 4.591 0.1859 4.591 4.223 4.939
## pv
## mean sd 50% 2.5% 97.5%
## 1 4.591 0.1859 4.591 4.223 4.939
## fd
## mean sd 50% 2.5% 97.5%
## 1 -0.8432 0.2323 -0.8312 -1.321 -0.4059
plot(s.out1)
Estimate a model with fixed effects for each country. Note that you do not need to create dummy variables, as the program will automatically parse the unique values in the selected variable into discrete levels:
z.out2 <- zelig(unem ~ gdp + trade + capmob + as.factor(country), model = "ls", data = macro)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, and Olivia Lau. 2007.
## ls: Least Squares Regression for Continuous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Set values for the explanatory variables, using the default mean/mode values, with country set to the United States and Japan, respectively:
x.US <- setx(z.out2, country = "United States")
x.Japan <- setx(z.out2, country = "Japan")
Simulate quantities of interest:
s.out2 <- sim(z.out2, x = x.US, x1 = x.Japan)
plot(s.out2)
The stochastic component is described by a density with mean $\mu_i$ and the common variance $\sigma^2$:

$$Y_i \sim f(y_i \mid \mu_i, \sigma^2).$$

The systematic component models the conditional mean as

$$\mu_i = x_i \beta,$$

where $x_i$ is the vector of covariates and $\beta$ is the vector of coefficients.
The least squares estimator is the best linear predictor of the dependent variable given $x_i$, and minimizes the sum of squared residuals, $\sum_{i=1}^n (Y_i - x_i \beta)^2$.
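The resulting closed form, $\hat\beta = (X^\top X)^{-1} X^\top Y$, can be checked against lm() directly; the sketch below reuses the macro data loaded above:

X <- cbind(1, macro$gdp, macro$capmob, macro$trade)  # design matrix with intercept
y <- macro$unem
beta.hat <- solve(t(X) %*% X, t(X) %*% y)            # normal equations
cbind(beta.hat, coef(lm(unem ~ gdp + capmob + trade, data = macro)))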
The expected value (qi$ev) is the mean of simulations from the stochastic component,

$$E(Y) = x_i \beta,$$

given a draw of $\beta$ from its sampling distribution.
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "ls", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).

From summary(z.out), you may extract:
coefficients: the parameter estimates with their associated standard errors, t-statistics, and p-values.

sigma: the square root of the estimated variance of the random error $\varepsilon$:

$$\hat\sigma^2 = \frac{\sum_{i=1}^n (Y_i - x_i \hat\beta)^2}{n - k},$$

where $n$ is the number of observations and $k$ is the number of estimated coefficients.
r.squared: the fraction of the variance explained by the model.
adj.r.squared: the above statistic, penalizing for an increased number of explanatory variables.
cov.unscaled: a matrix of unscaled covariances.
The least squares regression is part of the stats package by William N. Venables and Brian D. Ripley. In addition, advanced users may wish to refer to help(lm) and help(lm.fit).
Negative Binomial Regression for Event Count Dependent Variables
Use the negative binomial regression if you have a count of events for each observation of your dependent variable. The negative binomial model is frequently used to estimate over-dispersed event count models.
With reference classes:
z5 <- znegbin$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "negbin", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
Load sample data:
data(sanction)
Estimate the model:
z.out <- zelig(num ~ target + coop, model = "negbin", data = sanction)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2008.
## negbinom: Negative Binomial Regression for Event Count Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
summary(z.out)
## Model:
## $by
## [1] 1
##
##
## Call: MASS::glm.nb(formula = num ~ target + coop, data = ., init.theta = 1.841603403,
## link = log)
##
## Coefficients:
## (Intercept) target coop
## -1.564 0.151 1.286
##
## Degrees of Freedom: 77 Total (i.e. Null); 75 Residual
## Null Deviance: 237
## Residual Deviance: 56.5 AIC: 360
## Next step: Use 'setx' method
Set values for the explanatory variables to their default mean values:
x.out <- setx(z.out)
Simulate fitted values:
s.out <- sim(z.out, x = x.out)
summary(s.out)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 2.975 0.3322 2.959 2.376 3.703
## pv
## qi
## 0 1 2 3 4 5 6 7 8 9 10 11
## 0.147 0.184 0.157 0.120 0.112 0.084 0.068 0.039 0.029 0.013 0.020 0.012
## 12 13 14 16
## 0.009 0.003 0.001 0.002
plot(s.out)
Let $Y_i$ be the number of independent events that occur during a fixed time period. This variable can take any non-negative integer value.

The negative binomial distribution is derived by letting the mean of the Poisson distribution vary according to a fixed parameter $\zeta$ given by the Gamma distribution. The stochastic component is given by

$$Y_i \mid \zeta_i \sim \text{Poisson}(\zeta_i \lambda_i), \qquad \zeta_i \sim \frac{1}{\theta}\, \text{Gamma}(\theta).$$

The marginal distribution of $Y_i$ is then the negative binomial with mean $\lambda_i$ and variance $\lambda_i + \lambda_i^2 / \theta$:

$$Y_i \sim \text{NegBin}(y_i \mid \lambda_i, \theta) = \frac{\Gamma(\theta + y_i)}{y_i!\, \Gamma(\theta)} \frac{\lambda_i^{y_i}\, \theta^\theta}{(\lambda_i + \theta)^{\theta + y_i}},$$

where $\theta$ is the systematic parameter of the Gamma distribution modeling $\zeta_i$.

The systematic component is given by

$$\lambda_i = \exp(x_i \beta),$$

where $x_i$ is the vector of explanatory variables and $\beta$ is the vector of coefficients.

The expected values (qi$ev) are simulations of the mean of the stochastic component. Thus,

$$E(Y) = \lambda_i = \exp(x_i \beta),$$

given simulations of $\beta$.

The predicted value (qi$pr) is drawn from the distribution defined by the set of parameters $(\lambda_i, \theta)$.

The first difference (qi$fd) is

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x).$$
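R's rnbinom() uses this same mean/dispersion parameterization, so the mean-variance relationship can be checked by simulation (values arbitrary; theta echoes the init.theta reported above):

lambda <- 3       # mean
theta  <- 1.84    # dispersion parameter
y <- rnbinom(1e5, size = theta, mu = lambda)
c(mean = mean(y), var = var(y), theory = lambda + lambda^2 / theta)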
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "negbin", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
Normal Regression for Continuous Dependent Variables
The Normal regression model is a close variant of the more standard least squares regression model (see the least squares vignette). Both models specify a continuous dependent variable as a linear function of a set of explanatory variables. The Normal model reports maximum likelihood (rather than least squares) estimates. The two models differ only in their estimate for the stochastic parameter $\sigma$.
With reference classes:
z5 <- znormal$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "normal", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
Attach sample data:
data(macro)
Estimate model:
z.out1 <- zelig(unem ~ gdp + capmob + trade, model = "normal", data = macro)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2008.
## normal: Normal Regression for Continuous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Summarize the regression coefficients:
summary(z.out1)
## Model:
## $by
## [1] 1
##
##
## Call: stats::glm(formula = unem ~ gdp + capmob + trade, family = gaussian("identity"),
## data = .)
##
## Coefficients:
## (Intercept) gdp capmob trade
## 6.1813 -0.3236 1.4219 0.0199
##
## Degrees of Freedom: 349 Total (i.e. Null); 346 Residual
## Null Deviance: 3660
## Residual Deviance: 2610 AIC: 1710
## Next step: Use 'setx' method
Set explanatory variables to their default (mean/mode) values, with high (80th percentile) and low (20th percentile) values for trade:
x.high <- setx(z.out1, trade = quantile(macro$trade, 0.8))
x.low <- setx(z.out1, trade = quantile(macro$trade, 0.2))
Generate first differences for the effect of high versus low trade on unemployment:
s.out1 <- sim(z.out1, x = x.high, x1 = x.low)
summary(s.out1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 5.43 0.1872 5.42 5.046 5.792
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 5.376 2.788 5.485 -0.1162 10.47
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 4.592 0.1781 4.592 4.254 4.948
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 4.723 2.73 4.727 -0.5893 10.02
## fd
## mean sd 50% 2.5% 97.5%
## [1,] -0.8381 0.2308 -0.8242 -1.309 -0.3974
A visual summary of quantities of interest:
plot(s.out1)
Let $Y_i$ be the continuous dependent variable for observation $i$.

The stochastic component is described by a univariate normal model with a vector of means $\mu_i$ and scalar variance $\sigma^2$:

$$Y_i \sim \text{Normal}(\mu_i, \sigma^2).$$

The systematic component is

$$\mu_i = x_i \beta,$$

where $x_i$ is the vector of explanatory variables and $\beta$ is the vector of coefficients.

The expected value (qi$ev) is the mean of simulations from the stochastic component,

$$E(Y) = \mu_i = x_i \beta,$$

given a draw of $\beta$ from its posterior.

The predicted value (qi$pr) is drawn from the distribution defined by the set of parameters $(\mu_i, \sigma)$.

The first difference (qi$fd) is:

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x).$$
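The distinction between expected and predicted values is visible in a single hypothetical draw (parameter values illustrative, not Zelig output):

beta  <- c(6.18, -0.32, 1.42, 0.02)       # hypothetical draw: intercept, gdp, capmob, trade
sigma <- 2.7                              # hypothetical draw of the error sd
x     <- c(1, 3, 0, 57)                   # a covariate profile

ev <- sum(x * beta)                       # expected value for this draw
pv <- rnorm(1000, mean = ev, sd = sigma)  # predicted values also carry sigma
quantile(pv, c(0.025, 0.975))             # much wider than the uncertainty in ev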
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "normal", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
The normal model is part of the stats package. Advanced users may wish to refer to help(glm) and help(family).
Poisson Regression for Event Count Dependent Variables
Use the Poisson regression model if the observations of your dependent variable represent the number of independent events that occur during a fixed period of time (see the negative binomial model for over-dispersed event counts). For a Bayesian implementation of this model, see the Bayesian Poisson regression vignette.
With reference classes:
z5 <- zpoisson$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "poisson", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
Load sample data:
data(sanction)
Estimate Poisson model:
z.out <- zelig(num ~ target + coop, model = "poisson", data = sanction)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2007.
## poisson: Poisson Regression for Event Count Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
summary(z.out)
## Model:
## $by
## [1] 1
##
##
## Call: stats::glm(formula = num ~ target + coop, family = poisson("log"),
## data = .)
##
## Coefficients:
## (Intercept) target coop
## -0.968 -0.021 1.211
##
## Degrees of Freedom: 77 Total (i.e. Null); 75 Residual
## Null Deviance: 1580
## Residual Deviance: 721 AIC: 944
## Next step: Use 'setx' method
Set values for the explanatory variables to their default mean values:
x.out <- setx(z.out)
Simulate fitted values:
s.out <- sim(z.out, x = x.out)
summary(s.out)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 3.244 0.2324 3.239 2.799 3.732
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 3.32 1.843 3 0 7
plot(s.out)
Let $Y_i$ be the number of independent events that occur during a fixed time period. This variable can take any non-negative integer value.

The Poisson distribution has stochastic component

$$Y_i \sim \text{Poisson}(\lambda_i) = \frac{\exp(-\lambda_i)\, \lambda_i^{y_i}}{y_i!},$$

where $\lambda_i$ is the mean and variance parameter.

The systematic component is

$$\lambda_i = \exp(x_i \beta),$$

where $x_i$ is the vector of explanatory variables, and $\beta$ is the vector of coefficients.

The expected value (qi$ev) is the mean of simulations from the stochastic component,

$$E(Y) = \lambda_i = \exp(x_i \beta),$$

given draws of $\beta$ from its sampling distribution.

The predicted value (qi$pr) is a random draw from the Poisson distribution defined by mean $\lambda_i$.

The first difference in the expected values (qi$fd) is given by:

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x).$$
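A single simulation draw can be sketched by hand (parameter values illustrative, not Zelig output):

beta <- c(-0.97, -0.02, 1.21)      # hypothetical draw: intercept, target, coop
x    <- c(1, 2, 2)                 # a covariate profile
lambda <- exp(sum(x * beta))       # inverting the log link
pv <- rpois(1000, lambda)          # predicted counts
c(ev = lambda, mean.pv = mean(pv))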
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "poisson", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
The Poisson model is part of the stats package. Advanced users may wish to refer to help(glm) and help(family).
Probit Regression for Dichotomous Dependent Variables
Use probit regression to model binary dependent variables specified as a function of a set of explanatory variables.
With reference classes:
z5 <- zprobit$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "probit", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out, x1 = NULL)
Attach the sample turnout dataset:
data(turnout)
Estimate parameter values for the probit regression:
z.out <- zelig(vote ~ race + educate, model = "probit", data = turnout)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2007.
## probit: Probit Regression for Dichotomous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
summary(z.out)
## Model:
## $by
## [1] 1
##
##
## Call: stats::glm(formula = vote ~ race + educate, family = binomial("probit"),
## data = .)
##
## Coefficients:
## (Intercept) racewhite educate
## -0.7259 0.2991 0.0971
##
## Degrees of Freedom: 1999 Total (i.e. Null); 1997 Residual
## Null Deviance: 2270
## Residual Deviance: 2140 AIC: 2140
## Next step: Use 'setx' method
Set values for the explanatory variables to their default values.
x.out <- setx(z.out)
Simulate quantities of interest from the posterior distribution.
s.out <- sim(z.out, x = x.out)
summary(s.out)
plot(s.out)
Let $Y_i$ be the observed binary dependent variable for observation $i$, which takes the value of either 0 or 1.

The stochastic component is given by

$$Y_i \sim \text{Bernoulli}(\pi_i),$$

where $\pi_i = \Pr(Y_i = 1)$.

The systematic component is

$$\pi_i = \Phi(x_i \beta),$$

where $\Phi$ is the cumulative distribution function of the Normal distribution with mean 0 and unit variance.

The expected value (qi$ev) is a simulation of the predicted probability of success

$$E(Y) = \pi_i = \Phi(x_i \beta),$$

given a draw of $\beta$ from its sampling distribution.

The predicted value (qi$pr) is a draw from a Bernoulli distribution with mean $\pi_i$.

The first difference (qi$fd) in expected values is defined as

$$\text{FD} = \Pr(Y = 1 \mid x_1) - \Pr(Y = 1 \mid x).$$

The risk ratio (qi$rr) is defined as

$$\text{RR} = \frac{\Pr(Y = 1 \mid x_1)}{\Pr(Y = 1 \mid x)}.$$
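The sketch below computes each quantity for one hypothetical coefficient draw (values illustrative, not Zelig output):

beta <- c(-0.73, 0.30, 0.10)   # hypothetical draw: intercept, racewhite, educate
x    <- c(1, 1, 12)            # baseline profile
x1   <- c(1, 1, 16)            # alternative profile

p  <- pnorm(sum(x * beta))     # expected value: Phi(x beta)
p1 <- pnorm(sum(x1 * beta))
pv <- rbinom(1000, 1, p)       # predicted values
c(fd = p1 - p, rr = p1 / p)    # first difference and risk ratio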
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_{i=1}^n t_i} \sum_{i: t_i = 1} \left\{ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right\},$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "probit", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and a default summary of information through summary(z.out).
The probit model is part of the stats package. Advanced users may wish to refer to help(glm) and help(family).
Rare Events Logistic Regression for Dichotomous Dependent Variables
The relogit procedure estimates the same model as standard logistic regression (appropriate when you have a dichotomous dependent variable and a set of explanatory variables; see the logit vignette), but the estimates are corrected for the bias that occurs when the sample is small or the observed events are rare (i.e., if the dependent variable has many more 1s than 0s, or the reverse). The relogit procedure also optionally uses prior correction for case-control sampling designs.
With reference classes:
z5 <- zrelogit$new()
z5$zelig(Y ~ X1 + X2, tau = NULL,
case.control = c("prior", "weighting"),
bias.correct = TRUE, robust = FALSE,
data = mydata, ...)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "relogit", tau = NULL,
case.control = c("prior", "weighting"),
bias.correct = TRUE, robust = FALSE,
data = mydata, ...)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
The relogit procedure supports several optional arguments in addition to the standard arguments for zelig(). You may additionally use:

tau: the true population fraction of events, $\tau$. A single value applies the case-control correction directly; a vector of two values is treated as lower and upper bounds on $\tau$ (see below).

case.control: the method of case-control correction, either "prior" (the default) or "weighting".

bias.correct: a logical value, defaulting to TRUE, indicating whether the coefficient estimates should be corrected for finite sample (rare events) bias.

Note that if tau = NULL and bias.correct = FALSE, the relogit procedure performs a standard logistic regression without any correction.
Due to memory and space considerations, the data used here are a sample drawn from the full data set used in King and Zeng (2001). The proportion of militarized interstate conflicts to the absence of disputes is 1,042 to 303,772. To estimate the model:
data(mid)
z.out1 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years, data = mid, model = "relogit", tau = 1042/303772)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, and Olivia Lau. 2014.
## relogit: Rare Events Logistic Regression for Dichotomous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Summarize the model output:
summary(z.out1)
## Model:
## $by
## [1] 1
##
##
## Call: relogit(formula = cbind(conflict, 1 - conflict) ~ major + contig +
## power + maxdem + mindem + years, data = ., tau = 0.00343020423212146,
## bias.correct = TRUE, case.control = "prior")
##
## Coefficients:
## (Intercept) major contig power maxdem
## -7.5084 2.4320 4.1080 1.0536 0.0480
## mindem years
## -0.0641 -0.0629
##
## Degrees of Freedom: 3125 Total (i.e. Null); 3119 Residual
## Null Deviance: 3980
## Residual Deviance: 1870 AIC: 1880
## Next step: Use 'setx' method
Set the explanatory variables to their means:
x.out1 <- setx(z.out1)
Simulate quantities of interest:
s.out1 <- sim(z.out1, x = x.out1)
summary(s.out1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.002388 0.000155 0.002383 0.002104 0.002702
## pv
## 0 1
## [1,] 0.998 0.002
plot(s.out1)
Suppose that we wish to perform case control correction using weighting (rather than the default prior correction). To estimate the model:

z.out2 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years, data = mid, model = "relogit", tau = 1042/303772, case.control = "weighting")
Summarize the model output:
summary(z.out2)
Set the explanatory variables to their means:
x.out2 <- setx(z.out2)
Simulate quantities of interest:
s.out2 <- sim(z.out2, x = x.out2)
summary(s.out2)
Suppose that we did not know $\tau$ exactly, but only that it was somewhere between 0.002 and 0.005. To estimate a model with a range of feasible estimates for $\tau$ (using the default prior correction method for case control correction):
z.out2 <- zelig(conflict ~ major + contig + power + maxdem + mindem + years, data = mid, model = "relogit", tau = c(0.002, 0.005))
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, and Olivia Lau. 2014.
## relogit: Rare Events Logistic Regression for Dichotomous Dependent Variables
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Summarize the model output:
z.out2
## Model:
## $by
## [1] 1
##
## $lower.estimate
##
## Call: relogit(formula = cbind(conflict, 1 - conflict) ~ major + contig +
##     power + maxdem + mindem + years, data = ., tau = 0.002)
##
## Coefficients:
## (Intercept) major contig power maxdem
## -8.0492 2.4320 4.1079 1.0536 0.0480
## mindem years
## -0.0641 -0.0629
##
## Degrees of Freedom: 3125 Total (i.e. Null); 3119 Residual
## Null Deviance: 3980
## Residual Deviance: 1870 AIC: 1880
##
## $upper.estimate
##
## Call: relogit(formula = cbind(conflict, 1 - conflict) ~ major + contig +
##     power + maxdem + mindem + years, data = ., tau = 0.005)
##
## Coefficients:
## (Intercept) major contig power maxdem
## -7.1300 2.4320 4.1080 1.0536 0.0480
## mindem years
## -0.0641 -0.0629
##
## Degrees of Freedom: 3125 Total (i.e. Null); 3119 Residual
## Null Deviance: 3980
## Residual Deviance: 1870 AIC: 1880
##
## $formula
## cbind(conflict, 1 - conflict) ~ major + contig + power + maxdem +
## mindem + years
## <environment: 0x7fd4fbf8e908>
##
## attr(,"class")
## [1] "Relogit2" "Relogit"
## Next step: Use 'setx' method
Set the explanatory variables to their means:
x.out2 <- setx(z.out2)
Simulate quantities of interest:
s.out2 <- sim(z.out2, x = x.out2)
summary(s.out2)
plot(s.out2)
The cost of giving a range of values for is that point estimates are not available for quantities of interest. Instead, quantities are presented as confidence intervals with significance less than or equal to a specified level (e.g., at least 95% of the simulations are contained in the nominal 95% confidence interval).
Like the standard logistic regression, the stochastic component for the rare events logistic regression is:

$$Y_i \sim \text{Bernoulli}(\pi_i),$$

where $Y_i$ is the binary dependent variable and takes a value of either 0 or 1.

The systematic component is:

$$\pi_i = \frac{1}{1 + \exp(-x_i \beta)}.$$
If the sample is generated via a case-control (or choice-based) design, such as when drawing all events (or “cases”) and a sample from the non-events (or “controls”) and going backwards to collect the explanatory variables, you must correct for selecting on the dependent variable. While the slope coefficients are approximately unbiased, the constant term may be significantly biased. Zelig has two methods for case control correction:
The "prior correction" method adjusts the intercept term. Let $\tau$ be the true population fraction of events, $\bar{y}$ the fraction of events in the sample, and $\hat\beta_0$ the uncorrected intercept term. The corrected intercept $\beta_0$ is:

$$\beta_0 = \hat\beta_0 - \ln\left[ \left( \frac{1 - \tau}{\tau} \right) \left( \frac{\bar{y}}{1 - \bar{y}} \right) \right].$$
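The adjustment itself is a one-liner; the sketch below uses the tau from the running example and hypothetical values for the sample fraction and uncorrected intercept:

tau  <- 1042 / 303772   # true population fraction of events
ybar <- 0.0165          # hypothetical sample fraction of events
b0   <- -5.0            # hypothetical uncorrected intercept
b0 - log(((1 - tau) / tau) * (ybar / (1 - ybar)))   # corrected intercept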
The "weighting" method performs a weighted logistic regression to correct for a case-control sampling design. Let the 1 subscript denote observations for which the dependent variable is observed as a 1, and the 0 subscript denote observations for which the dependent variable is observed as a 0. Then the vector of weights $w_i$ is

$$w_1 = \frac{\tau}{\bar{y}}, \qquad w_0 = \frac{1 - \tau}{1 - \bar{y}}, \qquad w_i = w_1 Y_i + w_0 (1 - Y_i).$$
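A sketch of constructing these weights (the outcome vector is hypothetical):

y    <- c(rep(1, 5), rep(0, 95))   # hypothetical sample: 5 events, 95 non-events
tau  <- 1042 / 303772              # true population fraction of events
ybar <- mean(y)                    # sample fraction of events
w1   <- tau / ybar                 # weight on observed events
w0   <- (1 - tau) / (1 - ybar)     # weight on observed non-events
wi   <- w1 * y + w0 * (1 - y)      # per-observation weights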
If $\tau$ is unknown, you may alternatively specify an upper and lower bound for the possible range of $\tau$. In this case, the relogit procedure uses "robust Bayesian" methods to generate a confidence interval (rather than a point estimate) for each quantity of interest. The nominal coverage of the confidence interval is at least as great as the actual coverage.
By default, estimates of the coefficients are bias-corrected to account for finite sample or rare events bias. In addition, quantities of interest, such as predicted probabilities, are also corrected for rare-events bias. If $\hat\beta$ are the uncorrected logit coefficients and $\text{bias}(\hat\beta)$ is the bias term, the corrected coefficients $\tilde\beta$ are

$$\tilde\beta = \hat\beta - \text{bias}(\hat\beta).$$

The bias term is

$$\text{bias}(\hat\beta) = (X^\top W X)^{-1} X^\top W \xi,$$

where

$$\xi_i = 0.5\, Q_{ii} \left( (1 + w_0)\, \hat\pi_i - w_0 \right), \qquad Q = X (X^\top W X)^{-1} X^\top, \qquad W = \mathrm{diag}\{ \hat\pi_i (1 - \hat\pi_i)\, w_i \},$$

and $w_0$, $w_1$, and $w_i$ are given in the "weighting" section above.
For either one or no value of $\tau$:

The expected values (qi$ev) for the rare events logit are simulations of the predicted probability

$$E(Y) = \pi_i = \frac{1}{1 + \exp(-x_i \beta)},$$

given draws of $\beta$ from its posterior.

The predicted value (qi$pr) is a draw from a binomial distribution with mean equal to the simulated $\pi_i$.

The first difference (qi$fd) is defined as

$$\text{FD} = \Pr(Y = 1 \mid x_1) - \Pr(Y = 1 \mid x).$$

The risk ratio (qi$rr) is defined as

$$\text{RR} = \frac{\Pr(Y = 1 \mid x_1)}{\Pr(Y = 1 \mid x)}.$$

For a range of $\tau$ defined by $[\tau_1, \tau_2]$, each quantity of interest is reported as a pair of bounds giving, respectively, the lower and upper limits of a confidence interval with nominal coverage at least as great as the actual coverage. At worst, these bounds are conservative estimates of the likely range of each quantity of interest. Please refer to King and Zeng (2001) for the specific method of calculating bounded quantities of interest.
In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_i t_i} \sum_{i : t_i = 1} \left[ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right],$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_i(t_i = 0)]$, the counterfactual expected value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.

In conditional prediction models, the average predicted treatment effect (att.pr) for the treatment group is

$$\frac{1}{\sum_i t_i} \sum_{i : t_i = 1} \left[ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right],$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups. Variation in the simulations is due to uncertainty in simulating $\widehat{Y_i(t_i = 0)}$, the counterfactual predicted value of $Y_i$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $t_i = 0$.
The output of each Zelig command contains useful information which you may view. For example, if you run z.out <- zelig(y ~ x, model = "relogit", data), then you may examine the available information in z.out by using names(z.out), see the coefficients by using z.out$coefficients, and view a default summary of information through summary(z.out).
The Stata version of ReLogit and the R implementation differ slightly in their coefficient estimates due to differences in the matrix inversion routines implemented in R and Stata. Zelig uses orthogonal-triangular decomposition (through lm.influence()) to compute the bias term, which is more numerically stable than standard matrix calculations.
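The hat-matrix diagonals $Q_{ii}$ that enter the bias term can be obtained through R's QR-based influence routines. A minimal sketch on built-in data follows; the toy model is ours, not part of Zelig:
# Toy logit fit on the built-in mtcars data. hatvalues() returns the
# diagonal of the hat matrix (equivalently, lm.influence(fit)$hat).
fit <- glm(vs ~ mpg + wt, family = binomial, data = mtcars)
Q.ii <- hatvalues(fit)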
Linear Regression for a Left-Censored Dependent Variable
Tobit regression estimates a linear regression model for a left-censored dependent variable, where the dependent variable is censored from below. While the classical tobit model has values censored at 0, you may select another censoring point. For other linear regression models with fully observed dependent variables, see Bayesian regression (normal.bayes), maximum likelihood normal regression (normal), or least squares (ls).
With reference classes:
z5 <- ztobit$new()
z5$zelig(Y ~ X1 + X2, below = 0, above = Inf, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, below = 0, above = Inf, model = "tobit", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
zelig() accepts the following arguments to specify how the dependent variable is censored.
below: (defaults to 0) The point at which the dependent variable is censored from below. If any values in the dependent variable are observed to be less than the censoring point, it is assumed that that particular observation is censored from below at the observed value. (See the tobit.bayes model for a Bayesian implementation that supports both left and right censoring.) A sketch with a nonzero censoring point appears after this list.
robust: defaults to FALSE. If TRUE, zelig() computes robust standard errors based on sandwich estimators and the options selected in cluster.
cluster: if robust = TRUE, you may select a variable to define groups of correlated observations. Let x3 be a variable that consists of either discrete numeric values, character strings, or factors that define strata. Then
z.out <- zelig(y ~ x1 + x2, robust = TRUE, cluster = "x3",
               model = "tobit", data = mydata)
means that the observations can be correlated within the strata defined by the variable x3, and that robust standard errors should be calculated according to those clusters. If robust = TRUE but cluster is not specified, zelig() assumes that each observation falls into its own cluster.
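For example, a censoring point other than 0 can be passed directly. This is a minimal sketch; Y, X1, X2, and mydata are hypothetical placeholders:
# Hypothetical data censored from below at 100 rather than at 0.
z.out <- zelig(Y ~ X1 + X2, below = 100, above = Inf,
               model = "tobit", data = mydata)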
Zelig users may wish to refer to help(survreg) for more information.
Attaching the sample dataset:
data(tobin)
Estimating linear regression using tobit:
z.out <- zelig(durable ~ age + quant, model = "tobit", data = tobin)
## How to cite this model in Zelig:
## Kosuke Imai, Gary King, Olivia Lau. 2011.
## tobit: Linear regression for Left-Censored Dependent Variable
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Setting values for the explanatory variables to their sample averages:
x.out <- setx(z.out)
Simulating quantities of interest from the posterior distribution given x.out:
s.out1 <- sim(z.out, x = x.out)
summary(s.out1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 1.547 0.6275 1.462 0.5676 2.978
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 3.291 4.331 1.637 0 13.96
Set explanatory variables to their default (mean/mode) values, with high (80th percentile) and low (20th percentile) liquidity ratio (quant):
x.high <- setx(z.out, quant = quantile(tobin$quant, prob = 0.8))
x.low <- setx(z.out, quant = quantile(tobin$quant, prob = 0.2))
Estimating the first difference for the effect of high versus low liquidity ratio on duration (durable):
s.out2 <- sim(z.out, x = x.high, x1 = x.low)
summary(s.out2)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 1.184 0.7182 1.069 0.1974 2.957
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 3.113 4.147 1.364 0 13.46
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 2.103 1.05 1.954 0.5592 4.579
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 3.759 4.338 2.412 0 14.35
## fd
## mean sd 50% 2.5% 97.5%
## 1 0.9197 1.204 0.8075 -1.227 3.563
plot(s.out1)
Let $Y_i^*$ be a latent dependent variable which is distributed with stochastic component

$$Y_i^* \sim \text{Normal}(\mu_i, \sigma^2),$$

where $\mu_i$ is a vector of means and $\sigma^2$ is a scalar variance parameter. $Y_i^*$ is not directly observed, however. Rather, we observe $Y_i$, which is defined as:

$$Y_i = \begin{cases} Y_i^* & \text{if } c < Y_i^* \\ c & \text{if } c \geq Y_i^* \end{cases}$$

where $c$ is the lower bound below which $Y_i^*$ is censored.

The systematic component is given by

$$\mu_i = x_i \beta,$$

where $x_i$ is the vector of explanatory variables for observation $i$ and $\beta$ is the vector of coefficients.
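The censoring mechanism is easy to see in a small simulation. This is illustrative only; all parameter values are invented:
set.seed(1)
x <- rnorm(200)
ystar <- 1 + 2 * x + rnorm(200)   # latent Y*
y <- ifelse(ystar > 0, ystar, 0)  # observed Y, censored below at c = 0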
The expected values (qi$ev) for the tobit regression model are the same as the expected value of $Y^*$:

$$E(Y^* \mid x_i) = \mu_i = x_i \beta$$

The first difference (qi$fd) for the tobit regression model is defined as

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x)$$
In conditional prediction models, the average expected treatment effect (qi$att.ev) for the treatment group is

$$\frac{1}{\sum_i t_i} \sum_{i : t_i = 1} \left[ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right],$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups.
The output of each Zelig command contains useful information which you may view. For example, if you run:
z.out <- zelig(y ~ x, model = "tobit", data)
then you may examine the available information in z.out.
The tobit function is part of the survival library by Terry Therneau, ported to R by Thomas Lumley. Advanced users may wish to refer to help(survfit) in the survival library.
Given some unobserved explanatory variables and observed dependent variables, the Normal theory factor analysis model estimates the latent factors. The model is implemented using a Markov Chain Monte Carlo algorithm (Gibbs sampling with data augmentation). For factor analysis with ordinal dependent variables, see ordered factor analysis (factor.ord), and for a mix of types of dependent variables, see the mixed factor analysis model (factor.mix).
With reference classes:
z5 <- zfactorbayes$new()
z5$zelig(cbind(Y1, Y2, Y3) ~ NULL, factors = 2, data = mydata)
With the Zelig 4 compatibility wrappers:
z.out <- zelig(cbind(Y1, Y2, Y3) ~ NULL, factors = 2,
               model = "factor.bayes", data = mydata)
In addition to the standard inputs, zelig() accepts additional arguments for factor.bayes to specify the model (e.g., factors, lambda.constraints), to control the Markov chain (e.g., burnin, mcmc, verbose), and to set the prior parameters (e.g., a0, b0).
Zelig users may wish to refer to help(MCMCfactanal) for more information.
Attaching the sample dataset:
data(swiss)
names(swiss) <- c("Fert", "Agr", "Exam", "Educ", "Cath", "InfMort")
Factor analysis:
z.out <- zelig(cbind(Agr, Exam, Educ, Cath, InfMort) ~ NULL,
model = "factor.bayes", data = swiss, factors = 2, verbose = TRUE,
a0 = 1, b0 = 0.15, burnin = 5000, mcmc = 50000)
Checking for convergence before summarizing the estimates:
algor <- try(geweke.diag(z.out$coefficients), silent = TRUE)
if (inherits(algor, "try-error"))
    print(algor)
Since the algorithm did not converge, we now add some constraints on $\Lambda$ to optimize the algorithm:
z.out <- zelig(cbind(Agr, Exam, Educ, Cath, InfMort) ~ NULL,
model = "factor.bayes", data = swiss, factors = 2,
lambda.constraints = list(Exam = list(1,"+"),
Exam = list(2,"-"), Educ = c(2, 0),
InfMort = c(1, 0)),
verbose = TRUE, a0 = 1, b0 = 0.15,
burnin = 5000, mcmc = 50000)
geweke.diag(z.out$coefficients)
heidel.diag(z.out$coefficients)
raftery.diag(z.out$coefficients)
summary(z.out)
Suppose for observation $i$ we observe $K$ variables and hypothesize that there are $d$ underlying factors such that:

$$Y_i = \Lambda \phi_i + \epsilon_i,$$

where $Y_i$ is the $K$-vector of manifest variables for observation $i$, $\Lambda$ is the $K \times d$ factor loading matrix, and $\phi_i$ is the $d$-vector of latent factor scores. Both $\Lambda$ and $\phi$ need to be estimated.

The stochastic component is given by:

$$\epsilon_i \sim \text{Normal}(0, \Psi),$$

where $\Psi$ is a diagonal, positive definite matrix. The diagonal elements of $\Psi$ are referred to as uniquenesses.

The systematic component is given by

$$\mu_i = E(Y_i) = \Lambda \phi_i$$

The independent conjugate prior for each element $\Lambda_{ij}$ is given by

$$\Lambda_{ij} \sim \text{Normal}(l_{0_{ij}}, L_{0_{ij}}^{-1}) \quad \text{for } i = 1, \ldots, K; \; j = 1, \ldots, d$$

The independent conjugate prior for each $\Psi_{ii}$ is given by

$$\Psi_{ii} \sim \text{InverseGamma}\left(\frac{a_0}{2}, \frac{b_0}{2}\right) \quad \text{for } i = 1, \ldots, K$$

The prior for $\phi_i$ is

$$\phi_i \sim \text{Normal}(0, I_d) \quad \text{for } i = 1, \ldots, n,$$

where $I_d$ is a $d \times d$ identity matrix.
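As an illustration of this data-generating process (not Zelig code; the dimensions and parameter values are invented):
set.seed(1)
K <- 5; d <- 2; n <- 100
Lambda <- matrix(rnorm(K * d), K, d)           # K x d factor loadings
Phi <- matrix(rnorm(n * d), n, d)              # latent factor scores
Psi <- diag(runif(K, 0.2, 0.5))                # diagonal uniquenesses
E <- matrix(rnorm(n * K), n, K) %*% chol(Psi)  # errors with cov Psi
Y <- Phi %*% t(Lambda) + E                     # n x K manifest variables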
The output of each Zelig command contains useful information which you may view. For example, if you run:
z.out <- zelig(cbind(Y1, Y2, Y3) ~ NULL, model = "factor.bayes", data)
then you may examine the available information in z.out by using names(z.out), see the draws from the posterior distribution of the coefficients by using z.out$coefficients, and view a default summary of information through summary(z.out). Other elements available through the $ operator are listed below.
Use Bayesian multinomial logistic regression to model unordered categorical variables. The dependent variable may be in the format of either character strings or integer values. The model is estimated via a random walk Metropolis algorithm or a slice sampler. See mlogit for the maximum-likelihood estimation of this model.
With reference classes:
z5 <- zmlogitbayes$new()
z5$zelig(Y ~ X1 + X2, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "mlogit.bayes", data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
In addition to the standard inputs, zelig() accepts additional arguments for mlogit.bayes to monitor the Markov chains (e.g., burnin, mcmc, verbose) and to specify the priors for the model.
Zelig users may wish to refer to help(MCMCmnl) for more information.
Attaching the sample dataset:
data(mexico)
Estimating multinomial logistic regression using mlogit.bayes:
z.out <- zelig(vote88 ~ pristr + othcok + othsocok,
model = "mlogit.bayes", data = mexico)
## Calculating MLEs and large sample var-cov matrix.
## This may take a moment...
## Inverting Hessian to get large sample var-cov matrix.
## Warning: the condition has length > 1 and only the first element will be used
## Warning: the condition has length > 1 and only the first element will be used
##
##
## MCMCmnl Metropolis iteration 1 of 11000
## beta =
## -2.43503
## -2.72348
## -0.70416
## -0.71276
## 0.99671
## 1.16367
## 0.52819
## 0.50504
## Metropolis acceptance rate for beta = 1.00000
##
##
## (MCMCmnl Metropolis iterations 1101 through 9901 omitted; the
## Metropolis acceptance rate for beta settles near 0.76)
##
## How to cite this model in Zelig:
## Ben Goodrich, Ying Lu. 2013.
## mlogitbayes: Bayesian Multinomial Logistic Regression for Dependent Variables with Unordered Categorical Values
## in Kosuke Imai, Gary King, and Olivia Lau, "Zelig: Everyone's Statistical Software,"
## http://zeligproject.org/
Checking for convergence before summarizing the estimates:
raftery.diag(z.out$coefficients)
summary(z.out)
Setting values for the explanatory variables to their sample averages:
x.out <- setx(z.out)
Simulating quantities of interest from the posterior distribution given x.out:
s.out1 <- sim(z.out, x = x.out)
summary(s.out1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## P(Y=1) 0.5613 0.01592 0.5615 0.5307 0.5915
## P(Y=2) 0.2099 0.01273 0.2098 0.1854 0.2351
## P(Y=3) 0.2288 0.01360 0.2286 0.2034 0.2558
## pv
## qi
## 1 2 3
## 0.5596 0.2088 0.2316
Estimating the first difference (and risk ratio) in the probabilities of voting for different candidates when pristr (the strength of the PRI) is set to be weak (equal to 1) versus strong (equal to 3), while all the other variables are held at their default values:
x.weak <- setx(z.out, pristr = 1)
x.strong <- setx(z.out, pristr = 3)
s.out2 <- sim(z.out, x = x.strong, x1 = x.weak)
summary(s.out2)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## P(Y=1) 0.7157 0.02128 0.7158 0.6726 0.7561
## P(Y=2) 0.1270 0.01458 0.1266 0.1001 0.1563
## P(Y=3) 0.1573 0.01646 0.1568 0.1261 0.1910
## pv
## qi
## 1 2 3
## 0.7179 0.1264 0.1557
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## P(Y=1) 0.4028 0.02358 0.4028 0.3563 0.4484
## P(Y=2) 0.3037 0.02131 0.3029 0.2638 0.3471
## P(Y=3) 0.2935 0.02189 0.2932 0.2518 0.3372
## pv
## qi
## 1 2 3
## 0.4051 0.2986 0.2963
## fd
## mean sd 50% 2.5% 97.5%
## P(Y=1) -0.3129 0.03460 -0.3129 -0.38111 -0.2443
## P(Y=2) 0.1767 0.02735 0.1765 0.12360 0.2314
## P(Y=3) 0.1362 0.02881 0.1363 0.07966 0.1936
Let $Y_i$ be the (unordered) categorical dependent variable for observation $i$ which takes integer values $j = 1, \ldots, J$.

The stochastic component is given by:

$$Y_i \sim \text{Multinomial}(Y_i \mid \pi_{ij}),$$

where $\pi_{ij} = \Pr(Y_i = j)$ for $j = 1, \ldots, J$.

The systematic component is given by

$$\pi_{ij} = \frac{\exp(x_i \beta_j)}{\sum_{k=1}^{J} \exp(x_i \beta_k)},$$

where $x_i$ is the vector of explanatory variables for observation $i$ and $\beta_j$ is the vector of coefficients for category $j$. Category $J$ is assumed to be the baseline category, with $\beta_J = 0$.
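The inverse link is the usual softmax. A small illustrative computation (all numbers invented) for $J = 3$ with category 3 as the baseline:
x <- c(1, 0.5, -1)               # covariates, including an intercept
beta <- list(c(0.2, 1.0, -0.5),  # beta_1
             c(-0.3, 0.4, 0.1),  # beta_2
             c(0, 0, 0))         # beta_3 = 0 (baseline)
eta <- sapply(beta, function(b) sum(x * b))
pr <- exp(eta) / sum(exp(eta))   # pi_ij for j = 1, 2, 3; sums to 1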
The prior for $\beta$ is given by

$$\beta \sim \text{Normal}\left(b_0, B_0^{-1}\right),$$

where $b_0$ is the vector of means for the explanatory variables and $B_0$ is the precision matrix (the inverse of a variance-covariance matrix).
The expected values (qi$ev) for the multinomial logistic regression model are the predicted probabilities of belonging to each category:

$$E(Y_i = j) = \pi_{ij} = \frac{\exp(x_i \beta_j)}{\sum_{k=1}^{J} \exp(x_i \beta_k)},$$

and

$$\sum_{j=1}^{J} \pi_{ij} = 1,$$

given the posterior draws of $\beta_j$ for all categories from the MCMC iterations.
The predicted values (qi$pr) are draws of $Y_i$ from a multinomial distribution whose parameters are the expected values (qi$ev) computed from the posterior draws of $\beta$ from the MCMC iterations.
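A single predicted-value draw can be sketched with rmultinom; the probabilities here are invented stand-ins for qi$ev:
pr <- c(0.56, 0.21, 0.23)                 # hypothetical expected values
draw <- rmultinom(1, size = 1, prob = pr) # one multinomial draw
pred <- which(draw == 1)                  # sampled category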
The first difference (qi$fd) in category $j$ for the multinomial logistic model is defined as

$$\text{FD}_j = \Pr(Y = j \mid x_1) - \Pr(Y = j \mid x)$$

The risk ratio (qi$rr) in category $j$ is defined as

$$\text{RR}_j = \Pr(Y = j \mid x_1) \, / \, \Pr(Y = j \mid x)$$
In conditional prediction models, the average expected treatment effect (qi$att.ev) for the treatment group in category $j$ is

$$\frac{1}{n_j} \sum_{i : t_i = 1} \left[ Y_i(t_i = 1) - E[Y_i(t_i = 0)] \right],$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups, and $n_j$ is the number of treated observations in category $j$.

In conditional prediction models, the average predicted treatment effect (qi$att.pr) for the treatment group in category $j$ is

$$\frac{1}{n_j} \sum_{i : t_i = 1} \left[ Y_i(t_i = 1) - \widehat{Y_i(t_i = 0)} \right],$$

where $t_i$ is a binary explanatory variable defining the treatment ($t_i = 1$) and control ($t_i = 0$) groups, and $n_j$ is the number of treated observations in category $j$.
The output of each Zelig command contains useful information which you may view. For example, if you run:
z.out <- zelig(y ~ x, model = "mlogit.bayes", data)
then you may examine the available information in z.out by using names(z.out), see the draws from the posterior distribution of the coefficients by using z.out$coefficients, and view a default summary of information through summary(z.out). Other elements available through the $ operator are listed below.
Bayesian multinomial logistic regression is part of the MCMCpack library by Andrew D. Martin and Kevin M. Quinn. The convergence diagnostics are part of the CODA library by Martyn Plummer, Nicky Best, Kate Cowles, Karen Vines, Deepayan Sarkar, and Russell Almond.