Generalized Estimating Equation for Gamma Regression
The GEE gamma model is similar to standard gamma regression, which is appropriate when you have an uncensored, positive-valued, continuous dependent variable such as the time until a parliamentary cabinet falls. Unlike standard gamma regression, GEE gamma allows for dependence within clusters, as in longitudinal data, although its use is not limited to panel data. GEE models make no distributional assumptions but require three specifications: a mean function, a variance function, and a “working” correlation matrix for the clusters, which models the dependence of each observation on the other observations in the same cluster. The “working” correlation matrix is a $T \times T$ matrix of correlations, where $T$ is the size of the largest cluster and the elements of the matrix are correlations between within-cluster observations. The appeal of GEE models is that they give consistent estimates of the parameters, and consistent estimates of the standard errors can be obtained using a robust “sandwich” estimator, even if the “working” correlation matrix is incorrectly specified. If the “working” correlation matrix is correctly specified, GEE models give more efficient estimates of the parameters. GEE models measure population-averaged effects as opposed to cluster-specific effects.
With reference classes:
z5 <- zgammagee$new()
z5$zelig(Y ~ X1 + X2, model = "gamma.gee",
         id = "X3", weights = w, data = mydata)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "gamma.gee",
               id = "X3", weights = w, data = mydata)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
where id is a variable which identifies the clusters. The data should be sorted by id and should be ordered within each cluster when appropriate.
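As a minimal sketch of the sorting requirement, the base R snippet below orders a small data frame by cluster and then by time within each cluster. The data frame and its column names (panel, unit, year, y) are hypothetical and are not part of Zelig.

```r
# Hypothetical unsorted panel: 3 units observed in 2 years each
panel <- data.frame(
  unit = c(2, 1, 3, 1, 3, 2),
  year = c(2001, 2002, 2001, 2001, 2002, 2002),
  y    = c(4.1, 2.7, 9.3, 3.0, 8.8, 5.2)
)

# Sort by cluster id first, then by time within each cluster
sorted.panel <- panel[order(panel$unit, panel$year), ]
rownames(sorted.panel) <- NULL
```

The sorted data frame could then be passed as the data argument, with id = "unit".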
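Use the corstr argument to specify the structure of the “working” correlations within clusters. It is passed on to geeglm in the geepack package, whose documented options are "independence", "exchangeable", "ar1", "unstructured", and "userdefined".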
Attaching the sample coalition dataset:
data(coalition)
Creating a variable identifying clusters and sorting the data by it:
coalition$cluster <- c(rep(1:62, 5), rep(63, 4))
sorted.coalition <- coalition[order(coalition$cluster), ]
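The cluster variable above is built purely from row positions, so it can be inspected without loading the data. The sketch below (base R only, with illustrative variable names) checks how many clusters it defines and how large they are. The vector has 62 × 5 + 4 = 314 elements, which presumably matches the number of rows in coalition.

```r
# Same construction as above, without needing the coalition data itself
cluster <- c(rep(1:62, 5), rep(63, 4))

n.obs      <- length(cluster)          # rows covered by the cluster vector
n.clusters <- length(unique(cluster))  # distinct cluster labels
sizes      <- table(cluster)           # observations per cluster
max.size   <- max(sizes)

# order() gathers the rows of each cluster into one contiguous block
sorted.cluster <- cluster[order(cluster)]
```

The counts agree with the "Number of clusters: 63 Maximum cluster size: 5" line in the model summary.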
Estimating model and presenting summary:
z.out <- zelig(duration ~ fract + numst2, model = "gamma.gee",
               id = "cluster", data = sorted.coalition,
               corstr = "exchangeable")
## How to cite this model in Zelig:
## Patrick Lam. 2011.
## gamma-gee: General Estimating Equation for Gamma Regression
## in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
## "Zelig: Everyone's Statistical Software," http://zeligproject.org/
summary(z.out)
## Model:
##
## Call:
## z5$zelig(formula = duration ~ fract + numst2, id = "cluster",
## corstr = "exchangeable", data = sorted.coalition)
##
## Coefficients:
##               Estimate    Std.err   Wald Pr(>|W|)
## (Intercept) -1.296e-02  1.268e-02  1.045   0.3067
## fract        1.149e-04  1.474e-05 60.761 6.44e-15
## numst2      -1.740e-02  6.294e-03  7.643   0.0057
##
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)   0.6231 0.04483
##
## Correlation: Structure = exchangeable  Link = identity
##
## Estimated Correlation Parameters:
##        Estimate Std.err
## alpha -0.008086 0.03363
## Number of clusters:   63   Maximum cluster size: 5
## Next step: Use 'setx' method
Setting the explanatory variables at their default values (mode for factor variables and mean for non-factor variables), with numst2 set first to 0 (no crisis) and then to 1 (crisis):
x.low <- setx(z.out, numst2 = 0)
x.high <- setx(z.out, numst2 = 1)
Simulating quantities of interest:
s.out <- sim(z.out, x = x.low, x1 = x.high)
summary(s.out)
##
##  sim x :
##  -----
## ev
##       mean    sd   50%  2.5% 97.5%
## [1,] 14.41 1.122 14.37 12.42  16.8
## pv
##       mean    sd   50%  2.5% 97.5%
## [1,] 14.56 17.89 8.322 0.068 65.74
##
##  sim x1 :
##  -----
## ev
##      mean    sd   50%  2.5% 97.5%
## [1,] 19.2 1.067 19.16 17.22 21.39
## pv
##       mean    sd   50%   2.5% 97.5%
## [1,] 19.55 25.76 10.18 0.1106 92.71
## fd
##       mean    sd   50%  2.5% 97.5%
## [1,] 4.793 1.564 4.858 1.619 7.811
Generate a plot of quantities of interest:
plot(s.out)
Suppose we have a panel dataset, with $Y_{it}$ denoting the positive-valued, continuous dependent variable for unit $i$ at time $t$. $Y_i$ is a vector or cluster of correlated data where $y_{it}$ is correlated with $y_{it'}$ for some or all $t, t'$. Note that the model assumes correlations within $i$ but independence across $i$.
The stochastic component is given by the joint and marginal distributions

$$Y_i \sim f(y_i \mid \lambda_i)$$
$$Y_{it} \sim g(y_{it} \mid \lambda_{it})$$

where $f$ and $g$ are unspecified distributions with means $\lambda_i$ and $\lambda_{it}$. GEE models make no distributional assumptions and only require three specifications: a mean function, a variance function, and a correlation structure.
The systematic component is the mean function, given by:

$$\lambda_{it} = \frac{1}{x_{it} \beta}$$

where $x_{it}$ is the vector of $k$ explanatory variables for unit $i$ at time $t$ and $\beta$ is the vector of coefficients.
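To make the mean function concrete, the sketch below assumes the inverse link (the canonical link for the gamma family) and plugs the coefficient estimates printed in the example summary into 1 / (x β). The value used for fract (717) is an assumed, illustrative value and is not taken from the data. The profile names are likewise hypothetical.

```r
# Coefficient estimates copied from the example summary above
beta <- c("(Intercept)" = -1.296e-02, fract = 1.149e-04, numst2 = -1.740e-02)

# Two covariate profiles; fract = 717 is an assumed illustrative value
profile.low  <- c(1, 717, 0)   # numst2 = 0
profile.high <- c(1, 717, 1)   # numst2 = 1

# Inverse link: the mean is the reciprocal of the linear predictor
lambda.low  <- 1 / sum(profile.low  * beta)
lambda.high <- 1 / sum(profile.high * beta)
```

With these inputs the two means come out near 14.4 and 19.2, in line with the simulated expected values in the example. Under the inverse link, a negative coefficient on numst2 therefore corresponds to a longer expected duration.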
The variance function is given by:

$$V_{it} = \lambda_{it}^2 = \frac{1}{(x_{it} \beta)^2}$$
The correlation structure is defined by a $T \times T$ “working” correlation matrix, where $T$ is the size of the largest cluster. Users must specify the structure of the “working” correlation matrix a priori. The “working” correlation matrix then enters the variance term for each $i$, given by:

$$V_i = \phi \, A_i^{1/2} R_i(\alpha) A_i^{1/2}$$

where $A_i$ is a $T \times T$ diagonal matrix with the variance function $V_{it} = \lambda_{it}^2$ as the $t$th diagonal element, $R_i(\alpha)$ is the “working” correlation matrix, and $\phi$ is a scale parameter. The parameters are then estimated via a quasi-likelihood approach.
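The variance term can be assembled by hand for a single cluster. The sketch below builds an exchangeable and an AR(1) “working” correlation matrix and combines each with the gamma variance function (the squared mean). All numeric values (the means, the correlation parameter 0.3, the scale 0.6) and all variable names are assumptions chosen for illustration.

```r
# Hypothetical cluster with three observations
lambda <- c(12, 15, 20)   # assumed fitted means
alpha  <- 0.3             # assumed correlation parameter
phi    <- 0.6             # assumed scale parameter
n.t    <- length(lambda)

# Exchangeable: every off-diagonal correlation equals alpha
R.exch <- matrix(alpha, n.t, n.t)
diag(R.exch) <- 1

# AR(1): correlation decays as alpha^|t - t'|
R.ar1 <- alpha^abs(outer(seq_len(n.t), seq_len(n.t), "-"))

# The diagonal matrix of variances is diag(lambda^2); because lambda > 0,
# its matrix square root is simply diag(lambda)
A.half <- diag(lambda)

# Variance term: scale * A^{1/2} R(alpha) A^{1/2}
V.exch <- phi * A.half %*% R.exch %*% A.half
V.ar1  <- phi * A.half %*% R.ar1  %*% A.half
```

In both cases the diagonal of the result is the scale times the squared mean; only the off-diagonal covariances depend on the chosen structure.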
In GEE models, if the mean is correctly specified but the variance and correlation structure are incorrectly specified, the parameter estimates, and thus the estimated mean function, remain consistent, and consistent estimates of the standard errors can be obtained via a robust “sandwich” estimator. Similarly, if the mean and variance are correctly specified but the correlation structure is incorrectly specified, the parameters can be estimated consistently and the standard errors can be estimated consistently with the sandwich estimator. If all three are specified correctly, then the estimates of the parameters are more efficient.
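As a rough illustration of the sandwich idea, the sketch below fits a gamma GLM with an inverse link to simulated clustered data, which amounts to a GEE with an independence “working” correlation, and then computes a cluster-robust covariance by hand in base R. The simulated data, the seed, the starting values, and all variable names are assumptions for illustration. This is not the code path used by Zelig or geepack, and it applies no small-sample correction.

```r
set.seed(1)

# Simulated clustered data (hypothetical): 50 clusters of 4 observations
n.clusters   <- 50
cluster.size <- 4
id <- rep(seq_len(n.clusters), each = cluster.size)
x  <- runif(n.clusters * cluster.size)

# A shared cluster-level multiplier induces within-cluster correlation
u       <- rep(rgamma(n.clusters, shape = 10, rate = 10), each = cluster.size)
mu.true <- u / (0.05 + 0.05 * x)
y       <- rgamma(length(x), shape = 2, rate = 2 / mu.true)

# Independence working correlation: an ordinary gamma GLM, inverse link.
# Starting values near the simulated truth keep the fitting stable.
fit <- glm(y ~ x, family = Gamma(link = "inverse"), start = c(0.05, 0.05))
X   <- model.matrix(fit)
mu  <- fitted(fit)

# "Bread": inverse of X' diag(mu^2) X (the scale parameter cancels out)
bread <- solve(crossprod(X * mu))

# "Meat": outer products of the per-cluster score sums X_i' (y_i - mu_i)
scores <- rowsum(X * (y - mu), group = id)
meat   <- crossprod(scores)

robust.cov <- bread %*% meat %*% bread
robust.se  <- sqrt(diag(robust.cov))
naive.se   <- sqrt(diag(vcov(fit)))   # model-based, ignores clustering
```

Comparing robust.se with naive.se shows how much the standard errors change once within-cluster dependence is taken into account, even though the point estimates are the same.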
All quantities of interest are for marginal means rather than joint means.
The method of bootstrapping generally should not be used in GEE models. If you must bootstrap, bootstrapping should be done within clusters, which is not currently supported in Zelig. For conditional prediction models, data should be matched within clusters.
The expected values (qi$ev) for the GEE gamma model are the mean:

$$E(Y) = \lambda_c = \frac{1}{x_c \beta}$$

given draws of $\beta$ from its sampling distribution, where $x_c$ is a vector of values, one for each independent variable, chosen by the user.
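The simulation behind the expected values can be sketched in base R: draw coefficient vectors from a multivariate normal approximation to the sampling distribution, then apply the inverse link at a chosen covariate profile. The estimates, covariance matrix, and covariate values below are hypothetical and are not taken from the coalition example. The first difference shown in the example output is computed here as the difference between expected values at two profiles.

```r
set.seed(42)

# Hypothetical estimates and covariance for (Intercept, x)
beta.hat <- c(0.05, 0.02)
V.hat    <- matrix(c( 4e-6, -1e-6,
                     -1e-6,  9e-6), nrow = 2, byrow = TRUE)

# Draw beta from N(beta.hat, V.hat) using a Cholesky factor
n.sims <- 1000
z      <- matrix(rnorm(n.sims * 2), nrow = n.sims)
draws  <- sweep(z %*% chol(V.hat), 2, beta.hat, "+")

# Covariate profiles chosen by the user (assumed values)
x.c <- c(1, 0.5)
x.1 <- c(1, 1.0)

# Expected values under the inverse link, one per draw
ev   <- as.vector(1 / (draws %*% x.c))
ev.1 <- as.vector(1 / (draws %*% x.1))

# First difference between the two profiles
fd <- ev.1 - ev

# Summary in the style of summary(s.out)
ev.summary <- c(mean = mean(ev), sd = sd(ev),
                quantile(ev, c(0.5, 0.025, 0.975)))
```

The inverse link is only sensible while the simulated linear predictor stays positive, which holds here because the assumed covariance is small relative to the estimates.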
The first difference (qi$fd) for the GEE gamma model is defined as

$$\text{FD} = E(Y \mid x_1) - E(Y \mid x).$$
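In conditional prediction models, the average expected treatment effect (att.ev) for the treatment group is

$$\frac{1}{\sum_{i=1}^{n} \sum_{t=1}^{T} tr_{it}} \sum_{i: tr_{it} = 1}^{n} \sum_{t: tr_{it} = 1}^{T} \left\{ Y_{it}(tr_{it} = 1) - E[Y_{it}(tr_{it} = 0)] \right\},$$

where $tr_{it}$ is a binary explanatory variable defining the treatment ($tr_{it} = 1$) and control ($tr_{it} = 0$) groups. Variation in the simulations is due to uncertainty in simulating $E[Y_{it}(tr_{it} = 0)]$, the counterfactual expected value of $Y_{it}$ for observations in the treatment group, under the assumption that everything stays the same except that the treatment indicator is switched to $tr_{it} = 0$.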
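A minimal base R sketch of this quantity follows, assuming an inverse link and a model with one treatment indicator and one covariate. The data, the coefficient draws, and every variable name are hypothetical. Independent normal draws are used only to keep the example short and are not how a fitted model's sampling distribution would normally be simulated.

```r
set.seed(7)

# Hypothetical data: outcome y, binary treatment tr, one covariate x
dat <- data.frame(
  y  = c(12.1, 30.4, 8.7, 22.0, 15.3, 41.2),
  tr = c(1, 0, 1, 1, 0, 1),
  x  = c(0.2, 0.9, 0.1, 0.6, 0.4, 0.8)
)

# Hypothetical draws of the (Intercept, tr, x) coefficients
n.sims <- 500
draws <- cbind(rnorm(n.sims,  0.08, 0.004),
               rnorm(n.sims, -0.02, 0.003),
               rnorm(n.sims, -0.03, 0.005))

# Treated observations with the treatment indicator switched to 0
treated <- dat[dat$tr == 1, ]
X0 <- cbind(1, 0, treated$x)

# Counterfactual expected values: one row per draw, one column per
# treated observation
ev0 <- 1 / (draws %*% t(X0))

# Observed outcome minus counterfactual mean, averaged over treated units
att.ev <- rowMeans(sweep(-ev0, 2, treated$y, "+"))
```

Each element of att.ev is one simulated value of the average effect, so its spread reflects only the uncertainty in the counterfactual expected values.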
The output of each Zelig command contains useful information which you may view. For example, if you run
z.out <- zelig(y ~ x, model = "gamma.gee", id, data)
then you may see a default summary of information through summary(z.out). Other elements available through the $ operator are listed below.
The geeglm function is part of the geepack package by Søren Højsgaard, Ulrich Halekoh and Jun Yan. Advanced users may wish to refer to help(geepack) and help(family).
Højsgaard S, Halekoh U and Yan J (2006). “The R Package geepack for Generalized Estimating Equations.” Journal of Statistical Software, 15/2, pp. 1-11.
Yan J and Fine JP (2004). “Estimating Equations for Association Structures.” Statistics in Medicine, 23, pp. 859-880.
Yan J (2002). “geepack: Yet Another Package for Generalized Estimating Equations.” R-News, 2/3, pp. 12-14.