Quantile Regression for Continuous Dependent Variables
Use a linear programming implementation of quantile regression to estimate a linear predictor of the $\tau$th conditional quantile of the population.
z5 <- zquantile$new()
z5$zelig(Y ~ X1 + X2, weights = w, data = mydata, tau = 0.5)
z5$setx()
z5$sim()
With the Zelig 4 compatibility wrappers:
z.out <- zelig(Y ~ X1 + X2, model = "rq", weights = w, data = mydata, tau = 0.5)
x.out <- setx(z.out)
s.out <- sim(z.out, x = x.out)
In addition to the standard inputs, zelig() takes the following additional options for quantile regression:
Attach sample data, in this case a dataset pertaining to the efficiency of plants that convert ammonia to nitric acid. The dependent variable, stack.loss, is 10 times the percentage of ammonia that escaped unconverted:
data(stackloss)
Estimate model:
z.out1 <- zelig(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
model = "rq", data = stackloss,
tau = 0.5)
## How to cite this model in Zelig:
## Alexander D'Amour. 2008.
## quantile: Quantile Regression for Continuous Dependent Variables
## in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
## "Zelig: Everyone's Statistical Software," http://zeligproject.org/
Summarize regression coefficients:
summary(z.out1)
## Model:
##
## Call: z5$zelig(formula = stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
## data = stackloss, tau = 0.5)
##
## tau: [1] 0.5
##
## Coefficients:
## coefficients lower bd upper bd
## (Intercept) -39.6899 -41.6197 -29.6775
## Air.Flow 0.8319 0.5128 1.1412
## Water.Temp 0.5739 0.3218 1.4109
## Acid.Conc. -0.0609 -0.2135 -0.0289
## Next step: Use 'setx' method
Set explanatory variables to their default (mean/mode) values, with high (80th percentile) and low (20th percentile) values for the water temperature variable (the variable that indicates the temperature of water in the plant’s cooling coils):
x.high <- setx(z.out1, Water.Temp = quantile(stackloss$Water.Temp, 0.8))
x.low <- setx(z.out1, Water.Temp = quantile(stackloss$Water.Temp, 0.2))
Generate first differences for the effect of high versus low water temperature on stack loss:
s.out1 <- sim(z.out1, x = x.high, x1 = x.low)
summary(s.out1)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 19.1 1.57 19.2 15.9 22.1
## pv
## mean sd 50% 2.5% 97.5%
## 1 19.1 1.81 19.2 15.6 22.7
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 15.7 1.17 15.7 13.4 17.9
## pv
## mean sd 50% 2.5% 97.5%
## 1 15.7 1.31 15.7 13.2 18.2
## fd
## mean sd 50% 2.5% 97.5%
## 1 -3.45 2.08 -3.43 -7.52 0.619
plot(s.out1)
Graphs of Quantities of Interest for Quantile Regression
We can estimate a model of unemployment as a function of macroeconomic indicators and fixed effects for each country. Note that you do not need to create dummy variables by hand: wrapping country in as.factor() lets the program automatically parse the unique values of the selected variable into discrete levels.
data(macro)
z.out2 <- zelig(unem ~ gdp + trade + capmob + as.factor(country),
model = "rq", tau = 0.5,
data = macro)
## How to cite this model in Zelig:
## Alexander D'Amour. 2008.
## quantile: Quantile Regression for Continuous Dependent Variables
## in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
## "Zelig: Everyone's Statistical Software," http://zeligproject.org/
Set values for the explanatory variables, using the default mean/mode values, with country set to the United States and Japan, respectively:
x.US <- setx(z.out2, country = "United States")
x.Japan <- setx(z.out2, country = "Japan")
Simulate quantities of interest:
s.out2 <- sim(z.out2, x = x.US, x1 = x.Japan)
summary(s.out2)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 11.6 0.609 11.6 10.4 12.8
## pv
## mean sd 50% 2.5% 97.5%
## 1 11.6 0.61 11.6 10.4 12.8
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 6.85 0.404 6.85 6.08 7.65
## pv
## mean sd 50% 2.5% 97.5%
## 1 6.85 0.404 6.85 6.09 7.65
## fd
## mean sd 50% 2.5% 97.5%
## 1 -4.75 0.423 -4.75 -5.56 -3.95
plot(s.out2)
Graphs of Quantities of Interest for Quantile Regression
Using the Engel dataset (from the quantreg package) on food expenditure as a function of income, we can use the “quantile” model to estimate multiple conditional quantiles:
data(engel, package="quantreg")
z.out3 <- zelig(foodexp ~ income, model = "rq",
tau = seq(0.1, 0.9, by = 0.1), data = engel)
We can summarize the coefficient fits, or plot them to compare them to the least squares conditional mean estimator.
summary(z.out3)
plot(summary(z.out3))
Set the value of income to the top quartile and the bottom quartile of the income distribution for each fit:
x.bottom <- setx(z.out3, income=quantile(engel$income, 0.25))
x.top <- setx(z.out3, income=quantile(engel$income, 0.75))
Simulate quantities of interest for each fit simultaneously:
s.out3 <- sim(z.out3, x = x.bottom, x1 = x.top)
Summarize the simulated quantities of interest:
summary(s.out3)
The quantile estimator is best introduced by considering the sample median estimator and comparing it to the sample mean estimator. To find the mean of a sample, we solve for the quantity $\mu$ that minimizes the sum of squared residuals:
$$\mu = \arg\min_{\mu} \sum_i (y_i - \mu)^2$$
Estimating a quantile (here, the median) is similar, but we solve for the quantity $\xi$ that minimizes the sum of absolute residuals:
$$\xi = \arg\min_{\xi} \sum_i |y_i - \xi|$$
One can confirm the equivalence of these optimization problems and the standard mean and median operators by taking the derivative with respect to the argument and setting it to zero.
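The equivalence can also be checked numerically. The following is a minimal base R sketch, not part of Zelig: the simulated data and the names `sq_loss`, `abs_loss`, `mu_hat`, and `xi_hat` are illustrative assumptions. It minimizes each objective directly and compares the result with `mean()` and `median()`.

```r
# Minimal sketch: recover the sample mean and median by direct minimization.
set.seed(1)
y <- rexp(101)  # odd sample size, so the sample median is a unique data point

sq_loss  <- function(m) sum((y - m)^2)   # sum of squared residuals
abs_loss <- function(m) sum(abs(y - m))  # sum of absolute residuals

# One-dimensional numerical minimization over the range of the data
mu_hat <- optimize(sq_loss,  interval = range(y), tol = 1e-8)$minimum
xi_hat <- optimize(abs_loss, interval = range(y), tol = 1e-8)$minimum

# Compare the numerical minimizers with the usual estimators
print(c(minimizer = mu_hat, mean = mean(y)))
print(c(minimizer = xi_hat, median = median(y)))
```

The two pairs of numbers should agree up to the tolerance of the optimizer.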
The relationship between quantile regression and ordinary least squares regression is analogous to the relationship between the sample median and the sample mean, except we are now solving for the conditional median or conditional mean given covariates and a linear functional form. The optimization problems for the sample mean and median are then easily generalized to optimization problems for estimating conditional means or medians by replacing the scalar location parameter ($\mu$ for the mean, $\xi$ for the median) with a linear combination of covariates $X_i'\beta$:
$$\hat{\beta}_{\text{mean}} = \arg\min_{\beta} \sum_i (Y_i - X_i'\beta)^2$$
$$\hat{\beta}_{\text{median}} = \arg\min_{\beta} \sum_i |Y_i - X_i'\beta|$$
The conditional median problem can be generalized to provide any quantile of the conditional distribution, not just the median. We do this by weighting the absolute value function asymmetrically in proportion to the requested $\tau$th quantile:
$$\begin{aligned}
\hat{\beta}_{\tau} &= \arg\min_{\beta} \sum_i \rho_\tau(Y_i - X_i'\beta) \\
\rho_\tau(u) &= |u| \left[ \tau I(u > 0) + (1-\tau) I(u < 0) \right]
\end{aligned}$$
We call the asymmetric absolute value function a “check function”. This optimization problem has no closed form solution and is solved via linear programming.
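To make the check function concrete, here is a minimal base R sketch for the simplest case, a single constant with no covariates. It is an illustration only: the names `check_loss`, `total_loss`, and `q_hat` and the simulated data are assumptions, and the data-point search below stands in for the linear programming routine used in practice.

```r
# Check function: rho_tau(u) = u * (tau - I(u < 0)),
# i.e. tau * |u| for positive residuals and (1 - tau) * |u| for negative ones.
check_loss <- function(u, tau) u * (tau - (u < 0))

set.seed(1)
y   <- rnorm(101)
tau <- 0.25

# Total check loss of a candidate constant q
total_loss <- function(q) sum(check_loss(y - q, tau))

# The objective is piecewise linear and convex in q, so a minimizer
# can always be found among the observed data points.
q_hat <- y[which.min(vapply(y, total_loss, numeric(1)))]

# Compare with the empirical tau-th quantile (inverse of the empirical CDF)
print(c(minimizer = q_hat,
        quantile  = quantile(y, tau, type = 1, names = FALSE)))
```

With covariates, the constant `q` is replaced by a linear predictor, and the search over data points is replaced by a linear program.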
The minimization problem above now lets us define a conditional quantile estimator. Suppose that for a given set of covariates $x$, the response variable $Y$ has as true conditional probability distribution $f(\theta \mid x)$, where $f$ can be any probability density function parametrized by a vector of parameters $\theta$. This density function defines a value $Q_\tau(\theta \mid x)$, the true $\tau$th population quantile given $x$. We can write our conditional quantile estimator $\hat{Q}_\tau(\theta \mid x)$ as:
$$\hat{Q}_\tau(\theta \mid x) = x'\hat{\beta}_\tau$$
Here $\hat{\beta}_\tau$ is the vector that solves the check-function minimization problem above. Because we solve for the estimator without constructing a likelihood function, it is not straightforward to specify a systematic and stochastic component for conditional quantile estimates. However, systematic and stochastic components do emerge asymptotically in the large-$n$ limit. Asymptotically, the conditional quantile estimator $\hat{Q}_\tau$ is normally distributed, and can be written with stochastic component
$$\hat{Q}_\tau \sim \mathcal{N}(\mu, \sigma^2)$$
and systematic components
$$\mu = x'\hat{\beta}_\tau, \qquad \sigma^2 = \frac{\tau(1-\tau)}{n f^2},$$
where $n$ is the number of data points and $f$ is the true population density at the $\tau$th conditional quantile. Zelig uses this asymptotic approximation of stochastic and systematic components in simulation and numerically estimates the population density to derive $\sigma^2$. The simulation results should thus be treated with caution when using small datasets, as both this asymptotic approximation and the population density approximation can break down.
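The asymptotic approximation can be sketched in base R for the simplest case of a sample quantile with no covariates. This is an illustration under stated assumptions, not Zelig's internal code: the kernel density estimate from `density()` is an assumed stand-in for whatever density estimator Zelig uses, and the names `q_hat`, `f_hat`, `sigma2`, and `sims` are hypothetical.

```r
set.seed(2)
n   <- 500
tau <- 0.5
y   <- rnorm(n)  # simulated data for illustration only

# Point estimate of the tau-th quantile (plays the role of mu)
q_hat <- quantile(y, tau, names = FALSE)

# Numerically estimate the density at the quantile
# (assumption: a Gaussian kernel density estimate, linearly interpolated)
kde   <- density(y)
f_hat <- approx(kde$x, kde$y, xout = q_hat)$y

# Asymptotic variance: tau * (1 - tau) / (n * f^2)
sigma2 <- tau * (1 - tau) / (n * f_hat^2)

# Simulate from the asymptotic normal approximation
sims <- rnorm(1000, mean = q_hat, sd = sqrt(sigma2))
print(c(mu = q_hat, sigma = sqrt(sigma2)))
print(quantile(sims, c(0.025, 0.975)))
```

Because `f_hat` enters the variance squared and in the denominator, a poor density estimate in a small sample can distort `sigma2` considerably, which is the reason for the caution above.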
The expected value (qi$ev) is the mean of simulations from the stochastic component,
$$E(\hat{Q}_\tau) = \mu = x'\hat{\beta}_\tau,$$
given a draw of $\hat{\beta}_\tau$ from its sampling distribution. Variation in the expected value distribution comes from estimation uncertainty of $\hat{\beta}_\tau$.
The predicted value (qi$pr) is the result of a single draw from the stochastic component given a draw of $\hat{\beta}_\tau$ from its sampling distribution. The distribution of predicted values should be centered around the same place as the expected values but have larger variance because it includes both estimation uncertainty and fundamental uncertainty.
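The difference between the two quantities can be sketched in base R. Every number below is hypothetical and chosen only for illustration (the covariate profile `x`, the coefficient draws, and `sigma` are not estimates from any dataset), and the code is a simplified stand-in for what sim() does, not Zelig's implementation.

```r
set.seed(3)
S <- 5000      # number of simulations
x <- c(1, 3)   # hypothetical profile: intercept and one covariate value

# Hypothetical draws of the coefficient vector from its sampling distribution
# (independent normal draws for the intercept and the slope)
beta_draws <- cbind(rnorm(S, mean = 1.0, sd = 0.2),
                    rnorm(S, mean = 2.0, sd = 0.1))
sigma <- 0.5   # hypothetical standard deviation of the stochastic component

# Expected values: the systematic component for each coefficient draw
ev <- as.vector(beta_draws %*% x)

# Predicted values: one draw from the stochastic component per coefficient draw
pv <- rnorm(S, mean = ev, sd = sigma)

print(rbind(ev = c(mean = mean(ev), sd = sd(ev)),
            pv = c(mean = mean(pv), sd = sd(pv))))
```

Both sets of simulations share a center, while `pv` is more dispersed because it adds the variance of the stochastic component to the estimation uncertainty in `ev`.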
This model does not support conditional prediction.
The Zelig object stores fields containing everything needed to rerun the Zelig output, and all the results and simulations as they are generated. In addition to the summary commands demonstrated above, some simple utility functions (known as getters) provide easy access to the raw fields most commonly of use for further investigation.
In the example above z.out$getcoef() returns the estimated coefficients, z.out$getvcov() returns the estimated covariance matrix, and z.out$getpredict() provides predicted values for all observations in the dataset from the analysis.
The quantile regression model in Zelig builds on the quantreg package by Roger Koenker. In addition, advanced users may wish to refer to help(rq), help(summary.rq) and help(rq.object).
Koenker R (2017). quantreg: Quantile Regression. R package version 5.33, <URL: https://CRAN.R-project.org/package=quantreg>.