Do you wanna have a bad time? 'Cause if your errors are correlated, you are REALLY not going to like what happens next.

I’ll be teaching a quantitative public policy analysis class for Clemson University’s Policy Studies program and I’m finalizing a syllabus for it. My intuition is that such a class will have a different focus than a graduate-level quantitative methods class. The overlap is obviously substantial, but a graduate-level quantitative methods class cares more about statistical inference from sample statistics to population parameters under known assumptions (e.g. random sampling, the central limit theorem), whereas a quantitative public policy analysis class is more interested in causal inference and the magnitude of treatment effects. Further, the class will be a graduate-level introduction to a quantitative approach to policy analysis, and likely the first exposure students in the program get to statistics for the social sciences. The class will have to be gentle, but it must communicate the central concern of the policy analysis world: does a policy treatment work, and by how much?

There will have to be some discussion of endogeneity. Yes, that “E-word” is one that is easy to use to dismiss someone’s research. It’s so easy that invoking it may come across as a signal of laziness or contempt. Still, it’s an important topic because an endogenous treatment variable can undermine the kind of precision we want to communicate about treatment effects in a quantitative policy analysis. After all, endogeneity emerges when a treatment is correlated with the error term, and it’s something we would ideally address within a regression framework. This post offers an illustration of how to do that with an instrumental variable and a two-stage least squares (2SLS) regression.

First, let’s build a correlation matrix that communicates the correlations among four variables. The first, control, is a standard statistical control that is not terribly interesting to us as researchers, but we’ll include it anyway for a multiple regression. treat is the treatment of interest to us, and instr is a possible instrument for treat that we have in the data. e is the error term.

vars = c("control", "treat", "instr", "e")

# specified correlations among the control, the treatment,
# the instrument, and the error term
Correlations <- matrix(c(1, 0.001, 0.001, 0.001,
                         0.001, 1, 0.85, -0.5,
                         0.001, 0.85, 1, 0.001,
                         0.001, -0.5, 0.001, 1), nrow=4)
rownames(Correlations) <- colnames(Correlations) <- vars

The specified correlation matrix suggests the following relationships. First, control is fundamentally uncorrelated with anything else. Its correlations with the treatment variable, the potential instrument, and the errors are only .001. As a result, we are not too interested in this variable for the sake of this exercise. Second, the correlation between the treatment variable (treat) and the errors is -.5, a fairly large (however imprecise that language is) negative correlation between the treatment variable that most concerns us and the error term. This makes the treatment endogenous to the errors. Third, the correlation between the treatment variable and the potential instrument is strong; a correlation of .85 is a strong positive relationship. Finally, the correlation between the instrumental variable and the errors is only .001. That means the instrumental variable (instr) satisfies the exclusion restriction; it affects the outcome only through the treatment variable (treat).

We can generate some fake data to illustrate these correlations, though this exercise requires some Cholesky decomposition and more matrix-related stuff than I enjoy doing with data.

# number of observations to simulate
nobs = 1000

# Cholesky decomposition: U is the lower-triangular factor
# of the correlation matrix
U = t(chol(Correlations))
nvars = dim(U)[1]

# Jenny, I got your number...
set.seed(8675309)

# Random variables that follow the correlation matrix
rdata = matrix(rnorm(nvars*nobs, 0, 1), nrow=nvars, ncol=nobs)
X = U %*% rdata

# Transpose, convert to a data frame, then a tibble
library(tidyverse)
Data = t(X) %>% as.data.frame() %>% as_tibble()

The actual correlation matrix of the simulated data corresponds well enough with the specified correlation matrix.
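One quick way to verify this yourself:

# compare the sample correlations with the specified matrix
round(cor(Data), 3)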

A Correlation Matrix of the Data

             Control  Treatment  Instrument       e
Control        1.000      0.020       0.016   0.003
Treatment      0.020      1.000       0.854  -0.502
Instrument     0.016      0.854       1.000  -0.011
e              0.003     -0.502      -0.011   1.000

Let’s further assume that there is some outcome y that is a linear function of an intercept (or “constant”) plus control, treat, and the error term e, such that:

Data$y <- with(Data, 5 + 1*control + 1*treat + e)

In other words, the true underlying effect of both control and treat on the outcome y is 1, and the expected value of y when all predictors are at 0 is 5. A simple ordinary least squares model (i.e. M1 <- lm(y ~ control + treat, data=Data)) produces the following results.
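As its own block:

# naive OLS that ignores the correlation between treat and e
M1 <- lm(y ~ control + treat, data=Data)
summary(M1)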

A Simple OLS Model Where the Treatment is Correlated With the Errors

                          Dependent variable:
                              Y (Outcome)
Control                        1.012***
                               (0.027)
Treatment                      0.511***
                               (0.027)
Constant                       5.011***
                               (0.027)
Observations                    1,000
Adjusted R2                     0.644
Note: *p<0.1; **p<0.05; ***p<0.01

Notice the effect of the treatment variable is biased downward because of its negative correlation with the error term e. The true relationship is 1, but the coefficient is nowhere near it, and a 95% confidence interval around the coefficient won’t come anywhere close to 1 either.
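To see this, assuming the M1 object fit above:

# 95% confidence interval for the endogenous treatment;
# the true effect of 1 falls far outside it
confint(M1, "treat", level = .95)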

One solution here is to use an instrumental variable estimator for the affected treatment variable and employ a 2SLS regression. There are plenty of econometrics texts on what this estimator is doing, with ample statistical notation and theoretical discussion, but here is how someone more interested in the application should think about it.

First, take each endogenous variable and regress it on all the other exogenous variables and all the instrumental variables. These regressions generate predicted/fitted values for the endogenous variables from what an applied researcher can think of as a “first-stage regression.” This works because, in our case, all the explanatory variables in this first stage are uncorrelated with the error term, so the ensuing fitted/predicted values for the endogenous variable are also uncorrelated with the error term. The source of variation in the endogenous variable that was correlated with the error term gets absorbed into the error term of this first-stage regression. The “second stage” returns to the original OLS regression model but replaces the endogenous variables with their fitted values from the first stage. The estimators that follow are consistent (though not unbiased in small samples).

In our case, this pertains to just one variable (treat) that we know is endogenous. While econometrics textbooks can bombard entry-level students with notation and theory to communicate this point, the application in R makes it seem much more accessible.

# First-stage model: regress the endogenous treatment on all
# exogenous variables and the instrument
FSM <- lm(treat ~ control + instr, data=Data)

# Generate treat_hat from the first stage's fitted values
Data$treat_hat <- fitted(FSM)

# Second-stage model: substitute treat_hat for treat
SSM <- lm(y ~ control + treat_hat, data=Data)
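One caveat I should flag before the results: the standard errors from this manual second stage are not the correct 2SLS standard errors, because summary(SSM) computes residuals using the fitted values of the treatment rather than the observed treatment. For actual research, you would hand this off to a canned routine that makes the adjustment. Here is a minimal sketch, assuming the AER package is installed, that reproduces the same coefficients with corrected standard errors:

# 2SLS in one call with AER's ivreg(); the formula puts the
# structural model before the | and all exogenous variables
# plus instruments after it
library(AER)
IV <- ivreg(y ~ control + treat | control + instr, data=Data)
summary(IV)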

The following table shows the results of all three analyses.

A Comparison of OLS and Two-Stage Least Squares (2SLS) Regression

                            OLS (Endogenous    First-Stage    Second-Stage
                              Treatment)          Model           Model
                                 (1)               (2)             (3)
Control                        1.012***           0.006          1.003***
                               (0.027)           (0.017)         (0.016)
Treatment                      0.511***
                               (0.027)
Instrumental Variable                            0.847***
                                                 (0.016)
Treatment (fitted values)                                        0.987***
                                                                 (0.019)
Constant                       5.011***          -0.010          5.018***
                               (0.027)           (0.017)         (0.016)
Observations                    1,000             1,000           1,000
Adjusted R2                     0.644             0.729           0.870
Note: *p<0.1; **p<0.05; ***p<0.01

The first model is the OLS model that showed a clear downward bias in the coefficient for the treatment when the treatment is correlated with the error term. The true effect of the treatment on the response variable y is 1, but the OLS coefficient for the treatment is only .511. The first-stage model removes the variation in the treatment that is correlated with the error term by regressing the treatment variable on the control variable and the instrumental variable, which is correlated with the treatment but not the error term. This results in fitted values for the treatment (treat_hat) that are substituted for the endogenous treatment variable in the second-stage model. This second-stage model is identical in form to the OLS model, but with a treatment variable from which the sources of endogeneity have been stripped. The coefficient for this fitted treatment variable approaches 1, which is the true effect from the data-generating process.
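As a quick check (with the earlier caveat about the manually computed standard errors in mind), the interval around the fitted treatment now brackets the true effect:

# 95% CI around the fitted-treatment coefficient; it includes 1
confint(SSM, "treat_hat", level = .95)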

The goal for this post is to offer something more accessible to my future students in quantitative public policy analysis on how to deal with endogeneity in important treatment variables. There are a number of approaches here, but instrumental variables and 2SLS are particularly attractive. Econometrics textbooks can make this seem daunting, but students who learn more by application than by notation will find these tools relatively straightforward.