Abstract
This is a lab script for POST 8000, a graduate-level quantitative methods for public policy class that I teach at Clemson University. It will not be the most sophisticated R-related write-up of mine—check my blog for those—but it should be useful for discussion around the associated R script for the week’s ‘lab’ session. Some of what I write here is taken from my website, so be advised I might plagiarize myself here.
We’ll be using {stevedata} for some data and {tidyverse} for most things. {stevemisc} has the cor2data() function that I will just copy-paste here for convenience. The only real new package to install, if you’ve yet to install it already, is {estimatr}. This package provides a single wrapper for instrumental variable analysis that includes the important standard error corrections you cannot easily do manually. I will also cheat a little bit by showing you that you can do instrumental variable analysis in {brms}, though I won’t ask you to install that now.
First, if you’ve yet to install these packages, do so.
if_not_install <- function(packages) {
  new_pack <- packages[!(packages %in% installed.packages()[,"Package"])]
  if(length(new_pack)) install.packages(new_pack)
}
if_not_install(c("estimatr", "tidyverse", "stevedata"))
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stevedata)
library(estimatr)
Let’s bring the cor2data() function into the chat, so to speak.
cor2data <- function(cor, n, seed){
  # number of observations to simulate
  nobs = n
  # Cholesky decomposition
  U = t(chol(cor))
  nvars = dim(U)[1]
  if(missing(seed)) {
  } else {
    set.seed(seed)
  }
  # Random variables that follow the correlation matrix
  rdata = matrix(rnorm(nvars*nobs,0,1), nrow=nvars, ncol=nobs)
  X = U %*% rdata
  # Transpose, convert to data
  # require(tidyverse)
  Data = t(X)
  Data = as.data.frame(Data)
  return(Data)
}
We’re going to start with simulated data to show how instrumental variable (intermittently hereafter: “IV”) approaches can fix some endogeneity problems. Here, we have three variables of note. control is 100% unproblematic and will be uncorrelated with anything. treat is correlated with the instrument at .85 and correlated with the error term e at -.5. In other words, there is an endogeneity problem here with the treatment variable that interests us. instr is an instrument for the treated variable and is mercifully uncorrelated with e. In other words, this instrument will satisfy all three conditions for a valid instrument. It has a strong correlation with the treatment (relevance). It is uncorrelated with anything else (exclusion), and we will later codify that further by making sure it has no systematic influence on the outcome variable. Further, it is not correlated with the error term (exogeneity).
Let’s go ahead and set up these data.
= c("control", "treat", "instr", "e")
vars <- matrix(cbind(1, 0, 0, 0,
Cor 0, 1, 0.85, -0.5,
0, 0.85, 1, 0,
0, -0.5, 0, 1),nrow=4)
rownames(Cor) <- colnames(Cor) <- vars
<- cor2data(Cor, 1000, 8675309) %>% as_tibble() # Jenny I got your number... Fake
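As a quick sanity check (my addition, not in the original lab script), you can confirm the simulated data roughly reproduce the population correlations in Cor: the treat-instr correlation should be near .85 (relevance), treat-e near -.5 (the endogeneity problem), and instr-e near 0 (exogeneity).

# Quick sanity check: the sample correlations should approximate Cor.
Fake %>%
  select(control, treat, instr, e) %>%
  cor() %>%
  round(2)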
Now, let’s make an outcome variable, again keeping in mind the exclusion restriction. Here, we will create an outcome variable (y) as follows. The y-intercept (i.e. when control and treat are 0) is about 5. Both the control variable and the treatment variable have a coefficient of .5 (i.e. a one-unit change in either changes the outcome by .5). There is also that messy error term that will require some care downstream.
Fake$y <- with(Fake, 5 + .5*control + .5*treat + e)
Let’s lookie-loo (sic) at what we did. Because we played god creating these data, we can already anticipate some problems down the road. But let’s pretend we don’t have an endogeneity problem and just go-go-gadget a linear model here. Here’s what happens if we do.
summary(M1 <- lm(y ~ control + treat, data=Fake))
##
## Call:
## lm(formula = y ~ control + treat, data = Fake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3647 -0.5683 0.0285 0.5853 2.9872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.01113 0.02716 184.52 <2e-16 ***
## control 0.51103 0.02690 19.00 <2e-16 ***
## treat 0.01118 0.02664 0.42 0.675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8582 on 997 degrees of freedom
## Multiple R-squared: 0.2661, Adjusted R-squared: 0.2647
## F-statistic: 180.8 on 2 and 997 DF, p-value: < 2.2e-16
Notice that the control variable is 100% unaffected, but the treatment variable clearly is. Indeed, the nature of the problem reduces the coefficient to almost zero, with standard errors that will clearly cover it. In other words, the endogeneity problem—that we built into these data, btw—means we can’t discern a signal from the noise. The true effect is nowhere close to the coefficient and we even get a Type II error from an NHST standpoint.
You might ask if this is a simple omitted variable problem. What if we included that instrument? Well, we should know ahead of time that won’t fix anything and this isn’t a simple omitted variable problem. The instrument has no systematic relationship with the outcome. So, here’s what would happen if we did the equivalent of administering antibiotics for a viral infection.
summary(M2 <- lm(y ~ control + treat + instr, data=Fake))
##
## Call:
## lm(formula = y ~ control + treat + instr, data = Fake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06715 -0.21683 -0.01408 0.22417 1.16757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.00050 0.01022 489.14 <2e-16 ***
## control 0.51275 0.01012 50.65 <2e-16 ***
## treat -1.26909 0.01928 -65.81 <2e-16 ***
## instr 1.48613 0.01912 77.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.323 on 996 degrees of freedom
## Multiple R-squared: 0.8961, Adjusted R-squared: 0.8958
## F-statistic: 2864 on 3 and 996 DF, p-value: < 2.2e-16
Not only did that not fix it, it may have made things worse to the extent that you might now naively believe the treatment has a negative effect on the outcome. In other words, you misdiagnosed the problem and the prescription unsurprisingly did not make it better.
Because we satisfied the three assumptions, we can fix this with an IV approach. The process looks like this. First, regress your contaminated treatment on the instrument. Second, extract the fitted values from it. Third, regress the outcome on the fitted values from the previous regression. These “two stages” are why we call it “two-stage least squares (2SLS)” regression.
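In equation form, the two stages look like this (a sketch; treat_hat denotes the fitted values from the first stage):

treat = pi_0 + pi_1*instr + nu                         (first stage)
y = beta_0 + beta_1*control + beta_2*treat_hat + epsilon   (second stage)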
First, here’s the first-stage model. Notice the regression estimate is basically the correlation.
# First-stage model...
summary(FSM <- lm(treat ~ instr, data=Fake))
##
## Call:
## lm(formula = treat ~ instr, data = Fake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.58227 -0.35296 -0.01425 0.36274 1.43958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01020 0.01677 -0.608 0.543
## instr 0.84708 0.01632 51.908 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5303 on 998 degrees of freedom
## Multiple R-squared: 0.7297, Adjusted R-squared: 0.7294
## F-statistic: 2694 on 1 and 998 DF, p-value: < 2.2e-16
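To see that point about the first-stage estimate mirroring the correlation, here’s a quick check (my addition):

with(Fake, cor(treat, instr)) # compare with the instr coefficient above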
Next, extract the fitted values and add them to your data.
Fake %>%
  mutate(treat_hat = fitted(FSM)) -> Fake
Then, do the second-stage model with the fitted values as the stand-in for the contaminated treatment. Notice: we basically captured the “true” effects of the control and treatment on the outcome after de-contaminating the endogenous treatment.
summary(SSM <- lm(y ~ control + treat_hat, data=Fake))
##
## Call:
## lm(formula = y ~ control + treat_hat, data = Fake)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.94932 -0.50827 0.01289 0.51401 2.34976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.01813 0.02363 212.36 <2e-16 ***
## control 0.50502 0.02340 21.58 <2e-16 ***
## treat_hat 0.48547 0.02713 17.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7467 on 997 degrees of freedom
## Multiple R-squared: 0.4444, Adjusted R-squared: 0.4433
## F-statistic: 398.8 on 2 and 997 DF, p-value: < 2.2e-16
The only caveat that IV people will have with doing this manually is that the standard errors in the second stage depend in part on what you did in the first stage, and doing it manually has no way to account for that. You could write your own function to do this, but mercifully the iv_robust() function in the {estimatr} package will do this for you. When you’re using this function, the second stage goes to the left of the separator (|) while the first stage goes to the right of it. It assumes your treatment variable is just to the left of the separator as well. Do note there are a few different forms of standard error corrections you can do here, but I’m going with the default (which is HC2).
summary(IVM <- iv_robust(y ~ control + treat | instr + control, data=Fake))
##
## Call:
## iv_robust(formula = y ~ control + treat | instr + control, data = Fake)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 5.0180 0.03121 160.77 0.000e+00 4.9568 5.0793 997
## control 0.5021 0.03004 16.71 1.822e-55 0.4431 0.5610 997
## treat 0.4855 0.03554 13.66 4.582e-39 0.4158 0.5553 997
##
## Multiple R-squared: 0.03272 , Adjusted R-squared: 0.03078
## F-statistic: 236.9 on 2 and 997 DF, p-value: < 2.2e-16
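If you want to see how sensitive the results are to the particular correction, iv_robust() also takes a se_type argument. A quick sketch, swapping the HC2 default for HC3:

# Same model, different standard error correction.
summary(iv_robust(y ~ control + treat | instr + control,
                  data=Fake, se_type = "HC3"))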
If you wanted a peek at the Bayes world, you can also do this through the {brms} package. This is definitely an advanced topic. I offer it here to show that the amazing {brms} package offers a convenient wrapper for an IV analysis.
library(brms)
## Loading required package: Rcpp
## Loading 'brms' package (version 2.14.4). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
##
## Attaching package: 'brms'
## The following object is masked from 'package:stats':
##
## ar
bf1 <- bf(treat ~ instr) # first stage
bf2 <- bf(y ~ control + treat) # second stage

B1 <- brm(bf1 + bf2, data=Fake, refresh = 0,
          seed = 8675309) # by default, this should set rescor = TRUE, but I also don't know how to shut that up when spinning the doc.
## Setting 'rescor' to TRUE by default for this model
## Warning: In the future, 'rescor' will be set to FALSE by default for all models.
## It is thus recommended to explicitely set 'rescor' via 'set_rescor' instead of
## using the default.
## Compiling Stan program...
## Start sampling
## Running MCMC with 4 chains, at most 32 in parallel...
##
## Chain 1 finished in 22.1 seconds.
## Chain 2 finished in 23.3 seconds.
## Chain 3 finished in 24.5 seconds.
## Chain 4 finished in 25.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 23.7 seconds.
## Total execution time: 25.2 seconds.
How different/similar are these results?
broom::tidy(SSM)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.02 0.0236 212. 0.
## 2 control 0.505 0.0234 21.6 4.62e-85
## 3 treat_hat 0.485 0.0271 17.9 2.56e-62
broom::tidy(IVM)
## term estimate std.error statistic p.value conf.low conf.high
## 1 (Intercept) 5.0180309 0.03121322 160.76619 0.000000e+00 4.9567797 5.0792820
## 2 control 0.5020660 0.03004102 16.71268 1.822340e-55 0.4431151 0.5610169
## 3 treat 0.4855185 0.03554414 13.65959 4.582027e-39 0.4157686 0.5552684
## df outcome
## 1 997 y
## 2 997 y
## 3 997 y
summary(B1)
## Family: MV(gaussian, gaussian)
## Links: mu = identity; sigma = identity
## mu = identity; sigma = identity
## Formula: treat ~ instr
## y ~ control + treat
## Data: Fake (Number of observations: 1000)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## treat_Intercept -0.01 0.02 -0.04 0.02 1.00 2264 2214
## y_Intercept 5.02 0.03 4.96 5.08 1.00 2363 2389
## treat_instr 0.85 0.02 0.82 0.88 1.00 2156 2452
## y_control 0.51 0.01 0.49 0.53 1.00 3541 2932
## y_treat 0.49 0.04 0.42 0.55 1.00 2135 2263
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_treat 0.53 0.01 0.51 0.55 1.00 2673 2408
## sigma_y 0.99 0.03 0.93 1.04 1.00 1849 2340
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## rescor(treat,y) -0.94 0.00 -0.95 -0.94 1.00 2180 2247
##
## Samples were drawn using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
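If you want the frequentist comparison in one place, here’s a quick sketch (my addition) that stacks the tidied output:

# Stack the manual 2SLS and iv_robust() estimates for a side-by-side look.
bind_rows(
  broom::tidy(SSM) %>% mutate(model = "manual 2SLS"),
  broom::tidy(IVM) %>% select(term:p.value) %>% mutate(model = "iv_robust (HC2)")
) %>%
  filter(term != "(Intercept)") %>%
  select(model, term, estimate, std.error)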
Robust standard error estimation in this context will be less sanguine about your standard errors, but the effect here is slight. The Bayesian approach bakes this type of process into the model itself, also resulting in slightly more diffuse estimates. The main takeaway is that it’s a case of six-of-one here; still, be sure to explore how much your estimates might be affected by this process. There’s probably also a bootstrapping solution here within the IV approach, but I’d need to think about it a little further before going all Leeroy Jenkins on it for a lab script for graduate students.
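On that bootstrapping point, here is a rough sketch of what I have in mind (my addition; treat it as a starting point, not a vetted implementation): resample the rows, redo both stages on each resample, and use the spread of the bootstrapped coefficients as standard errors that respect the two-step uncertainty.

# Bootstrap both stages together so the second-stage uncertainty
# reflects the estimation error in the first stage.
set.seed(8675309)
boot_treat <- replicate(1000, {
  d <- Fake[sample(nrow(Fake), replace = TRUE), ]
  d$treat_hat <- fitted(lm(treat ~ instr, data = d))
  coef(lm(y ~ control + treat_hat, data = d))["treat_hat"]
})
sd(boot_treat) # compare with the HC2 standard error from iv_robust()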
Let’s do a real-world replication of an instrumental variable analysis, though I think the results here indicate only a partial replication. I acquired these data at the IPSA-FLACSO Mexico Summer School, a two-week program in Mexico City in 2019. There, I milled around getting some ideas on how to teach causal inference stuff while 1) also eating amazing food, 2) seeing one of the best cities on the whole planet, and 3) getting my employer to foot the bill for the opportunity. Please invite me back for stuff, FLACSO.
Dee04
## # A tibble: 9,227 x 8
## schoolid hispanic college black otherrace female register distance
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1032 0 1 1 0 1 1 4
## 2 1032 0 1 1 0 1 1 4
## 3 1032 0 0 0 1 1 1 4
## 4 1032 0 0 1 0 0 0 4
## 5 1032 0 0 1 0 1 0 4
## 6 1032 0 0 1 0 1 1 4
## 7 1037 0 1 0 1 0 1 4
## 8 1037 0 1 0 1 0 1 4
## 9 1037 0 1 0 1 1 0 4
## 10 1037 0 1 0 1 0 1 4
## # … with 9,217 more rows
?Dee04
The application here is a common one. Basically: does increased education have downstream economic and societal benefits? The particular DV here is whether the respondent is registered to vote. The DV here is binary and the model here is linear. Don’t ever do this. Ever. Not even if there’s a fire.
summary(D1 <- lm(register ~ college, data=Dee04, na.action=na.exclude))
##
## Call:
## lm(formula = register ~ college, data = Dee04, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7510 -0.5741 0.2490 0.4259 0.4259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.574061 0.007141 80.39 <2e-16 ***
## college 0.176930 0.009654 18.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4616 on 9225 degrees of freedom
## Multiple R-squared: 0.03513, Adjusted R-squared: 0.03502
## F-statistic: 335.9 on 1 and 9225 DF, p-value: < 2.2e-16
The results suggest we’re in orbit of what Dee is reporting in Table 1 (first column), but let’s back up a little bit here. What Dee is trying to do is unpack a classic case of a tricky endogeneity problem, albeit one of supreme importance (i.e. the causal effect of higher education in the United States). However, there are unobserved determinants of both college attendance and civic engagement. Certainly, respondents in the HS&B data had some civic values prior to college. Clearly, college entrance can’t explain something that happened before it.
Dee proposes an instrument that Card (1995) had actually introduced before. That is: the geographic availability of colleges near the respondent. Intuitively, more colleges closer to the respondent (esp. for students from disadvantaged backgrounds) should explain college attendance, but college proximity should not have any effect on adult outcomes.
Dee calculated a “distance” variable, which is the distance in miles between the respondent’s high school and the nearest college in the county at that time. There’s a lot to like about this instrument. Let’s start intuitively. distance affects college, which affects register. However, there’s no reason, at least theoretically, to contend a direct relationship between college distance and being registered to vote. Indeed, this is the “only through” language. distance should not directly correlate with register, “only through” college.
How might we check the assumptions? First, we can check the correlation between the instrument (distance) and the treatment (college) to assess relevance.
Dee04 %>% summarize(relevance = cor(distance, college))
## # A tibble: 1 x 1
## relevance
## <dbl>
## 1 -0.111
It suggests a weaker instrument on the relevance frontier, certainly not as strong as in our fake data. However, a correlation test will vindicate it.
cor.test(Dee04$distance, Dee04$college)
##
## Pearson's product-moment correlation
##
## data: Dee04$distance and Dee04$college
## t = -10.764, df = 9225, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13147886 -0.09117564
## sample estimates:
## cor
## -0.111373
How about the “exclusion” assumption? Here’s a simple question: is the distance variable correlated with the outcome variable?
Dee04 %>%
  summarize(exclusion = cor(distance, register))
## # A tibble: 1 x 1
## exclusion
## <dbl>
## 1 -0.0335
Yep. It’s a small correlation, but it’s one unlikely to be 0.
cor.test(Dee04$distance, Dee04$register)
##
## Pearson's product-moment correlation
##
## data: Dee04$distance and Dee04$register
## t = -3.2165, df = 9225, p-value = 0.001302
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05383772 -0.01307421
## sample estimates:
## cor
## -0.03346988
What about the other stuff?
Dee04 %>%
  select(-schoolid, -college) %>%
  select(distance, everything()) %>%
  gather(var, val, 2:ncol(.)) %>%
  group_by(var) %>%
  summarize(cor = cor(distance, val))
## Warning: attributes are not identical across measure variables;
## they will be dropped
## # A tibble: 5 x 2
## var cor
## * <chr> <dbl>
## 1 black -0.0899
## 2 female -0.0128
## 3 hispanic -0.0708
## 4 otherrace -0.0307
## 5 register -0.0335
At the risk of generalizing this code out any further, I’m going to guess that if the instrument has a non-zero correlation with the registration variable, it’s going to have a non-zero correlation with the others too (though perhaps not the gender variable). Indeed, you could test that yourself and arrive at the same conclusion.
So, basically, the exclusion restriction here is flunked. While these are not formal tests of the exclusion restriction, they are illustrative. You can use your head here too. Think of a case like the race/ethnicity variables and the instrument. Notwithstanding some spatial diversity (in a state like SC, there is a fairly sizable rural black population), black Americans and Hispanic Americans tend to be urban, residing in places where there is more economic opportunity. Colleges are part of that story as well. That’s not to say that educational and economic opportunity are neatly distributed in ways that square with that, but it’s a selection effect of a kind and it’s going to influence the correlation between the instrument and those categories. If a respondent is black or Hispanic, they’re more likely to be closer to colleges (i.e. the distance variable decreases). Makes sense, right? But it does mean we’re going to flunk the exclusion assumption.
I think we almost always do in some meaningful (if mostly symbolic, non-formal) way. Even in an ideal case, that “only through” language means there will likely be some correlation between the outcome and the instrument that will be evident, however naively you look for it. Indeed, here it is in our simple/ideal case.
Fake %>% select(y, instr) %>% cor()
## y instr
## y 1.0000000 0.4300242
## instr 0.4300242 1.0000000
However, that correlation is “only through” the treatment and you’ll recall there is no systematic effect of the instrument on the outcome. When you’re doing IV stuff, this is a story you tell and sell and not something you formally test.
Alrightie, let’s get to it then with a replication of Dee. I think we’re slightly off here because of the abbreviated nature of the data I got.
summary(D2 <- iv_robust(register ~ college + black + hispanic + female | distance + black + hispanic + female, data=Dee04))
##
## Call:
## iv_robust(formula = register ~ college + black + hispanic + female |
## distance + black + hispanic + female, data = Dee04)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 0.527284 0.04596 11.4725 2.907e-30 0.437191 0.61738 9222
## college 0.228787 0.08140 2.8108 4.952e-03 0.069234 0.38834 9222
## black 0.068605 0.01477 4.6456 3.438e-06 0.039657 0.09755 9222
## hispanic 0.034047 0.01501 2.2676 2.338e-02 0.004615 0.06348 9222
## female 0.005863 0.01004 0.5839 5.593e-01 -0.013821 0.02555 9222
##
## Multiple R-squared: 0.0348 , Adjusted R-squared: 0.03438
## F-statistic: 6.717 on 4 and 9222 DF, p-value: 2.149e-05
It suggests a civics return to education, as Dee reports.
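As a closing sketch (my addition), pull the college coefficient from the naive model and the IV model to see the difference directly:

# The IV estimate of the college effect (~.229) exceeds the naive LPM estimate (~.177).
broom::tidy(D1) %>% filter(term == "college")
broom::tidy(D2) %>% filter(term == "college")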