Abstract
This is a lab script for POST 8000, a graduate-level quantitative methods for public policy class that I teach at Clemson University. It will not be the most sophisticated R-related write-up of mine (check my blog for those), but it should be useful for discussion around the associated R script for the week's "lab" session. We'll be using {tidyverse} for all things workflow and {stevedata} for toy data. You may also want to install {stevemisc} from Github, though we'll load its cor2data() function here so you don't have to do this.
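If you need to install anything first, something like this should do it. It's a one-time setup sketch; the GitHub path for {stevemisc} is my assumption, so adjust if yours differs.

# Not run: one-time installs. The svmiller/stevemisc GitHub path is an assumption on my part.
# install.packages(c("tidyverse", "stevedata", "remotes"))
# remotes::install_github("svmiller/stevemisc")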
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stevedata)
# You can either install and load stevemisc, or call this function randomly. Your pick.
cor2data <- function(cor, n, seed){
  # number of observations to simulate
  nobs = n
  # Cholesky decomposition
  U = t(chol(cor))
  nvars = dim(U)[1]
  if(missing(seed)) {

  } else {
    set.seed(seed)
  }
  # Random variables that follow the correlation matrix
  rdata = matrix(rnorm(nvars*nobs,0,1), nrow=nvars, ncol=nobs)
  X = U %*% rdata
  # Transpose, convert to data
  # require(tidyverse)
  Data = t(X)
  Data = as.data.frame(Data)
  return(Data)
}
Let's create a data set with the cor2data() function. This should look very similar to what we did the previous week with measurement error. We'll keep it simple here and have just the one predictor (x1), uncontaminated by any endogeneity concerns.
= c("x1", "e")
vars <- matrix(cbind(1, 0.001,
Cor 0.001, 1), nrow=2)
rownames(Cor) <- colnames(Cor) <- vars
<- cor2data(Cor, 1000, 8675309) %>% as_tibble() # Jenny I got your number... A
We'll set up a simple situation where the outcome y is a function of x1 and the error term: y = 1 + x1 + .5*e. The expected value of y is 1 when x1 is 0. The errors are kind of "skinny" too (scaled by .5), at least relative to the effect of x1 (a coefficient of 1).
A$y <- with(A, 1 + x1 + .5*e)
Okay then, let’s do a regression to show what we just did.
summary(M1 <- lm(y ~ x1, data=A))
##
## Call:
## lm(formula = y ~ x1, data = A)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9199 -0.3400 0.0034 0.3278 1.7039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.00204 0.01620 61.86 <2e-16 ***
## x1 0.99107 0.01582 62.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5122 on 998 degrees of freedom
## Multiple R-squared: 0.7973, Adjusted R-squared: 0.7971
## F-statistic: 3926 on 1 and 998 DF, p-value: < 2.2e-16
^ Nailed it.
Btw, if you just wanted to do a simple lm with one y on one x, ggplot could do this for you.
A %>%
  ggplot(.,aes(x1,y)) +
  geom_point() +
  geom_smooth(method ="lm")
## `geom_smooth()` using formula 'y ~ x'
One of the properties of the sample regression function (SRF) with one x is that the regression line travels through the means of y and x. Observe:
a_meany = mean(A$y)
a_meanx1 = mean(A$x1)

A %>%
  ggplot(.,aes(x1,y)) +
  geom_point() +
  geom_smooth(method ="lm") +
  geom_vline(xintercept = a_meanx1) +
  geom_hline(yintercept = a_meany)
## `geom_smooth()` using formula 'y ~ x'
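If you'd rather check this numerically than eyeball the plot, here's a quick sketch: the fitted value at the mean of x1 should match the mean of y, give or take floating-point noise.

# The fitted value at mean(x1) should equal mean(y) (within rounding)
predict(M1, newdata = data.frame(x1 = a_meanx1))
a_meany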
One of the assumptions is that the mean of the residuals is zero and that x is uncorrelated with the residuals.
A %>%
  mutate(resid = resid(M1)) %>%
  summarize(corx1 = cor(resid, x1),
            meanresid = mean(resid))
## # A tibble: 1 x 2
## corx1 meanresid
## <dbl> <dbl>
## 1 4.78e-17 8.38e-18
Always, always, always do some diagnostics on your OLS model. In practice, this means looking at your residuals and extracting your fitted values. Let's do that:
A %>%
  mutate(resid = resid(M1),
         fitted = fitted(M1)) -> A
A %>%
  ggplot(.,aes(fitted, resid)) +
  geom_point() +
  geom_smooth(method="lm") +
  geom_hline(yintercept = 0, linetype="dashed")
## `geom_smooth()` using formula 'y ~ x'
A %>%
  ggplot(.,aes(sample=resid)) + geom_qq()
A %>%
  mutate(absqresid = sqrt(abs(resid))) %>%
  ggplot(.,aes(fitted, absqresid)) +
  geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Some definitions: an outlier is an observation with an unusual value on y (i.e. a large residual), a leverage point is an observation with an unusual value on x, and an influential observation is one whose inclusion meaningfully changes the estimates, roughly the combination of the two. One way to assess influence is to compare the regression line with and without those observations; another quick check is to plot the standardized residuals against the leverages.
A %>%
  mutate(stdresid = rstandard(M1), # standardized residuals, via handy built-in function
         lev = hatvalues(M1)) %>% # leverages. Trust me, just do it this way. matrix algebra is a pain.
  ggplot(.,aes(lev, stdresid)) +
  geom_point() +
  geom_smooth(method ="loess")
## `geom_smooth()` using formula 'y ~ x'
One thing happening underneath the proverbial hood of these diagnostic plots is a Cook's distance score. Cook's distance is a measure of influence in OLS: higher values of Cook's D mean more influence. So what defines "high" and what makes an "influencer"? Some studies recommend a threshold of 4/n, or 4/(n-k-1). Alternatively: check whether it's above 1 if your data are sufficiently large. Values above those cutoffs get flagged as problematic. Even then, there are a lot of competing rules of thumb. My hot #take: this is a "use your head" problem, aided by a graph of Cook's D scores. And you'll need to eyeball this, as I've already done:
tibble(obs = 1:1000,
cooksd = cooks.distance(M1)) %>%
bind_cols(A, .) -> A
A %>%
  ggplot(.,aes(obs, cooksd)) + geom_bar(stat="identity") +
  geom_hline(yintercept = .01, linetype="dashed") +
  geom_text(data=subset(A, cooksd>.01), aes(y=cooksd, label=obs), size=4)
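For reference, the rule-of-thumb cutoffs mentioned above are easy to compute for this model, even though the dashed line in that plot is just an eyeballed .01. A quick sketch:

# Rule-of-thumb Cook's D cutoffs for this model; heuristics, not laws
n <- nobs(M1)
k <- length(coef(M1)) - 1
4/n           # = .004 here
4/(n - k - 1) # about .004 as well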
Traditional logic (and this is still valid): if you have problematic observations/influencers, punt them out, re-run the model, and compare the results. My hot #take: if you do that, the Cook's D scores for the new regression will change as well. Cook's D is relative to the model it came from. Basically: take a look, find influencers/outliers, but don't wed yourself to Cook's D as a diagnostic. Base R has several built-in plots that will do this for you. They won't be as pretty, and you shouldn't present them "as is", but they're convenient. FYI: I hate built-in functions that make me hit return to see the next plot. It's why I did this the ggplot way.
plot(M1)
This will return the four biggies that I described above, but you could grab an individual one, if you’d like. See:
plot(M1, which=1)
Want the Cook's D plot? That's no. 4.
plot(M1, which=4)
There is one other too:
plot(M1, which=6)
It plots the leverage points against Cook’s D. Never have been a big fan of this plot.
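One last base-graphics aside: if you want all four default plots on one device without the hit-return dance, a minimal sketch:

# Show the four default lm() diagnostic plots in a 2x2 grid (no prompting), then reset
par(mfrow = c(2, 2))
plot(M1)
par(mfrow = c(1, 1))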
Let’s replicate Newhouse (1977) now. I like this regression for this purpose not because it’s good, but because you won’t find many political science articles where the full damn data for the regression is in the article itself. That’s pretty cool and the topic is still accessible even if the analysis is clearly dated.
The data are in {stevedata}
as follows.
Newhouse77
## # A tibble: 13 x 5
## country year gdppc medsharegdp medexppc
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 1972 3769 5.8 219
## 2 Austria 1972 2747 5.46 150
## 3 Canada 1971 4317 7.02 303
## 4 Finland 1972 2869 5.31 152
## 5 France 1970 2851 6.4 182
## 6 GFR 1968 2246 6.07 136
## 7 Greece 1972 1374 2.51 34
## 8 Italy 1972 2164 6.12 132
## 9 Netherlands 1972 3437 6.78 233
## 10 Norway 1972 3889 5.18 201
## 11 Sweden 1971 4431 7.38 327
## 12 United Kingdom 1972 2742 4.45 122
## 13 United States 1972 5551 6.51 361
?Newhouse77 # for background info
Okie dokie. Let's reproduce Table 2 now. Here, Newhouse is regressing medical care expenditures per capita on GDP per capita. He does go-go-gadget OLS on the full data, then does another analysis that excludes Greece, which is apparently an outlier that can monkey with the inferences.
summary(M1 <- lm(medexppc ~ gdppc , data=Newhouse77))
##
## Call:
## lm(formula = medexppc ~ gdppc, data = Newhouse77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.850 -15.866 -5.825 22.133 38.424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60.722513 23.598989 -2.573 0.0259 *
## gdppc 0.078831 0.006873 11.470 1.85e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.67 on 11 degrees of freedom
## Multiple R-squared: 0.9228, Adjusted R-squared: 0.9158
## F-statistic: 131.6 on 1 and 11 DF, p-value: 1.849e-07
summary(M2 <- lm(medexppc ~ gdppc, data=subset(Newhouse77, country !="Greece")))
##
## Call:
## lm(formula = medexppc ~ gdppc, data = subset(Newhouse77, country !=
## "Greece"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.814 -16.364 3.409 18.844 39.803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51.118202 29.167401 -1.753 0.11
## gdppc 0.076352 0.008212 9.297 3.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.49 on 10 degrees of freedom
## Multiple R-squared: 0.8963, Adjusted R-squared: 0.8859
## F-statistic: 86.44 on 1 and 10 DF, p-value: 3.086e-06
The basic takeaway maps to what Newhouse reports: higher GDP per capita means higher per capita medical care expenditures. This holds with or without Greece.
Question, for my own curiosity: is Greece that much of an outlier? It doesn't really look like it. It's an "outlier" in the sense that it scores low on both the DV and the IV, not in the sense of being far removed from what would be expected, given the data. If Greece is a problem observation, shouldn't the U.S. be a problem observation as well? Greece scores as low as the U.S. scores high on both the IV and DV. Let's plot the data (which you should always do for simple cases like this). Greece is pretty easy to identify here. But, again, it's very much in line with what would be expected.
Newhouse77 %>%
  ggplot(.,aes(gdppc, medexppc)) +
  geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
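If you want the countries labeled so Greece (and the U.S.) actually jump out, a minimal sketch with geom_text() will do it; nothing about the fit changes.

# Same scatterplot, with country labels nudged above the points
Newhouse77 %>%
  ggplot(.,aes(gdppc, medexppc)) +
  geom_point() + geom_smooth(method = "lm") +
  geom_text(aes(label = country), vjust = -.75, size = 3)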
So, to answer this question, what would the Cook’s D say?
Newhouse77 %>%
  mutate(cooksd = cooks.distance(M1))
## # A tibble: 13 x 6
## country year gdppc medsharegdp medexppc cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 1972 3769 5.8 219 0.0244
## 2 Austria 1972 2747 5.46 150 0.00275
## 3 Canada 1971 4317 7.02 303 0.0807
## 4 Finland 1972 2869 5.31 152 0.0133
## 5 France 1970 2851 6.4 182 0.0240
## 6 GFR 1968 2246 6.07 136 0.0541
## 7 Greece 1972 1374 2.51 34 0.0862
## 8 Italy 1972 2164 6.12 132 0.0759
## 9 Netherlands 1972 3437 6.78 233 0.0340
## 10 Norway 1972 3889 5.18 201 0.181
## 11 Sweden 1971 4431 7.38 327 0.252
## 12 United Kingdom 1972 2742 4.45 122 0.0908
## 13 United States 1972 5551 6.51 361 0.228
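If you'd rather not eyeball that column, sorting on Cook's D makes the ranking obvious. A quick sketch:

# Rank the observations by Cook's D, biggest influencers first
Newhouse77 %>%
  mutate(cooksd = cooks.distance(M1)) %>%
  arrange(desc(cooksd)) %>%
  select(country, cooksd)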
I guess Greece is kind of an outlier, but the biggest influencers here seem to be Sweden and the United States (with Norway picking up bronze, if you will). Let's kick each of them out, one at a time, and see what happens.
summary(lm(medexppc ~ gdppc, data=subset(Newhouse77, country != "Norway")))
##
## Call:
## lm(formula = medexppc ~ gdppc, data = subset(Newhouse77, country !=
## "Norway"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.196 -17.929 3.192 18.096 32.135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -63.679442 20.957463 -3.039 0.0125 *
## gdppc 0.080917 0.006177 13.100 1.28e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.63 on 10 degrees of freedom
## Multiple R-squared: 0.9449, Adjusted R-squared: 0.9394
## F-statistic: 171.6 on 1 and 10 DF, p-value: 1.275e-07
summary(lm(medexppc ~ gdppc, data=subset(Newhouse77, country != "United States")))
##
## Call:
## lm(formula = medexppc ~ gdppc, data = subset(Newhouse77, country !=
## "United States"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.61 -15.85 5.53 20.39 31.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72.287075 28.353316 -2.550 0.0289 *
## gdppc 0.083029 0.008876 9.354 2.92e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.18 on 10 degrees of freedom
## Multiple R-squared: 0.8974, Adjusted R-squared: 0.8872
## F-statistic: 87.51 on 1 and 10 DF, p-value: 2.92e-06
summary(lm(medexppc ~ gdppc, data=subset(Newhouse77, country != "Sweden")))
##
## Call:
## lm(formula = medexppc ~ gdppc, data = subset(Newhouse77, country !=
## "Sweden"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.042 -13.212 -4.105 20.482 30.754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52.574043 22.279198 -2.36 0.04 *
## gdppc 0.075242 0.006676 11.27 5.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.6 on 10 degrees of freedom
## Multiple R-squared: 0.927, Adjusted R-squared: 0.9197
## F-statistic: 127 on 1 and 10 DF, p-value: 5.259e-07
Nothing materially seems to change. So, how about this to assess the model's sensitivity: we'll do what's called a "leave-one-out" approach, or at least an informal version of it. We'll write a simple function that punts one observation out, repeats the regression, and collects what happens to the effect of GDP per capita on per capita medical expenditures when that observation is removed.
countries <- Newhouse77$country

loopcountries <- function(dat){
  output <- tibble()
  for(i in countries) {
    # drop country i, re-run the regression, and tidy the results
    dat %>%
      filter(country != i) -> dat1
    mod <- lm(medexppc ~ gdppc, data = dat1)
    broomed <- broom::tidy(mod) %>% mutate(omitted = i)
    output <- bind_rows(output, broomed)
  }
  return(output)
}
loopcountries(Newhouse77) %>%
bind_rows(., broom::tidy(M1) %>% mutate(omitted = "FULL MODEL")) %>%
mutate(lwr = estimate - abs(qnorm(.025))*std.error,
upr = estimate + abs(qnorm(.025))*std.error) %>%
# Let's get rid of the intercept here just to make this clearer
filter(term != "(Intercept)") %>%
ggplot(.,aes(as.factor(omitted), estimate, ymin=lwr, ymax=upr)) +
facet_wrap(~term) +
geom_pointrange() +
geom_hline(yintercept = 0, linetype="dashed") +
coord_flip()
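By the way, the for loop is fine; if you prefer the {purrr} way, this rough sketch should produce the same tidied output that loopcountries() collects.

# Same leave-one-out exercise with purrr::map_df() instead of a for loop
purrr::map_df(Newhouse77$country, function(i) {
  Newhouse77 %>%
    filter(country != i) %>%
    lm(medexppc ~ gdppc, data = .) %>%
    broom::tidy() %>%
    mutate(omitted = i)
})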
Seems like the U.S. is more of an influencer than Sweden or Greece, not that its removal materially changes much. Let's get back on track with some diagnostics of this new M1 (from above). This is the full data, including Greece.
First, a residuals-fitted plot.
plot(M1, which=1)
This is mostly flat, but you’re asking a lot to get variation from 13 units. Oddly enough, the plot is complaining about Norway (makes sense), Sweden (makes sense), and the United Kingdom (which is odd, given the Cook’s D).
Here's a Q-Q plot.
plot(M1, which=2)
The Q-Q plot has a jump in the middle there. Not too damning, but, again, asking a lot of 13 units.
Scale-location plot time:
plot(M1, which=3)
This scale-location plot has some weirdness happening. Again: 13 units.
Finally, the residuals-leverage plot.
plot(M1, which=5)
Oof, this really wants to call out the U.S. and Sweden as outliers, along with Norway. Well, fine. Let’s remove them and see what happens.
summary(lm(medexppc ~ gdppc, data=subset(Newhouse77, country != "United States" & country != "Sweden" & country != "Norway")))
##
## Call:
## lm(formula = medexppc ~ gdppc, data = subset(Newhouse77, country !=
## "United States" & country != "Sweden" & country != "Norway"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.168 -14.107 3.505 17.378 22.995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -71.311648 25.916105 -2.752 0.025 *
## gdppc 0.083326 0.008752 9.521 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.1 on 8 degrees of freedom
## Multiple R-squared: 0.9189, Adjusted R-squared: 0.9088
## F-statistic: 90.65 on 1 and 8 DF, p-value: 1.223e-05
I mean, sure, the effect shifts up with them removed, but the estimate that emerges is still within a standard error of the full model's estimate.
Finally, let's calculate the elasticities from Table 3 and, goddamn, it's been a hot minute since I've seen regression elasticities. I don't think you'll see these much in your travels; the discussion here is very much in line with economics and the income elasticity of demand. The implied elasticity at a given level of GDP per capita is just the GDP per capita coefficient, multiplied by that level of GDP per capita, divided by the predicted medical expenditure at that level. Alas, let's do it for max reproducibility.
tibble(gdppc = c(mean(Newhouse77$gdppc), 3416, 4000, 5000, 6000)) -> newdat
# i.e. I smell an error, since I know 3416 is not the actual mean of gdppc (it's about 3261).
# Small error, basically: the first elasticity for each model is correct, but the "mean" row is mislabeled.
newdat %>%
  mutate(m1coef = M1$coefficients["gdppc"],
         m2coef = M2$coefficients["gdppc"],
         pred1 = predict(M1, newdata = .),
         pred2 = predict(M2, newdata = .),
         el1 = m1coef*(gdppc)/pred1,
         el2 = m2coef*(gdppc)/pred2)
## # A tibble: 5 x 7
## gdppc m1coef m2coef pred1 pred2 el1 el2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3261. 0.0788 0.0764 196. 198. 1.31 1.26
## 2 3416 0.0788 0.0764 209. 210. 1.29 1.24
## 3 4000 0.0788 0.0764 255. 254. 1.24 1.20
## 4 5000 0.0788 0.0764 333. 331. 1.18 1.15
## 5 6000 0.0788 0.0764 412. 407. 1.15 1.13
An implied income (regression) elasticity above 1 suggests medical care is a luxury good rather than a necessity. I will, umm, not waste time on that particular point and its implications here.
Finally, here are the alternative estimations from Table 4. Of note: the I() wrapper in a regression formula is great for embedding on-the-fly transformations of the data (you don't need it for on-the-fly log transformations, though). Generally, I recommend just creating additional columns.
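For instance, here's a minimal sketch of the create-a-column alternative (inv_gdppc is a name I'm making up); it should reproduce the first Table 4 model below exactly.

# Same model as medsharegdp ~ I(1/gdppc), with the transformation made explicit as a column
Newhouse77 %>%
  mutate(inv_gdppc = 1/gdppc) %>%
  lm(medsharegdp ~ inv_gdppc, data = .) %>%
  summary()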
summary(M3 <- lm(medsharegdp ~ I(1/gdppc), Newhouse77))
##
## Call:
## lm(formula = medsharegdp ~ I(1/gdppc), data = Newhouse77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2086 -0.5323 -0.1928 0.6558 1.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.1585 0.6415 12.717 6.39e-08 ***
## I(1/gdppc) -6883.1614 1719.4624 -4.003 0.00208 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8464 on 11 degrees of freedom
## Multiple R-squared: 0.593, Adjusted R-squared: 0.556
## F-statistic: 16.02 on 1 and 11 DF, p-value: 0.002075
summary(M4 <- lm(medsharegdp ~ gdppc, Newhouse77))
##
## Call:
## lm(formula = medsharegdp ~ gdppc, data = Newhouse77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88684 -0.92375 0.06491 0.88324 1.14878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3978705 0.9005664 3.773 0.00308 **
## gdppc 0.0007271 0.0002623 2.772 0.01816 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.018 on 11 degrees of freedom
## Multiple R-squared: 0.4113, Adjusted R-squared: 0.3577
## F-statistic: 7.684 on 1 and 11 DF, p-value: 0.01816
summary(M5 <- lm(log(medsharegdp) ~ log(gdppc), Newhouse77))
##
## Call:
## lm(formula = log(medsharegdp) ~ log(gdppc), data = Newhouse77)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35799 -0.16611 0.03743 0.13048 0.28300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7032 1.2602 -2.145 0.05512 .
## log(gdppc) 0.5510 0.1568 3.515 0.00484 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1991 on 11 degrees of freedom
## Multiple R-squared: 0.529, Adjusted R-squared: 0.4862
## F-statistic: 12.36 on 1 and 11 DF, p-value: 0.004841
summary(M6 <- lm(medsharegdp ~ I(1/gdppc), data=subset(Newhouse77, country != "Greece")))
##
## Call:
## lm(formula = medsharegdp ~ I(1/gdppc), data = subset(Newhouse77,
## country != "Greece"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4032 -0.4741 0.2248 0.6392 0.9988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.2383 0.9097 7.957 1.23e-05 ***
## I(1/gdppc) -3797.8563 2785.4186 -1.363 0.203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8139 on 10 degrees of freedom
## Multiple R-squared: 0.1568, Adjusted R-squared: 0.07244
## F-statistic: 1.859 on 1 and 10 DF, p-value: 0.2026