Abstract
This is a lab script for POST 8000, a graduate-level quantitative methods for public policy class that I teach at Clemson University. It will not be the most sophisticated R-related write-up of mine—check my blog for those—but it should be useful for discussion around the associated R script for the week's 'lab' session. Some of what I write here is taken from my website, so be advised I might plagiarize myself here.
We’ll be using {stevedata}
for some data and {tidyverse}
for most things. I’m assuming you have yet to install {modelr}
, {fixest}
, and {lmtest}
. We’ll need those for this session.
packs <- c("lmtest", "fixest", "modelr")
new_pack <- packs[!(packs %in% installed.packages()[,"Package"])]
if(length(new_pack)) install.packages(new_pack)
Run that, and then let’s load them.
library(stevedata)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(fixest)
library(modelr)
options(warn=-1)
We’re going to tailor this lab script to what you should do when you violate one of a few core assumptions of OLS. We’ll focus on four here: linearity, normality, independence of errors, and constant error variances. In each case, we’ll talk about what the assumption means, what diagnostics you have at your disposal to see whether your data are violating the model assumptions, and what you can do about them.
One of the assumptions of the OLS model is that the outcome y is a linear function of the explanatory variable(s). Models that violate the linearity assumption amount to a specification error of the kind described by Gujarati. In a situation like this, the linear model is not appropriate to describe the relationship between y and the explanatory variable(s). This amounts to a regression model that will be biased.
First, let’s do this with the data/application used by Meuleman et al. (2019). In their case, they want to regress trust in the police in Belgium on an individual respondent’s age (in years), gender (dummy for women), education (in years of full-time education), income (four-categories), and an assessment of how successful the police are at preventing criminality (on an 11-point scale). Here’s the naive linear model.
summary(M1 <- lm(trstplc ~ agea + female + eduyrs + hincfel + plcpvcr, data=ESSBE5,
na.action=na.exclude))
##
## Call:
## lm(formula = trstplc ~ agea + female + eduyrs + hincfel + plcpvcr,
## data = ESSBE5, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1179 -1.0444 0.1995 1.2275 6.7873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.824357 0.287112 13.320 < 2e-16 ***
## agea -0.007021 0.002517 -2.789 0.00534 **
## female 0.003301 0.091945 0.036 0.97136
## eduyrs 0.052333 0.012660 4.134 3.75e-05 ***
## hincfel -0.267282 0.055594 -4.808 1.67e-06 ***
## plcpvcr 0.433734 0.024958 17.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.854 on 1642 degrees of freedom
## (56 observations deleted due to missingness)
## Multiple R-squared: 0.1791, Adjusted R-squared: 0.1766
## F-statistic: 71.63 on 5 and 1642 DF, p-value: < 2.2e-16
Pro tip: if you have NAs and you don't care to drop those observations yourself (because you're lazy and you'd rather lm() drop them for you), specify na.action=na.exclude in the formula. This will allow for some more convenience with helper functions.
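To make that convenience concrete, here is a minimal sketch with hypothetical toy data (none of these objects appear elsewhere in this script). The point is that na.exclude pads helper functions like resid() and fitted() with NAs so their output lines up with the rows of the original data, which is what makes the mutate(resid = resid(...)) calls later in this script work.
toy <- tibble(x = c(1, 2, NA, 4, 5),
              y = c(2, 4, 5, NA, 9))
toy_mod <- lm(y ~ x, data=toy, na.action=na.exclude)

# resid() comes back padded with NAs to match all five rows of `toy`
length(resid(toy_mod)) == nrow(toy)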
The authors describe an F-test for lack of fit. Truthfully, I’ve never liked this test. The intuition is straightforward but I’ve never seen an application of it that I liked in the R programming language. I think you can assess the linearity diagnostic graphically. You can do it with the fitted-residual plot.
plot(M1, which=1)
Concerns about the somewhat discrete nature of the DV aside, I do see a slight, upward bow there for the higher fitted values.
I also really like the authors' suggestion to compare a LOESS smoother to the linear smoother in a faceted scatter plot. Here's how you'd do that.
ESSBE5 %>%
# Select only the variables we need
select(trstplc:plcpvcr) %>%
# tidyr::gather from long to even longer.
# This creates a handy three column data frame where each value of each IV is matched with the DV for all observations
gather(var, value, 2:ncol(.)) %>%
# Create a ggplot where the x axis is the value of a given x variable.
ggplot(.,aes(value, trstplc)) +
# Create your facet now since the var variable is the particular x variable.
# Make sure to set scale="free_x" because these x variables are all on different scales
facet_wrap(~var, scale="free_x") +
# scatterplot
geom_point() +
# linear smoother
geom_smooth(method="lm") +
# loess smoother, with different color
geom_smooth(method="loess", color="black")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Here’s how you should interpret a plot like this: how much do the LOESS smoother and the LM smoother differ? My takeaway: not a whole lot. Be mindful the LOESS smoother will want to travel to the outlier. The cool thing about this approach is that a scatter plot like this will point you to some odd observations that you may want to punt out. For example, this is the second time I’ve seen a European respondent say they have 40+ years of full-time education. Which, c’mon, fam.
If you see any type of “U” shape, this implies a non-linear effect of x on y. The solution here is some kind of polynomial. The analog here with fake/hypothetical data would be something like lm(y ~ x + I(x^2), data=Data). More complicated shapes imply the need for some higher-order polynomial. My own hot #take here is you’re kind of losing the plot a little bit with the higher-order polynomials. In your own way, you’re fitting the model to the data (which is fine even as political scientists don’t like that practice much) and you’re still putting yourself in a position where you have to explain the results in a simple and coherent way.
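To make that concrete, here's a minimal sketch on simulated data (my own toy example, not the ESS data) where y is a deliberate quadratic function of x. The naive linear model misses the curvature; the I(x^2) term picks it up.
set.seed(8675309)
tibble(x = rnorm(500, 0, 2),
       e = rnorm(500, 0, 1),
       y = 1 + .5*x - .35*(x^2) + e) -> FakeU

summary(lm(y ~ x, data=FakeU))          # misses the curvature
summary(lm(y ~ x + I(x^2), data=FakeU)) # the polynomial picks it up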
Depending on your particular application, you may also want to identify the particular x variable that is offending a linear relationship, break it into some kind of categorical variable, and treat it as a fixed effect. Here is a simple example (and one that could also be modeled by a polynomial). There is a well-known “U” effect of age on happiness. Age may not be a “treatment”, but there are still a lot of studies that suggest the 20-year-olds are relatively happy people and the retirees are relatively happy. It’s the people in their 30s and 40s who, like me, are miserable, depressed bastards. If you look at that “shape of happiness” chart, maybe you can do something like create an age bracket for the 16-35 group, another for the 36-60 group, and another for the 61-and-above group. Treat that as a fixed effect, possibly with the 36-60 people as the baseline.
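As a rough sketch of what that might look like in code, assume a hypothetical data frame Happy with an age variable and a happiness variable (neither is in this week's data; the names are just for illustration).
# Hypothetical data frame `Happy` with columns `age` and `happiness`
Happy %>%
  mutate(agegroup = cut(age, breaks=c(-Inf, 35, 60, Inf),
                        labels=c("16-35", "36-60", "61+")),
         # make the miserable middle-aged group the baseline
         agegroup = relevel(agegroup, ref="36-60")) -> Happy

summary(lm(happiness ~ agegroup, data=Happy))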
There’s a mixed effects model comment I want to add here, but that really is more of an advanced topic.
OLS assumes the errors are normally distributed. This is often conflated with an assumption that the outcome variable is normally distributed. That’s not quite what it is. It does imply that the conditional distribution of the dependent variable is normal but that is not equivalent to assuming the marginal distribution of the dependent variable is normal. At the end of the day, the assumption of normality is more about the errors than the dependent variable even as the assumption about the former does strongly imply an assumption about the latter.
Violating the assumption of a normal distribution of the errors is not as severe as a violation of some of the other assumptions. The normality assumption is not necessary for point estimates to be unbiased. In one prominent textbook on statistical methods, Gelman and Hill (2007, p. 46) say the normality assumption is not important at all because it has no strong implication for the regression line. I think this follows because Gelman and Hill (2007)—later Gelman, Hill, and Vehtari (2020)—are nowhere near as interested in null hypothesis testing as your typical social scientist likely is. No matter, violating the assumption of a normal distribution of errors has some implication for drawing a line that reasonably approximates individual data points (if not the line itself, per se). Thus, you may want to check it, certainly if you have a small data set.
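If you want to see the Gelman and Hill point in action, here is a minimal simulation sketch (my own illustration, not from the readings): the errors are badly right-skewed, but the slope estimate still lands close to its true value of .5.
set.seed(8675309)
tibble(x = rnorm(1000),
       e = rexp(1000, rate = 1) - 1,  # mean-zero, but heavily right-skewed
       y = 1 + .5*x + e) -> SkewErr

summary(lm(y ~ x, data=SkewErr))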
There are any number of diagnostics here for testing whether the residuals are normally distributed. Basically: gank your model’s residuals and check if they are consistent with an assumption that they are drawn from a normal distribution. You can do this with a Shapiro-Wilk test or a Kolmogorov-Smirnov test.
shapiro.test(resid(M1))
##
## Shapiro-Wilk normality test
##
## data: resid(M1)
## W = 0.97739, p-value = 1.905e-15
ks.test(resid(M1), y=pnorm)
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(M1)
## D = 0.15193, p-value < 2.2e-16
## alternative hypothesis: two-sided
In both applications, the assumption of a normal distribution of errors is violated. One caveat here might be that normality tests are sensitive to deviations from normality when sample sizes are large. They’re also sensitive to really any kind of discreteness in the dependent variable and any kind of heaping at extremes under those conditions.
You could also use our ol’ friend the Q-Q plot to show the same thing.
plot(M1, which=2)
There’s added information from the Q-Q plot since it’s saying that the negative residuals are more extreme than expected. Alas, compare that plot with an actual density plot of the residuals juxtaposed with a normal density function with a mean of 0 and a standard deviation equal to the standard deviation of the residuals.
sd_resid <- sd(as.numeric(resid(M1)), na.rm=T)

ESSBE5 %>%
mutate(resid = resid(M1)) %>%
ggplot(.,aes(resid)) +
geom_density() +
stat_function(fun = dnorm,
args = list(mean = 0, sd = sd_resid),
linetype="dashed", size=1.5)
Sure, it’s not normally distributed, but I also don’t think it wrong to say the normality assumption is approximated well even in a case where the DV has just 10 values.
This one is your call. In most applications where this is a severe issue, it’s because you’re trying to impose OLS on a binary DV or a five-item Likert. I will yell at you and throw things at you for doing that, but not necessarily for reasons related to this violation of OLS. There are rules of thumb for when you can take a technically ordinal/discrete DV and treat it as something you can model through OLS. In this case, I think this is fine and I agree with the authors doing it. In the case of a five-item Likert or a binary DV, it’s not fine and there’s better/more information to be gathered from a more appropriate model.
Further, it’s tough to divorce the normality assumption from the homoskedasticity assumption even if they are formally separate. Both travel together in my applications even as they are clearly separate assumptions of OLS. Do you have outliers in your dependent variable? If so, consider some kind of transformation. Do you have cornball outliers in your independent variables (e.g. that one Belgian who said s/he had 40+ years of full-time education)? If so, consider some kind of transformation or even punting that observation out of the model.
This one is a biggie. OLS assumes that observations are sampled independently and that any pair of errors are unrelated to each other. In formal terms, this assumption seems kind of Greek to students. In practice, think of it this way. A lot of the theoretical intuition behind OLS is assuming something akin to a simple random sample of the population. You probably learned central limit theorem by this point and how you can simulate that. However, you may have a data set that does not look like this.
In other words, this assumption is going to be violated like mad in any design that has a time-series, spatial, or multilevel component. For students new to this stuff, you should be able to anticipate ahead of time when your errors are no longer independent of each other.
The implications of non-independent errors—aka “serial correlation” or “autocorrelation”—are generally more about the variance of the estimator than the estimator itself. That said, there is reason to believe the standard errors are wrong and this can have important implications for statistical inference you’d like to do.
Honestly? Once you know under what conditions that autocorrelation is going to happen, you can probably skip this and go right to an appropriate model designed to deal with the nature of your data problem. The most common tests are the Breusch-Godfrey test and the Durbin-Watson test. Of the two, practitioners I’ve read seem to favor the former over the latter. There is also an issue as to what kind of higher-order autocorrelation you’re trying to detect. Both come in the {lmtest}
package.
bgtest(M1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: M1
## LM test = 2.1561, df = 1, p-value = 0.142
dwtest(M1)
##
## Durbin-Watson test
##
## data: M1
## DW = 2.0721, p-value = 0.9259
## alternative hypothesis: true autocorrelation is greater than 0
In our case, since the Belgium data are a simple sample of a population, there doesn’t seem to be any concern for serial correlation. P-values above a conventional threshold (e.g. .05) mean the model is consistent with a null hypothesis of no autocorrelation.
If you’re curious what it would look like to flunk one of these tests, let’s use some fake data I created for precisely this purpose.
fakeTSD
## # A tibble: 100 x 5
## year y x1 x2 e
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1920 501. 3.01 1 -0.319
## 2 1921 504. 6.44 1 0.915
## 3 1922 503. 3.77 0 1.42
## 4 1923 502. 9.06 0 -1.10
## 5 1924 503. 7.13 1 -0.605
## 6 1925 503. 6.97 1 -1.23
## 7 1926 501. 5.05 0 -2.09
## 8 1927 502. 6.35 0 -0.934
## 9 1928 504. 6.14 0 0.296
## 10 1929 504. 6.81 0 -0.398
## # … with 90 more rows
?fakeTSD
These data were created to mimic a time-series type phenomenon where the outcome y is a linear function of 20 + (.25*year) + .25*x1 + 1*x2 + e. x1 is a continuous variable and x2 is a dummy variable. Of note: year determines the data-generating process, but “year” isn’t really a variable of interest. I.e. it’s a time-series.
Here is a plot of y by increasing year.
fakeTSD %>%
ggplot(.,aes(year, y)) + geom_line() +
scale_x_continuous(breaks = seq(1920, 2020, by = 10))
Looks like a time-series, because it is a time-series.
Let’s see if we can go-go-gadget this through OLS.
summary(M2 <- lm(y ~ x1 + x2, data=fakeTSD))
##
## Call:
## lm(formula = y ~ x1 + x2, data = fakeTSD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3828 -6.2769 0.0628 6.6317 13.0777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.134e+02 2.368e+00 216.764 <2e-16 ***
## x1 2.871e-03 4.061e-01 0.007 0.994
## x2 1.407e+00 1.507e+00 0.934 0.353
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.476 on 97 degrees of freedom
## Multiple R-squared: 0.00898, Adjusted R-squared: -0.01145
## F-statistic: 0.4395 on 2 and 97 DF, p-value: 0.6456
Nooope. Let’s get our L from the Breusch-Godfrey test and the Durbin-Watson test.
bgtest(M2)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: M2
## LM test = 92.359, df = 1, p-value < 2.2e-16
dwtest(M2)
##
## Durbin-Watson test
##
## data: M2
## DW = 0.05506, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
Oh yeah, we earned that L.
You need a new model. In our fake time-series data, you can still reasonably approximate the true effects by including a lagged dependent variable as a predictor or by calculating y - lag(y) and regressing that difference on x1 and x2. No matter, the diagnostics for autocorrelation will still complain and practitioners are generally not sanguine about lagged DVs for a variety of reasons.
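Here is a minimal sketch of both workarounds on the fakeTSD data. The lag_y and diff_y variables and the M2_ldv/M2_fd model names are just mine for illustration; nothing here comes from the assigned reading.
fakeTSD %>%
  arrange(year) %>%
  mutate(lag_y = lag(y),        # lagged DV
         diff_y = y - lag(y)) -> fakeTSD

# Lagged-DV approach
summary(M2_ldv <- lm(y ~ lag_y + x1 + x2, data=fakeTSD))
# First-difference approach
summary(M2_fd <- lm(diff_y ~ x1 + x2, data=fakeTSD))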
To reiterate the point: you need a new model. This class doesn’t have the opportunity to teach about some advanced topics, like time-series modeling, mixed effects models, and the like. No matter, you’ll want a model that explicitly accounts for what would be the source of autocorrelation. If I could teach an entire class about the cool shit you can do with a mixed effects model, I would.
OLS assumes that the variance of the errors is the same regardless of the particular value of x. This is often called “homoskedasticity”, mostly in juxtaposition to the violation of this assumption: “heteroskedasticity.” The change of prefix should make clear what is implied here. Heteroskedasticity often means the errors are a function of something else in the model (e.g. one of the x variables).
The implications for violating the assumption of homoskedasticity are more nuisance than real problem. Unequal/non-constant variance does not affect the XB part of the regression model, but it does have implications for predicting individual data points. Here, the implication is similar to the implication of violating the assumption of normally distributed errors. Indeed, those two typically travel together in my experiences.
Our ol’ pal, the fitted-residual plot, should be illustrative for most things. First, let’s show what it looks like in an ideal case.
set.seed(8675309)
tibble(
x = rnorm(1000, 5, 2),
e = rnorm(1000, 0, 1),
y = 1 + .25*x + e
) -> C # for constant
summary(M3 <- lm(y ~ x, data=C))
##
## Call:
## lm(formula = y ~ x, data = C)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8524 -0.7030 -0.0324 0.6928 3.5313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95215 0.08499 11.20 <2e-16 ***
## x 0.26419 0.01588 16.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 998 degrees of freedom
## Multiple R-squared: 0.2171, Adjusted R-squared: 0.2163
## F-statistic: 276.8 on 1 and 998 DF, p-value: < 2.2e-16
^ The model is well identified.
C %>%
mutate(resid = resid(M3),
fitted = fitted(M3)) %>%
ggplot(.,aes(fitted, resid)) +
geom_point()
^ The fitted-residual plot looks fine.
Now, let’s deliberately violate this assumption. Here, we’ll go with the most obvious manifestation of heteroskedasticity: the “cone of shame.” Notice that the errors are an increasing function of the x variable.
tibble(
x = rnorm(1000, 5, 2),
e = rnorm(1000, 0, 1 + .25*x), # uh-oh...
y = 1 + .25*x + e
) -> H # for heteroskedasticity
summary(M4 <- lm(y ~ x, data=H))
##
## Call:
## lm(formula = y ~ x, data = H)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4465 -1.4774 -0.0426 1.4169 8.8592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.96810 0.18824 5.143 3.26e-07 ***
## x 0.24828 0.03499 7.096 2.44e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.276 on 998 degrees of freedom
## Multiple R-squared: 0.04803, Adjusted R-squared: 0.04708
## F-statistic: 50.35 on 1 and 998 DF, p-value: 2.435e-12
The model still looks well identified, but notice the standard errors went up.
H %>%
mutate(resid = resid(M4),
fitted = fitted(M4)) %>%
ggplot(.,aes(fitted, resid)) +
geom_point()
Yeah, I see the cone of shame here. This is a pretty tell-tale case of heteroskedasticity.
There is a formal test for this, though: the Breusch-Pagan test, which comes as bptest()
in the {lmtest}
package. The null hypothesis in the test is homoskedasticity. If you can reject that (via low-enough p-value), you have evidence of heteroskedasticity. Observe:
bptest(M3)
##
## studentized Breusch-Pagan test
##
## data: M3
## BP = 1.1241, df = 1, p-value = 0.289
bptest(M4)
##
## studentized Breusch-Pagan test
##
## data: M4
## BP = 76.113, df = 1, p-value < 2.2e-16
For giggles, observe that the heteroskedastic data also imply a non-normal distribution of the errors, at least by one test. In this case, the Shapiro-Wilk test doesn’t see a problem but the Kolmogorov-Smirnov test does. To be clear, they are different assumptions about OLS. But, in my experience, if I have one, I have the other too.
shapiro.test(resid(M3))
##
## Shapiro-Wilk normality test
##
## data: resid(M3)
## W = 0.99805, p-value = 0.3037
shapiro.test(resid(M4))
##
## Shapiro-Wilk normality test
##
## data: resid(M4)
## W = 0.99834, p-value = 0.4541
ks.test(resid(M3), y=pnorm)
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(M3)
## D = 0.026671, p-value = 0.4754
## alternative hypothesis: two-sided
ks.test(resid(M4), y=pnorm)
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(M4)
## D = 0.18307, p-value < 2.2e-16
## alternative hypothesis: two-sided
There is no shortage of fixes for heteroskedasticity. First, I want to bring in a new data set that I love using for this purpose because it’s super illustrative of what heteroskedasticity will look like in a simple case. af_crime93
comes via Table 9.1 of the 3rd edition of Agresti and Finlay’s Statistical Methods for the Social Sciences. The data come from the Statistical Abstract of the United States and most variables were measured in 1993.
I have this in the proto version of my {stevedata}
package right now, but you can load it off the course website.
af_crime93 <- readRDS(url("http://post8000.svmiller.com/lab-scripts/af_crime93.rds"))
af_crime93
## # A tibble: 51 x 8
## state violent murder poverty single metro white highschool
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AK 761 9 9.10 14.3 41.8 75.2 86.6
## 2 AL 780 11.6 17.4 11.5 67.4 73.5 66.9
## 3 AR 593 10.2 20 10.7 44.7 82.9 66.3
## 4 AZ 715 8.60 15.4 12.1 84.7 88.6 78.7
## 5 CA 1078 13.1 18.2 12.5 96.7 79.3 76.2
## 6 CO 567 5.80 9.90 12.1 81.8 92.5 84.4
## 7 CT 456 6.30 8.5 10.1 95.7 89 79.2
## 8 DE 686 5 10.2 11.4 82.7 79.4 77.5
## 9 FL 1206 8.90 17.8 10.6 93 83.5 74.4
## 10 GA 723 11.4 13.5 13 67.7 70.8 70.9
## # … with 41 more rows
In this application, Agresti and Finlay try to model the state’s violent crime rate (per 100,000 people in the population) as a function of 1) the percent of the state with income below the poverty level (poverty
), 2) the percent of families in the state headed by a single parent (single
), 3) the percent of the population living in metropolitan areas (metro
), 4) the percentage of the state that is white (white
), and 5) the percent of the state that graduated from high school (highschool
).
Let’s run a naive model and see what happens.
summary(M5 <- lm(violent ~ poverty + single + metro + white + highschool, data=af_crime93))
##
## Call:
## lm(formula = violent ~ poverty + single + metro + white + highschool,
## data = af_crime93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -533.27 -87.19 -9.82 110.46 397.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1795.904 668.788 -2.685 0.0101 *
## poverty 26.244 11.083 2.368 0.0222 *
## single 109.467 20.360 5.377 2.6e-06 ***
## metro 7.609 1.295 5.874 4.8e-07 ***
## white -4.483 2.779 -1.613 0.1137
## highschool 8.646 7.826 1.105 0.2751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 180.2 on 45 degrees of freedom
## Multiple R-squared: 0.8498, Adjusted R-squared: 0.8332
## F-statistic: 50.93 on 5 and 45 DF, p-value: < 2.2e-16
It looks like the poverty, single-parent family, and metro variables are statistically significant. The white variable looks like it’s almost significant at the .10 level too.
Cool, let’s do the Breusch-Pagan test.
bptest(M5)
##
## studentized Breusch-Pagan test
##
## data: M5
## BP = 15.397, df = 5, p-value = 0.008793
…and we flunked it. All right, let’s take some inventory as to what happened here with the fitted-residual plot.
af_crime93 %>%
mutate(fitted = fitted(M5),
resid = resid(M5)) %>%
ggplot(.,aes(fitted, resid)) + geom_point()
Oh wow. Fitted-residual plots shouldn’t look like this. The post on my website offers a bit more information as to the offending cases, but let’s go identify what seem to be the biggest offenders here.
af_crime93 %>%
mutate(resid = resid(M5)) %>%
filter(abs(resid) >= 300)
## # A tibble: 4 x 9
## state violent murder poverty single metro white highschool resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FL 1206 8.90 17.8 10.6 93 83.5 74.4 398.
## 2 LA 1062 20.3 26.4 14.9 75 66.7 68.3 -328.
## 3 MS 434 13.5 24.7 14.7 30.7 63.3 64.3 -533.
## 4 DC 2922 78.5 26.4 22.1 100 31.8 73.1 355.
So, what can we do here? Well, for one, I’m a little convinced just from eyeballing these data that DC is a weird state (i.e. not technically a state, but should be a state). In a lot of ways, DC looks nothing like the other 50 because of the nature of its territorial endowment and population. It’s a city-state.
First, let me verify that intuition.
af_crime93 %>%
select(-murder) %>%
gather(var, val, 3:ncol(.)) %>%
ggplot(.,aes(val, violent)) +
geom_point() + facet_wrap(~var, scale="free_x")
af_crime93 %>%
filter(state == "DC")
## # A tibble: 1 x 8
## state violent murder poverty single metro white highschool
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 DC 2922 78.5 26.4 22.1 100 31.8 73.1
Yeah, that’s DC.
One thing I can do here is just punt out DC and see what happens.
summary(M6 <- lm(violent ~ poverty + single + metro + white + highschool, data=subset(af_crime93, state != "DC")))
##
## Call:
## lm(formula = violent ~ poverty + single + metro + white + highschool,
## data = subset(af_crime93, state != "DC"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -393.47 -117.63 -8.91 122.64 404.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1173.073 632.397 -1.855 0.070309 .
## poverty 21.278 10.123 2.102 0.041321 *
## single 80.893 20.290 3.987 0.000249 ***
## metro 7.542 1.170 6.444 7.5e-08 ***
## white -2.486 2.581 -0.963 0.340774
## highschool 3.292 7.250 0.454 0.652057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162.8 on 44 degrees of freedom
## Multiple R-squared: 0.7282, Adjusted R-squared: 0.6973
## F-statistic: 23.58 on 5 and 44 DF, p-value: 1.919e-11
One takeaway I’m seeing already (and wrote about on my blog a bit): the discernibility of that white coefficient is almost entirely a function of DC. DC is a majority non-white (should-be-a) state and has a high violent crime rate in these data. There really is no discernible effect of the white coefficient without it.
But, let’s see what the Breusch-Pagan test says.
bptest(M6)
##
## studentized Breusch-Pagan test
##
## data: M6
## BP = 5.9924, df = 5, p-value = 0.307
…and we passed. DC alone will flunk a Breusch-Pagan test in this application. If you have a simple case like this where it’s just one observation that is messing with an overall picture, you can punt it out.
The textbook solution is a weighted least squares (WLS) estimation. The process here is a little convoluted and there is a simple wrapper for it somewhere (I’m sure), but here’s what the process looks like. 1) Run the model (which we already did). 2) Gank the residuals and fitted values. 3) Regress the absolute value of the residuals on the fitted values of the original model. 4) Extract those fitted values. 5) Square them and 6) divide 1 by those values. 7) Finally, apply those as weights in the linear model once more for a re-estimation. The mutate() call below does steps 2 through 6 in one line; the weights argument in lm() handles step 7.
af_crime93 %>%
mutate(wts = 1/fitted(lm(abs(residuals(M5))~fitted(M5)))^2) -> af_crime93
summary(M7 <- lm(violent ~ poverty + single + metro + white + highschool, data=af_crime93, weights=wts))
##
## Call:
## lm(formula = violent ~ poverty + single + metro + white + highschool,
## data = af_crime93, weights = wts)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.93201 -0.89660 -0.07732 0.85497 2.70104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1403.562 628.576 -2.233 0.0306 *
## poverty 25.593 10.252 2.496 0.0163 *
## single 88.909 18.979 4.685 2.61e-05 ***
## metro 6.663 1.117 5.964 3.53e-07 ***
## white -3.548 2.472 -1.435 0.1582
## highschool 6.400 7.052 0.907 0.3690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.268 on 45 degrees of freedom
## Multiple R-squared: 0.7827, Adjusted R-squared: 0.7585
## F-statistic: 32.42 on 5 and 45 DF, p-value: 7.452e-14
We can compare the unweighted OLS with the WLS estimates.
broom::tidy(M5) %>% mutate(cat = "OLS") %>%
bind_rows(., broom::tidy(M7) %>% mutate(cat = "WLS")) %>%
arrange(term)
## # A tibble: 12 x 6
## term estimate std.error statistic p.value cat
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) -1796. 669. -2.69 0.0101 OLS
## 2 (Intercept) -1404. 629. -2.23 0.0306 WLS
## 3 highschool 8.65 7.83 1.10 0.275 OLS
## 4 highschool 6.40 7.05 0.907 0.369 WLS
## 5 metro 7.61 1.30 5.87 0.000000480 OLS
## 6 metro 6.66 1.12 5.96 0.000000353 WLS
## 7 poverty 26.2 11.1 2.37 0.0222 OLS
## 8 poverty 25.6 10.3 2.50 0.0163 WLS
## 9 single 109. 20.4 5.38 0.00000260 OLS
## 10 single 88.9 19.0 4.68 0.0000261 WLS
## 11 white -4.48 2.78 -1.61 0.114 OLS
## 12 white -3.55 2.47 -1.43 0.158 WLS
You could also do some on-the-fly heteroskedasticity corrections to your standard errors. There are a lot to choose from here, and some are more tedious than others. I’ll use the {fixest} package here because it’s the most convenient. You may also want to look into the {sandwich} package; a sketch of that route follows the comparison below.
summary(M8 <- feols(violent ~ poverty + single + metro + white + highschool, data=af_crime93, se = "hetero"))
## OLS estimation, Dep. Var.: violent
## Observations: 51
## Standard-errors: Heteroskedasticity-robust
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1795.9000 751.0300 -2.3913 0.021033 *
## poverty 26.2440 11.4620 2.2897 0.026781 *
## single 109.4700 24.4210 4.4824 0.000050 ***
## metro 7.6088 1.6439 4.6285 0.000031 ***
## white -4.4829 3.4354 -1.3049 0.198554
## highschool 8.6464 8.0841 1.0696 0.290518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 169.2 Adj. R2: 0.833152
Let’s compare these heteroskedasticity corrected standard errors with the naive OLS standard errors.
broom::tidy(M5) %>% mutate(cat = "OLS") %>%
bind_rows(., broom::tidy(M8) %>% mutate(cat = "HCSE")) %>%
arrange(term)
## # A tibble: 12 x 6
## term estimate std.error statistic p.value cat
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) -1796. 669. -2.69 0.0101 OLS
## 2 (Intercept) -1796. 751. -2.39 0.0210 HCSE
## 3 highschool 8.65 7.83 1.10 0.275 OLS
## 4 highschool 8.65 8.08 1.07 0.291 HCSE
## 5 metro 7.61 1.30 5.87 0.000000480 OLS
## 6 metro 7.61 1.64 4.63 0.0000313 HCSE
## 7 poverty 26.2 11.1 2.37 0.0222 OLS
## 8 poverty 26.2 11.5 2.29 0.0268 HCSE
## 9 single 109. 20.4 5.38 0.00000260 OLS
## 10 single 109. 24.4 4.48 0.0000504 HCSE
## 11 white -4.48 2.78 -1.61 0.114 OLS
## 12 white -4.48 3.44 -1.30 0.199 HCSE
The biggest inferential implications are for the white
coefficient, but do note the t-statistics for single
and metro
dropped considerably as well.
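As an aside, if you'd rather stay with the lm() object than re-estimate with feols(), a roughly equivalent route runs through the {sandwich} package and coeftest() from {lmtest}. A minimal sketch, assuming you have {sandwich} installed ("HC1" is just one of several available estimators):
library(sandwich)
# Heteroskedasticity-consistent standard errors for the original OLS model
coeftest(M5, vcov = vcovHC(M5, type = "HC1"))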
You could also go back to some basics and acknowledge that outliers (i.e. DC) and right-skewed data are responsible for flunking the assumption of homoskedastic error variances. We could transform the data to impose some kind of normality here. First, let’s check if we have any 0s here.
af_crime93 %>% summary
## state violent murder poverty
## Length:51 Min. : 82.0 Min. : 1.600 Min. : 8.00
## Class :character 1st Qu.: 326.5 1st Qu.: 3.900 1st Qu.:10.70
## Mode :character Median : 515.0 Median : 6.800 Median :13.10
## Mean : 612.8 Mean : 8.727 Mean :14.26
## 3rd Qu.: 773.0 3rd Qu.:10.350 3rd Qu.:17.40
## Max. :2922.0 Max. :78.500 Max. :26.40
## single metro white highschool
## Min. : 8.40 Min. : 24.00 Min. :31.80 Min. :64.30
## 1st Qu.:10.05 1st Qu.: 49.55 1st Qu.:79.35 1st Qu.:73.50
## Median :10.90 Median : 69.80 Median :87.60 Median :76.70
## Mean :11.33 Mean : 67.39 Mean :84.11 Mean :76.22
## 3rd Qu.:12.05 3rd Qu.: 83.95 3rd Qu.:92.60 3rd Qu.:80.10
## Max. :22.10 Max. :100.00 Max. :98.50 Max. :86.60
## wts
## Min. :9.478e-06
## 1st Qu.:4.488e-05
## Median :5.726e-05
## Mean :6.952e-05
## 3rd Qu.:7.713e-05
## Max. :1.988e-04
We don’t. So, let’s log-transform the DV and regress that on the x variables.
summary(M9 <- lm(log(violent) ~ poverty + single + metro + white + highschool, data=af_crime93))
##
## Call:
## lm(formula = log(violent) ~ poverty + single + metro + white +
## highschool, data = af_crime93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.87452 -0.21997 0.00369 0.29477 0.70716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.862626 1.391604 2.776 0.0080 **
## poverty 0.029121 0.023062 1.263 0.2132
## single 0.126712 0.042365 2.991 0.0045 **
## metro 0.017229 0.002695 6.393 8.14e-08 ***
## white -0.001451 0.005783 -0.251 0.8030
## highschool -0.007196 0.016284 -0.442 0.6607
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3749 on 45 degrees of freedom
## Multiple R-squared: 0.7336, Adjusted R-squared: 0.704
## F-statistic: 24.79 on 5 and 45 DF, p-value: 6.633e-12
bptest(M9)
##
## studentized Breusch-Pagan test
##
## data: M9
## BP = 9.221, df = 5, p-value = 0.1006
And we passed. Barely at the .10 level, but I’ll count that. Here, log-transforming the DV helped considerably. The log-transformed DV does have the effect of suggesting there is no poverty effect.
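One small reminder if you go the log-DV route: the coefficients are now in log units of the violent crime rate. A quick sketch of converting, say, the single coefficient in M9 into an approximate percent change:
# Approximate percent change in the violent crime rate for a one-unit
# increase in `single`, given the logged DV in M9
100*(exp(coef(M9)["single"]) - 1)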
Finally, we can do what I think is one of the greatest parlor tricks in statistics: bootstrapping. I’m going to have to punt the theory and exact principles of bootstrapping to classic texts, but the intuition here is, in the absence of a true “population”, to treat the sample as a population of sorts and sample, with replacement, from it. There’s a lot more to say about this than I have the space for here. If you have some complicated data structures, you’ll need to adjust your bootstrapping to account for that.
The {modelr}
package will make the bootstrapping simpler for us. First, let’s sample, with replacement, from the af_crime93
data 1,000 times. The data we have are small, so we can’t get too crazy. Efron, I think, recommends 250-500 resamples, but 1,000 is a nice round number that is often a target in this kind of framework.
set.seed(8675309) # Jenny, I got your number...
af_crime93 %>%
bootstrap(1000) -> bootCrime
Notice we have 1,000 resamples of the original data.
bootCrime
## # A tibble: 1,000 x 2
## strap .id
## <list> <chr>
## 1 <resample> 0001
## 2 <resample> 0002
## 3 <resample> 0003
## 4 <resample> 0004
## 5 <resample> 0005
## 6 <resample> 0006
## 7 <resample> 0007
## 8 <resample> 0008
## 9 <resample> 0009
## 10 <resample> 0010
## # … with 990 more rows
Here’s the first one. Interestingly, Colorado appears in this five times. That seems like it’s oversampling Colorado, which it would be if this were our only resample. However, there will be cases where Colorado doesn’t appear at all. That’s the idea here.
bootCrime %>%
slice(1) %>% # grab the first row in this special tbl
pull(strap) %>% # focus on the resample, momentarily a full list
as.data.frame() %>% # cough up the full data
as_tibble() %>%
arrange(state) %>% # arrange by state, for ease of reading
select(-murder, -wts)
## # A tibble: 51 x 7
## state violent poverty single metro white highschool
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL 780 17.4 11.5 67.4 73.5 66.9
## 2 AR 593 20 10.7 44.7 82.9 66.3
## 3 CA 1078 18.2 12.5 96.7 79.3 76.2
## 4 CO 567 9.90 12.1 81.8 92.5 84.4
## 5 CO 567 9.90 12.1 81.8 92.5 84.4
## 6 CO 567 9.90 12.1 81.8 92.5 84.4
## 7 CO 567 9.90 12.1 81.8 92.5 84.4
## 8 CO 567 9.90 12.1 81.8 92.5 84.4
## 9 CT 456 8.5 10.1 95.7 89 79.2
## 10 DE 686 10.2 11.4 82.7 79.4 77.5
## # … with 41 more rows
Now, let’s use {purrr}
magic and run our regression on each of these resamples. Then, we’re going to tidy them up, pull the results, and create a nice tidy data frame of those.
bootCrime %>%
mutate(lm = map(strap, ~lm(violent ~ poverty + single + metro + white + highschool,
data = .)),
tidy = map(lm, broom::tidy)) %>%
pull(tidy) %>%
map2_df(., # map to return a data frame
seq(1, 1000), # make sure to get this seq right. We did this 1000 times.
~mutate(.x, resample = .y)) -> tidybootCrime
Now, in a case like this, your bootstrapped regression parameter (i.e. coefficient or intercept) is the mean of the estimate and your bootstrapped standard error is the standard deviation of the estimate. That part may not be obvious. It’s not the mean of the standard errors, it’s the standard deviation of the coefficient.
tidybootCrime %>%
group_by(term) %>%
summarize(boot_est = mean(estimate),
boot_se = sd(estimate)) -> bootsums
bootsums
## # A tibble: 6 x 3
## term boot_est boot_se
## * <chr> <dbl> <dbl>
## 1 (Intercept) -1377. 798.
## 2 highschool 8.29 8.93
## 3 metro 7.27 1.59
## 4 poverty 25.3 11.7
## 5 single 93.5 34.3
## 6 white -6.54 5.39
You can compare the results of this bootstrapped summary to your “main” model. FWIW, you can bootstrap the coefficient as well but most practical interest is in what bootstrapping does in comparison to the standard error of your OLS model. So, it might make more sense to just compare the standard errors.
broom::tidy(M5) %>%
mutate(cat = "OLS") %>%
bind_rows(., bootsums %>% rename(estimate = boot_est, `std.error` = boot_se) %>% mutate(cat = "Bootstrapped")) %>%
arrange(term) %>%
mutate(statistic = ifelse(is.na(statistic), estimate/std.error, statistic)) %>%
select(-p.value)
## # A tibble: 12 x 5
## term estimate std.error statistic cat
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) -1796. 669. -2.69 OLS
## 2 (Intercept) -1377. 798. -1.72 Bootstrapped
## 3 highschool 8.65 7.83 1.10 OLS
## 4 highschool 8.29 8.93 0.929 Bootstrapped
## 5 metro 7.61 1.30 5.87 OLS
## 6 metro 7.27 1.59 4.58 Bootstrapped
## 7 poverty 26.2 11.1 2.37 OLS
## 8 poverty 25.3 11.7 2.17 Bootstrapped
## 9 single 109. 20.4 5.38 OLS
## 10 single 93.5 34.3 2.72 Bootstrapped
## 11 white -4.48 2.78 -1.61 OLS
## 12 white -6.54 5.39 -1.21 Bootstrapped
In this case, bootstrapping leads to much more diffuse standard errors for the white
variable and, to a lesser extent, the single-parent household variable. The latter is not affected from a null hypothesis test standpoint.
The big selling point of this approach is that it’s super flexible and you can do a lot of cool things. Downsides: it can be a bit time-consuming with larger data sets, and data sets that would violate the independence of errors would require additional considerations (e.g. some kind of block bootstrap).
No matter, nothing here is necessarily a “fix.” When you have heteroskedasticity, refit the model a few ways and see what happens to your standard errors. Be prepared to report this in an appendix.