Elsewhere on My Website

Some of what I write here is taken from my website, where the relevant posts live.

Be advised I might plagiarize myself here.

R Packages/Data for This Session

We’ll be using {stevedata} for some data and {tidyverse} for most things. I’m assuming you have yet to install {modelr}, {fixest}, and {lmtest}. We’ll need those for this session.

# Install any of these that aren't already installed.
packs <- c("lmtest", "fixest", "modelr")
new_pack <- packs[!(packs %in% installed.packages()[,"Package"])]
if(length(new_pack)) install.packages(new_pack)

Run that, and then let’s load them.

library(stevedata)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.5     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(fixest)
library(modelr)
options(warn=-1)

We’re going to tailor this lab script to what you should do when you violate one of a few core assumptions of OLS. We’ll focus on four here: linearity, normality, independence of errors, and constant error variances. In each case, we’ll talk about what the assumption means, what diagnostics you have at your disposal to see whether your data are violating the model assumptions, and what you can do about them.

Linearity

One of the assumptions of the OLS model is that the outcome y is a linear function of the explanatory variable(s). Models that violate the linearity assumption amount to a specification error of the kind described by Gujarati. In a situation like this, the linear model is not appropriate to describe the relationship between y and the explanatory variable(s). This amounts to a regression model that will be biased.

First, let’s do this with the data/application used by Meuleman et al. (2019). In their case, they want to regress trust in the police in Belgium on an individual respondent’s age (in years), gender (dummy for women), education (in years of full-time education), income (four categories), and an assessment of how successful the police are at preventing criminality (on an 11-point scale). Here’s the naive linear model.

summary(M1 <- lm(trstplc ~ agea + female + eduyrs +  hincfel + plcpvcr, data=ESSBE5,
                 na.action=na.exclude))
## 
## Call:
## lm(formula = trstplc ~ agea + female + eduyrs + hincfel + plcpvcr, 
##     data = ESSBE5, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1179 -1.0444  0.1995  1.2275  6.7873 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.824357   0.287112  13.320  < 2e-16 ***
## agea        -0.007021   0.002517  -2.789  0.00534 ** 
## female       0.003301   0.091945   0.036  0.97136    
## eduyrs       0.052333   0.012660   4.134 3.75e-05 ***
## hincfel     -0.267282   0.055594  -4.808 1.67e-06 ***
## plcpvcr      0.433734   0.024958  17.379  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.854 on 1642 degrees of freedom
##   (56 observations deleted due to missingness)
## Multiple R-squared:  0.1791, Adjusted R-squared:  0.1766 
## F-statistic: 71.63 on 5 and 1642 DF,  p-value: < 2.2e-16

Pro tip: if you have NAs and you don’t want to drop those observations yourself (because you’re lazy) and you’d rather lm() drop them for you, specify na.action=na.exclude in the lm() call. This will allow for some more convenience with helper functions.
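
A quick way to see that convenience with the model we just fit:

# Should return TRUE: with na.exclude, resid() is padded with NAs to line up
# row-for-row with ESSBE5, which is what lets us mutate() residuals back onto
# the data later in this script.
length(resid(M1)) == nrow(ESSBE5)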

Checking the Linearity Assumption

The authors describe an F-test for lack of fit. Truthfully, I’ve never liked this test. The intuition is straightforward, but I’ve never seen an application of it in R that I liked. I think you can assess linearity graphically instead, starting with the fitted-residual plot.

plot(M1, which=1)

Concerns about the somewhat discrete nature of the DV aside, I do see a slight, upward bow there for the higher fitted values.

I also really like the authors’ suggestion to compare a LOESS smoother to the linear fit in a faceted scatter plot. Here’s how you’d do that.

ESSBE5 %>%
  # Select only the variables we need
  select(trstplc:plcpvcr) %>%
  # tidyr::gather from wide to long.
  # This creates a handy three-column data frame where each value of each IV is matched with the DV for all observations
  gather(var, value, 2:ncol(.)) %>%
  # Create a ggplot where the x axis is the value of a given x variable.
  ggplot(.,aes(value, trstplc)) +
  # Create your facet now since the var variable is the particular x variable.
  # Make sure to set scale="free_x" because these x variables are all on different scales
  facet_wrap(~var, scale="free_x") +
  # scatterplot
  geom_point() +
  # linear smoother
  geom_smooth(method="lm") +
  # loess smoother, with different color
  geom_smooth(method="loess", color="black")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Here’s how you should interpret a plot like this: how much do the LOESS smoother and the LM smoother differ? My takeaway: not a whole lot. Be mindful that the LOESS smoother will want to travel toward the outlier. The cool thing about this approach is that a scatter plot like this will point you to some odd observations that you may want to punt out. For example, this is the second time I’ve seen a European respondent say they have 40+ years of full-time education. Which, c’mon, fam.

Potential Fixes for Non-Linearity

If you see any type of “U” shape, this implies a non-linear effect of x on y. The solution here is some kind of polynomial. The analog here with fake/hypothetical data would be something like lm(y ~ x + I(x^2), data=Data). More complicated shapes imply the need for some higher-order polynomial. My own hot #take here is you’re kind of losing the plot a little bit with the higher-order polynomials. In your own way, you’re fitting the model to the data (which is fine even as political scientists don’t like that practice much) and you’re still putting yourself in a position where you have to explain the results in a simple and coherent way.
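
For illustration, here’s a minimal sketch with fake data (the data-generating values and the FakeU name are made up for this example). The purely linear model misses the deliberate “U” shape; the model with the squared term picks it up.

set.seed(8675309)
tibble(x = rnorm(500, 0, 2),
       e = rnorm(500, 0, 1),
       y = 1 - 2*x + .5*(x^2) + e) -> FakeU # a deliberate "U" shape in x

summary(lm(y ~ x, data = FakeU))          # misses the curvature
summary(lm(y ~ x + I(x^2), data = FakeU)) # recovers it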

Depending on your particular application, you may also want to identify the particular x variable that is offending a linear relationship, break it into some kind of categorical variable, and treat it as a fixed effect. Here is a simple example (and one that could also be modeled by a polynomial). There is a well-known “U” effect of age on happiness. Age may not be a “treatment”, but there are still a lot of studies suggesting that 20-year-olds are relatively happy people and retirees are relatively happy too. It’s the people in their 30s and 40s who, like me, are miserable, depressed bastards. If you look at that “shape of happiness” chart, maybe you can do something like create an age bracket for the 16-35 group, another for the 36-60 group, and another for the 61-and-above group. Treat that as a fixed effect, possibly with the 36-60 people as the baseline.
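
Here’s a sketch of what that could look like with the age variable in the Belgium data. The cut points and the ESSBE5_ag name are purely illustrative; I’m not claiming age has a “U” effect on trust in the police.

ESSBE5 %>%
  # cut points are illustrative; cut() leaves missing ages as NA
  mutate(agegroup = cut(agea, breaks = c(-Inf, 35, 60, Inf),
                        labels = c("16-35", "36-60", "61 and above")),
         # make the 36-60 group the baseline
         agegroup = relevel(agegroup, ref = "36-60")) -> ESSBE5_ag

summary(lm(trstplc ~ agegroup + female + eduyrs + hincfel + plcpvcr,
           data = ESSBE5_ag, na.action = na.exclude))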

There’s a mixed effects model comment I want to add here, but that really is more of an advanced topic.

Normality

OLS assumes the errors are normally distributed. This is often conflated with an assumption that the outcome variable is normally distributed. That’s not quite what it is. It does imply that the conditional distribution of the dependent variable is normal but that is not equivalent to assuming the marginal distribution of the dependent variable is normal. At the end of the day, the assumption of normality is more about the errors than the dependent variable even as the assumption about the former does strongly imply an assumption about the latter.

Violating the assumption of a normal distribution of the errors is not as severe as a violation of some of the other assumptions. The normality assumption is not necessary for point estimates to be unbiased. In one prominent textbook on statistical methods, Gelman and Hill (2007, p. 46) say the normality assumption is not important at all because it has no strong implication for the regression line. I think this follows because Gelman and Hill (2007)—later Gelman, Hill, and Vehtari (2020)—are nowhere near as interested in null hypothesis testing as your typical social scientist likely is. No matter, violating the assumption of a normal distribution of errors has some implication for drawing a line that reasonably approximates individual data points (if not the line itself, per se). Thus, you may want to check it, certainly if you have a small data set.

Checking the Normality Assumption

There are any number of diagnostics here for testing whether the residuals are normally distributed. Basically: gank your model’s residuals and check if they are consistent with an assumption that they are drawn from a normal distribution. You can do this with a Shapiro-Wilk test or a Kolmogorov-Smirnov test.

shapiro.test(resid(M1))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(M1)
## W = 0.97739, p-value = 1.905e-15
ks.test(resid(M1), y=pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(M1)
## D = 0.15193, p-value < 2.2e-16
## alternative hypothesis: two-sided

In both applications, the assumption of a normal distribution of errors is violated. One caveat here is that normality tests are sensitive to even small deviations from normality when sample sizes are large. Under those conditions, they’re also sensitive to any kind of discreteness in the dependent variable and any heaping at the extremes.

You could also use our ol’ friend the Q-Q plot to show the same thing.

plot(M1, which=2)

There’s added information from the Q-Q plot since it’s saying that the negative residuals are more extreme than expected. Alas, compare that plot with an actual density plot of the residuals juxtaposed with a normal density function with a mean of 0 and a standard deviation equal to the standard deviation of the residuals.

sd_resid <- sd(as.numeric(resid(M1)), na.rm=T)


ESSBE5 %>%
  mutate(resid = resid(M1)) %>% 
  ggplot(.,aes(resid)) + 
  geom_density() +
  stat_function(fun = dnorm,
                args = list(mean = 0, sd = sd_resid),
                linetype="dashed", size=1.5)

Sure, it’s not normally distributed, but I also don’t think it’s wrong to say the normality assumption is approximated well even in a case where the DV has just 10 values.

Potential Fixes for Non-Normality of the Errors

This one is your call. In most applications where this is a severe issue, it’s because you’re trying to impose OLS on a binary DV or a five-item Likert. I will yell at you and throw things at you for doing that, but not necessarily for reasons related to this violation of OLS. There are rules of thumb for when you can take a technically ordinal/discrete DV and treat it as something you can model through OLS. In this case, I think this is fine and I agree with the authors doing it. In the case of a five-item Likert or a binary DV, it’s not fine and there’s better/more information to be gathered from a more appropriate model.

Further, it’s tough to divorce the normality assumption from the homoskedasticity assumption. They are clearly separate assumptions of OLS, but in my applications they tend to travel together. Do you have outliers in your dependent variable? If so, consider some kind of transformation. Do you have cornball outliers in your independent variables (e.g. that one Belgian who said s/he had 40+ years of full-time education)? If so, consider some kind of transformation or even punting that observation out of the model.

Independence (i.e. No Autocorrelation)

This one is a biggie. OLS assumes that observations are sampled independently and that any pair of errors are unrelated to each other. In formal terms, this assumption seems kind of Greek to students. In practice, think of it this way: a lot of the theoretical intuition behind OLS assumes something akin to a simple random sample of the population. You’ve probably learned about the central limit theorem by this point and how you can simulate it. However, you may have a data set that does not look like this.

In other words, this assumption is going to be violated like mad in any design that has a time-series, spatial, or multilevel component. For students new to this stuff, you should be able to anticipate ahead of time when your errors are no longer independent of each other.

The implications of non-independent errors—aka “serial correlation” or “autocorrelation”—are generally more about the variance of the estimator than the estimator itself. That said, there is reason to believe the standard errors are wrong and this can have important implications for statistical inference you’d like to do.

Checking the Independence (i.e. No Autocorrelation) Assumption

Honestly? Once you know under what conditions that autocorrelation is going to happen, you can probably skip this and go right to an appropriate model designed to deal with the nature of your data problem. The most common tests are the Breusch-Godfrey test and the Durbin-Watson test. Of the two, practitioners I’ve read seem to favor the former over the latter. There is also an issue as to what kind of higher-order autocorrelation you’re trying to detect. Both come in the {lmtest} package.

bgtest(M1)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  M1
## LM test = 2.1561, df = 1, p-value = 0.142
dwtest(M1)
## 
##  Durbin-Watson test
## 
## data:  M1
## DW = 2.0721, p-value = 0.9259
## alternative hypothesis: true autocorrelation is greater than 0

In our case, since the Belgium data are a simple sample of a population, there doesn’t seem to be any concern for serial correlation. P-values above a conventional threshold (e.g. .05) mean the data are consistent with a null hypothesis of no autocorrelation.

If you’re curious what it would look like to flunk one of these tests, let’s use some fake data I created for precisely this purpose.

fakeTSD
## # A tibble: 100 x 5
##     year     y    x1    x2      e
##    <int> <dbl> <dbl> <dbl>  <dbl>
##  1  1920  501.  3.01     1 -0.319
##  2  1921  504.  6.44     1  0.915
##  3  1922  503.  3.77     0  1.42 
##  4  1923  502.  9.06     0 -1.10 
##  5  1924  503.  7.13     1 -0.605
##  6  1925  503.  6.97     1 -1.23 
##  7  1926  501.  5.05     0 -2.09 
##  8  1927  502.  6.35     0 -0.934
##  9  1928  504.  6.14     0  0.296
## 10  1929  504.  6.81     0 -0.398
## # … with 90 more rows
?fakeTSD

These data were created to mimic a time-series type phenomenon where the outcome y is a linear function of 20 + (.25*year) + .25*x1 + 1*x2 + e. x1 is a continuous variable and x2 is a dummy variable. Of note: year determines the data-generating process, but “year” isn’t really a variable of interest. I.e. it’s a time-series.
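
If it helps to see that data-generating process as code, here’s a sketch of how data like these could be simulated. The fake_ts_sketch name and the exact distributions for x1 and x2 are guesses for illustration; fakeTSD itself ships with {stevedata}.

set.seed(8675309)
tibble(year = 1920:2019,
       x1 = rnorm(100, 5, 2),                     # guess at x1's distribution
       x2 = sample(c(0, 1), 100, replace = TRUE), # a dummy variable
       e = rnorm(100, 0, 1),
       y = 20 + .25*year + .25*x1 + 1*x2 + e) -> fake_ts_sketch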

Here is a plot of y by increasing year.

fakeTSD %>%
  ggplot(.,aes(year, y)) + geom_line() +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10))

Looks like a time-series, because it is a time-series.

Let’s see if we can go-go-gadget this through OLS.

summary(M2 <- lm(y ~ x1 + x2, data=fakeTSD))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = fakeTSD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3828  -6.2769   0.0628   6.6317  13.0777 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.134e+02  2.368e+00 216.764   <2e-16 ***
## x1          2.871e-03  4.061e-01   0.007    0.994    
## x2          1.407e+00  1.507e+00   0.934    0.353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.476 on 97 degrees of freedom
## Multiple R-squared:  0.00898,    Adjusted R-squared:  -0.01145 
## F-statistic: 0.4395 on 2 and 97 DF,  p-value: 0.6456

Nooope. Let’s get our L from the Breusch-Godfrey test and the Durbin-Watson test.

bgtest(M2)
## 
##  Breusch-Godfrey test for serial correlation of order up to 1
## 
## data:  M2
## LM test = 92.359, df = 1, p-value < 2.2e-16
dwtest(M2)
## 
##  Durbin-Watson test
## 
## data:  M2
## DW = 0.05506, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

Oh yeah, we earned that L.

Potential Fixes for Autocorrelation

You need a new model. In our fake time-series data, you can still reasonably approximate the true effects by including a lagged dependent variable as a predictor or by calculating y - lag(y) and regressing that difference on x1 and x2. No matter, the diagnostics for autocorrelation will still complain and practitioners are generally not sanguine about lagged DVs for a variety of reasons.
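
For illustration, here’s a minimal sketch of those two quick approximations with the fake data, using dplyr::lag() (the first row of the lag/difference is an NA that lm() silently drops).

fakeTSD %>%
  mutate(lag_y = lag(y),            # lagged DV
         diff_y = y - lag(y)) -> fakeTSD # first difference

summary(lm(y ~ lag_y + x1 + x2, data = fakeTSD)) # lagged DV as a predictor
summary(lm(diff_y ~ x1 + x2, data = fakeTSD))    # regress the difference on x1 and x2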

To reiterate the point: you need a new model. This class doesn’t have the opportunity to teach about some advanced topics, like time-series modeling, mixed effects models, and the like. No matter, you’ll want a model that explicitly accounts for what would be the source of autocorrelation. If I could teach an entire class about the cool shit you can do with a mixed effects model, I would.

Homoskedasticity (Constant Error Variances)

OLS assumes that the variance of the errors is the same regardless of the particular value of x. This is often called “homoskedasticity”, mostly in juxtaposition to the violation of this assumption: “heteroskedasticity.” The change of prefix should make clear what is implied here. Heteroskedasticity typically means the errors are a function of something else in the model.

The implications of violating the assumption of homoskedasticity are more nuisance than real problem. Unequal/non-constant variance does not affect the XB part of the regression model, but it does have implications for predicting individual data points. Here, the implication is similar to that of violating the assumption of normally distributed errors. Indeed, those two typically travel together in my experience.

Checking the Homoskedasticity Assumption

Our ol’ pal, the fitted-residual plot, should be illustrative for most things. First, let’s show what it looks like in an ideal case.

set.seed(8675309)
tibble(
  x = rnorm(1000, 5, 2),
  e = rnorm(1000, 0, 1),
  y = 1 + .25*x + e
) -> C # for constant

summary(M3 <- lm(y ~ x, data=C))
## 
## Call:
## lm(formula = y ~ x, data = C)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8524 -0.7030 -0.0324  0.6928  3.5313 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.95215    0.08499   11.20   <2e-16 ***
## x            0.26419    0.01588   16.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.026 on 998 degrees of freedom
## Multiple R-squared:  0.2171, Adjusted R-squared:  0.2163 
## F-statistic: 276.8 on 1 and 998 DF,  p-value: < 2.2e-16

^ The model recovers the true parameters well.

C %>%
  mutate(resid = resid(M3),
         fitted = fitted(M3)) %>%
  ggplot(.,aes(fitted, resid)) +
  geom_point()

^ The fitted-residual plot looks fine.

Now, let’s deliberately violate this assumption. Here, we’ll go with the most obvious manifestation of heteroskedasticity: the “cone of shame.” Notice that the errors are an increasing function of the x variable.

tibble(
  x = rnorm(1000, 5, 2),
  e = rnorm(1000, 0, 1 + .25*x), # uh-oh...
  y = 1 + .25*x + e
) -> H # for heteroskedasticity

summary(M4 <- lm(y ~ x, data=H))
## 
## Call:
## lm(formula = y ~ x, data = H)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4465 -1.4774 -0.0426  1.4169  8.8592 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.96810    0.18824   5.143 3.26e-07 ***
## x            0.24828    0.03499   7.096 2.44e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.276 on 998 degrees of freedom
## Multiple R-squared:  0.04803,    Adjusted R-squared:  0.04708 
## F-statistic: 50.35 on 1 and 998 DF,  p-value: 2.435e-12

The model still recovers the true parameters well, but notice the standard errors went up.

H %>%
  mutate(resid = resid(M4),
         fitted = fitted(M4)) %>%
  ggplot(.,aes(fitted, resid)) +
  geom_point()

Yeah, I see the cone of shame here. This is a pretty tell-tale case of heteroskedasticity.

There is a formal test for this, though: the Breusch-Pagan test, which comes as bptest() in the {lmtest} package. The null hypothesis in the test is homoskedasticity. If you can reject that (via low-enough p-value), you have evidence of heteroskedasticity. Observe:

bptest(M3)
## 
##  studentized Breusch-Pagan test
## 
## data:  M3
## BP = 1.1241, df = 1, p-value = 0.289
bptest(M4)
## 
##  studentized Breusch-Pagan test
## 
## data:  M4
## BP = 76.113, df = 1, p-value < 2.2e-16

For giggles, observe that the heteroskedastic data also imply a non-normal distribution of the errors. In this case, the Shapiro-Wilk test doesn’t see a problem, but the Kolmogorov-Smirnov test does. To be clear, these are different assumptions of OLS. But, in my experience, if I have one problem, I have the other too.

shapiro.test(resid(M3))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(M3)
## W = 0.99805, p-value = 0.3037
shapiro.test(resid(M4))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(M4)
## W = 0.99834, p-value = 0.4541
ks.test(resid(M3), y=pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(M3)
## D = 0.026671, p-value = 0.4754
## alternative hypothesis: two-sided
ks.test(resid(M4), y=pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(M4)
## D = 0.18307, p-value < 2.2e-16
## alternative hypothesis: two-sided

Potential Fixes for Heteroskedasticity

There is no shortage of fixes for heteroskedasticity. First, I want to bring in a new data set that I love using for this purpose because it’s super illustrative of what heteroskedasticity will look like in a simple case. af_crime93 comes via Table 9.1 of the 3rd edition of Agresti and Finlay’s Statistical Methods for the Social Sciences. The data come from the Statistical Abstract of the United States and most variables were measured in 1993.

I have this in the proto version of my {stevedata} package right now, but you can load it off the course website.

af_crime93 <- readRDS(url("http://post8000.svmiller.com/lab-scripts/af_crime93.rds"))
af_crime93
## # A tibble: 51 x 8
##    state violent murder poverty single metro white highschool
##    <chr>   <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl>      <dbl>
##  1 AK        761   9       9.10   14.3  41.8  75.2       86.6
##  2 AL        780  11.6    17.4    11.5  67.4  73.5       66.9
##  3 AR        593  10.2    20      10.7  44.7  82.9       66.3
##  4 AZ        715   8.60   15.4    12.1  84.7  88.6       78.7
##  5 CA       1078  13.1    18.2    12.5  96.7  79.3       76.2
##  6 CO        567   5.80    9.90   12.1  81.8  92.5       84.4
##  7 CT        456   6.30    8.5    10.1  95.7  89         79.2
##  8 DE        686   5      10.2    11.4  82.7  79.4       77.5
##  9 FL       1206   8.90   17.8    10.6  93    83.5       74.4
## 10 GA        723  11.4    13.5    13    67.7  70.8       70.9
## # … with 41 more rows

In this application, Agresti and Finlay try to model the state’s violent crime rate (per 100,000 people in the population) as a function of 1) the percent of the state with income below the poverty level (poverty), 2) the percent of families in the state headed by a single parent (single), 3) the percent of the population living in metropolitan areas (metro), 4) the percentage of the state that is white (white), and 5) the percent of the state that graduated from high school (highschool).

Let’s run a naive model and see what happens.

summary(M5 <- lm(violent ~ poverty + single + metro + white + highschool, data=af_crime93))
## 
## Call:
## lm(formula = violent ~ poverty + single + metro + white + highschool, 
##     data = af_crime93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -533.27  -87.19   -9.82  110.46  397.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1795.904    668.788  -2.685   0.0101 *  
## poverty        26.244     11.083   2.368   0.0222 *  
## single        109.467     20.360   5.377  2.6e-06 ***
## metro           7.609      1.295   5.874  4.8e-07 ***
## white          -4.483      2.779  -1.613   0.1137    
## highschool      8.646      7.826   1.105   0.2751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 180.2 on 45 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8332 
## F-statistic: 50.93 on 5 and 45 DF,  p-value: < 2.2e-16

It looks like the poverty, single-parent family, and metro variables are statistically significant. The white variable looks like it’s almost significant at the .10 level too.

Cool, let’s do the Breusch-Pagan test.

bptest(M5)
## 
##  studentized Breusch-Pagan test
## 
## data:  M5
## BP = 15.397, df = 5, p-value = 0.008793

…and we flunked it. All right, let’s take some inventory as to what happened here with the fitted-residual plot.

af_crime93 %>%
  mutate(fitted = fitted(M5),
         resid = resid(M5)) %>%
  ggplot(.,aes(fitted, resid)) + geom_point()

Oh wow. Fitted-residual plots shouldn’t look like this. The post on my website offers a bit more information as to the offending cases, but let’s go identify what seem to be the biggest offenders here.

af_crime93 %>%
  mutate(resid = resid(M5)) %>%
  filter(abs(resid) >= 300)
## # A tibble: 4 x 9
##   state violent murder poverty single metro white highschool resid
##   <chr>   <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 FL       1206   8.90    17.8   10.6  93    83.5       74.4  398.
## 2 LA       1062  20.3     26.4   14.9  75    66.7       68.3 -328.
## 3 MS        434  13.5     24.7   14.7  30.7  63.3       64.3 -533.
## 4 DC       2922  78.5     26.4   22.1 100    31.8       73.1  355.

So, what can we do here? Well, for one, I’m a little convinced just from eyeballing these data that DC is a weird state (i.e. not technically a state, but should be a state). In a lot of ways, DC looks nothing like the other 50 because of the nature of its territorial endowment and population. It’s a city-state.

First, let me verify that intuition.

af_crime93 %>%
  select(-murder) %>%
  gather(var, val, 3:ncol(.)) %>%
  ggplot(.,aes(val, violent)) +
  geom_point() + facet_wrap(~var, scale="free_x")

af_crime93 %>%
  filter(state == "DC")
## # A tibble: 1 x 8
##   state violent murder poverty single metro white highschool
##   <chr>   <dbl>  <dbl>   <dbl>  <dbl> <dbl> <dbl>      <dbl>
## 1 DC       2922   78.5    26.4   22.1   100  31.8       73.1

Yeah, that’s DC.

One thing I can do here is just punt out DC and see what happens.

summary(M6 <- lm(violent ~ poverty + single + metro + white + highschool, data=subset(af_crime93, state != "DC")))
## 
## Call:
## lm(formula = violent ~ poverty + single + metro + white + highschool, 
##     data = subset(af_crime93, state != "DC"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -393.47 -117.63   -8.91  122.64  404.11 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1173.073    632.397  -1.855 0.070309 .  
## poverty        21.278     10.123   2.102 0.041321 *  
## single         80.893     20.290   3.987 0.000249 ***
## metro           7.542      1.170   6.444  7.5e-08 ***
## white          -2.486      2.581  -0.963 0.340774    
## highschool      3.292      7.250   0.454 0.652057    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 162.8 on 44 degrees of freedom
## Multiple R-squared:  0.7282, Adjusted R-squared:  0.6973 
## F-statistic: 23.58 on 5 and 44 DF,  p-value: 1.919e-11

One takeaway I’m seeing already (and wrote about on my blog a bit): the discernibility of that white coefficient is almost entirely a function of DC. DC is a majority non-white (should-be-a) state and has a high violent crime rate in these data. There really is no discernible effect of the white variable without it.

But, let’s see what the Breusch-Pagan test says.

bptest(M6)
## 
##  studentized Breusch-Pagan test
## 
## data:  M6
## BP = 5.9924, df = 5, p-value = 0.307

…and we passed. DC alone will flunk a Breusch-Pagan test in this application. If you have a simple case like this where it’s just one observation that is messing with an overall picture, you can punt it out.

The textbook solution is weighted least squares (WLS) estimation. The process here is a little convoluted and there is probably a simple wrapper for it somewhere, but here’s what it looks like: 1) run the model (which we already did); 2) gank the residuals and fitted values; 3) regress the absolute value of the residuals on the fitted values of the original model; 4) extract the fitted values from that auxiliary regression; 5) square them; 6) take 1 over those squared values; and 7) apply those as weights in the linear model once more for a re-estimation.
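
Unpacked into its component steps, that looks something like this sketch (the intermediate names are just for illustration).

# 1) the original model is M5, estimated above
# 2) gank the residuals and fitted values
abs_resid <- abs(residuals(M5))
fit_vals <- fitted(M5)
# 3) regress the absolute residuals on the fitted values
aux <- lm(abs_resid ~ fit_vals)
# 4) extract the fitted values of that auxiliary regression,
# 5) square them, and 6) take 1 over those squared values
wts_sketch <- 1/(fitted(aux)^2)
# 7) re-estimate the model with weights = wts_sketch (that is M7 below)

The compact pipeline below collapses steps 2 through 6 into a single mutate() call.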

af_crime93 %>%
  mutate(wts =  1/fitted(lm(abs(residuals(M5))~fitted(M5)) )^2) -> af_crime93


summary(M7 <- lm(violent ~ poverty + single + metro + white + highschool, data=af_crime93, weights=wts))
## 
## Call:
## lm(formula = violent ~ poverty + single + metro + white + highschool, 
##     data = af_crime93, weights = wts)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.93201 -0.89660 -0.07732  0.85497  2.70104 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1403.562    628.576  -2.233   0.0306 *  
## poverty        25.593     10.252   2.496   0.0163 *  
## single         88.909     18.979   4.685 2.61e-05 ***
## metro           6.663      1.117   5.964 3.53e-07 ***
## white          -3.548      2.472  -1.435   0.1582    
## highschool      6.400      7.052   0.907   0.3690    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.268 on 45 degrees of freedom
## Multiple R-squared:  0.7827, Adjusted R-squared:  0.7585 
## F-statistic: 32.42 on 5 and 45 DF,  p-value: 7.452e-14

We can compare the unweighted OLS with the WLS estimates.

broom::tidy(M5) %>% mutate(cat = "OLS") %>%
  bind_rows(., broom::tidy(M7) %>% mutate(cat = "WLS")) %>%
  arrange(term)
## # A tibble: 12 x 6
##    term        estimate std.error statistic     p.value cat  
##    <chr>          <dbl>     <dbl>     <dbl>       <dbl> <chr>
##  1 (Intercept) -1796.      669.      -2.69  0.0101      OLS  
##  2 (Intercept) -1404.      629.      -2.23  0.0306      WLS  
##  3 highschool      8.65      7.83     1.10  0.275       OLS  
##  4 highschool      6.40      7.05     0.907 0.369       WLS  
##  5 metro           7.61      1.30     5.87  0.000000480 OLS  
##  6 metro           6.66      1.12     5.96  0.000000353 WLS  
##  7 poverty        26.2      11.1      2.37  0.0222      OLS  
##  8 poverty        25.6      10.3      2.50  0.0163      WLS  
##  9 single        109.       20.4      5.38  0.00000260  OLS  
## 10 single         88.9      19.0      4.68  0.0000261   WLS  
## 11 white          -4.48      2.78    -1.61  0.114       OLS  
## 12 white          -3.55      2.47    -1.43  0.158       WLS

You could also do some on-the-fly heteroskedasticity corrections to your standard errors. There are a lot to choose from here, and some are more tedious than others. I’ll use the {fixest} package here because it’s the most convenient. You may want to look into the {sandwich} package as well; a quick sketch of that appears after the comparison below.

summary(M8 <- feols(violent ~ poverty + single + metro + white + highschool, data=af_crime93, se = "hetero"))
## OLS estimation, Dep. Var.: violent
## Observations: 51 
## Standard-errors: Heteroskedasticity-robust 
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1795.9000   751.0300 -2.3913 0.021033 *  
## poverty        26.2440    11.4620  2.2897 0.026781 *  
## single        109.4700    24.4210  4.4824 0.000050 ***
## metro           7.6088     1.6439  4.6285 0.000031 ***
## white          -4.4829     3.4354 -1.3049 0.198554    
## highschool      8.6464     8.0841  1.0696 0.290518    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 169.2   Adj. R2: 0.833152

Let’s compare these heteroskedasticity corrected standard errors with the naive OLS standard errors.

broom::tidy(M5) %>% mutate(cat = "OLS") %>%
  bind_rows(., broom::tidy(M8) %>% mutate(cat = "HCSE")) %>%
  arrange(term)
## # A tibble: 12 x 6
##    term        estimate std.error statistic     p.value cat  
##    <chr>          <dbl>     <dbl>     <dbl>       <dbl> <chr>
##  1 (Intercept) -1796.      669.       -2.69 0.0101      OLS  
##  2 (Intercept) -1796.      751.       -2.39 0.0210      HCSE 
##  3 highschool      8.65      7.83      1.10 0.275       OLS  
##  4 highschool      8.65      8.08      1.07 0.291       HCSE 
##  5 metro           7.61      1.30      5.87 0.000000480 OLS  
##  6 metro           7.61      1.64      4.63 0.0000313   HCSE 
##  7 poverty        26.2      11.1       2.37 0.0222      OLS  
##  8 poverty        26.2      11.5       2.29 0.0268      HCSE 
##  9 single        109.       20.4       5.38 0.00000260  OLS  
## 10 single        109.       24.4       4.48 0.0000504   HCSE 
## 11 white          -4.48      2.78     -1.61 0.114       OLS  
## 12 white          -4.48      3.44     -1.30 0.199       HCSE

The biggest inferential implications are for the white coefficient, but do note the t-statistics for single and metro dropped considerably as well.
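
For reference, if you’d rather stay with the lm() object, the {sandwich} package mentioned above pairs with lmtest::coeftest() for a similar correction. This is just a sketch: you may need to install {sandwich} first, and “HC1” is only one of several available estimator types.

library(sandwich)
coeftest(M5, vcov. = vcovHC(M5, type = "HC1")) # robust SEs for the naive OLS model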

You could also go back to some basics and acknowledge that outliers (i.e. DC) and right-skewed data are responsible for flunking the assumption of homoskedastic error variances. We could transform the data to impose some kind of normality here. First, let’s check if we have any 0s, since a log transformation can’t handle those.

af_crime93 %>% summary
##     state              violent           murder          poverty     
##  Length:51          Min.   :  82.0   Min.   : 1.600   Min.   : 8.00  
##  Class :character   1st Qu.: 326.5   1st Qu.: 3.900   1st Qu.:10.70  
##  Mode  :character   Median : 515.0   Median : 6.800   Median :13.10  
##                     Mean   : 612.8   Mean   : 8.727   Mean   :14.26  
##                     3rd Qu.: 773.0   3rd Qu.:10.350   3rd Qu.:17.40  
##                     Max.   :2922.0   Max.   :78.500   Max.   :26.40  
##      single          metro            white         highschool   
##  Min.   : 8.40   Min.   : 24.00   Min.   :31.80   Min.   :64.30  
##  1st Qu.:10.05   1st Qu.: 49.55   1st Qu.:79.35   1st Qu.:73.50  
##  Median :10.90   Median : 69.80   Median :87.60   Median :76.70  
##  Mean   :11.33   Mean   : 67.39   Mean   :84.11   Mean   :76.22  
##  3rd Qu.:12.05   3rd Qu.: 83.95   3rd Qu.:92.60   3rd Qu.:80.10  
##  Max.   :22.10   Max.   :100.00   Max.   :98.50   Max.   :86.60  
##       wts           
##  Min.   :9.478e-06  
##  1st Qu.:4.488e-05  
##  Median :5.726e-05  
##  Mean   :6.952e-05  
##  3rd Qu.:7.713e-05  
##  Max.   :1.988e-04

We don’t. So, let’s log-transform the DV and regress that on the x variables.

summary(M9 <- lm(log(violent) ~ poverty + single + metro + white + highschool, data=af_crime93))
## 
## Call:
## lm(formula = log(violent) ~ poverty + single + metro + white + 
##     highschool, data = af_crime93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87452 -0.21997  0.00369  0.29477  0.70716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.862626   1.391604   2.776   0.0080 ** 
## poverty      0.029121   0.023062   1.263   0.2132    
## single       0.126712   0.042365   2.991   0.0045 ** 
## metro        0.017229   0.002695   6.393 8.14e-08 ***
## white       -0.001451   0.005783  -0.251   0.8030    
## highschool  -0.007196   0.016284  -0.442   0.6607    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3749 on 45 degrees of freedom
## Multiple R-squared:  0.7336, Adjusted R-squared:  0.704 
## F-statistic: 24.79 on 5 and 45 DF,  p-value: 6.633e-12
bptest(M9) 
## 
##  studentized Breusch-Pagan test
## 
## data:  M9
## BP = 9.221, df = 5, p-value = 0.1006

And we passed. Barely at the .10 level, but I’ll count that. Here, log-transforming the DV helped considerably. The log-transformed DV does have the effect of suggesting there is no poverty effect.

Finally, we can do what I think is one of the greatest parlor tricks in statistics: bootstrapping. I’m going to have to punt on the theory and exact principles of bootstrapping to classic texts, but the intuition here is, in the absence of a true “population”, to treat the sample as a population of sorts and sample, with replacement, from it. There’s a lot more to say about this than I have the space for here. If you have some complicated data structures, you’ll need to adjust your bootstrapping to account for that.

The {modelr} package will make the bootstrapping simpler for us. First, let’s sample, with replacement, from the af_crime93 data 1,000 times. The data we have are small, so we can’t get too crazy. Efron, I think, recommends 250-500 resamples, but 1,000 is a round number that is often a reasonable target.

set.seed(8675309) # Jenny, I got your number...

af_crime93 %>%
  bootstrap(1000) -> bootCrime

Notice we have 1,000 resamples of the original data.

bootCrime
## # A tibble: 1,000 x 2
##    strap      .id  
##    <list>     <chr>
##  1 <resample> 0001 
##  2 <resample> 0002 
##  3 <resample> 0003 
##  4 <resample> 0004 
##  5 <resample> 0005 
##  6 <resample> 0006 
##  7 <resample> 0007 
##  8 <resample> 0008 
##  9 <resample> 0009 
## 10 <resample> 0010 
## # … with 990 more rows

Here’s the first one. Interestingly, Colorado appears in this five times. That seems like it’s oversampling Colorado, which it would be if this were our only resample. However, there will be cases where Colorado doesn’t appear at all. That’s the idea here.

bootCrime %>% 
  slice(1) %>% # grab the first row in this special tbl
  pull(strap) %>% # focus on the resample, momentarily a full list
  as.data.frame() %>% # cough up the full data
  as_tibble() %>%
  arrange(state) %>% # arrange by state, for ease of reading
  select(-murder, -wts)
## # A tibble: 51 x 7
##    state violent poverty single metro white highschool
##    <chr>   <dbl>   <dbl>  <dbl> <dbl> <dbl>      <dbl>
##  1 AL        780   17.4    11.5  67.4  73.5       66.9
##  2 AR        593   20      10.7  44.7  82.9       66.3
##  3 CA       1078   18.2    12.5  96.7  79.3       76.2
##  4 CO        567    9.90   12.1  81.8  92.5       84.4
##  5 CO        567    9.90   12.1  81.8  92.5       84.4
##  6 CO        567    9.90   12.1  81.8  92.5       84.4
##  7 CO        567    9.90   12.1  81.8  92.5       84.4
##  8 CO        567    9.90   12.1  81.8  92.5       84.4
##  9 CT        456    8.5    10.1  95.7  89         79.2
## 10 DE        686   10.2    11.4  82.7  79.4       77.5
## # … with 41 more rows

Now, let’s use {purrr} magic and run our regression on each of these resamples. Then, we’re going to tidy them up, pull the results, and create a nice tidy data frame of those.

bootCrime %>% 
  mutate(lm = map(strap, ~lm(violent ~ poverty + single + metro + white + highschool, 
                             data = .)),
         tidy = map(lm, broom::tidy)) %>%
  pull(tidy) %>%
  map2_df(., # map to return a data frame
          seq(1, 1000), # make sure to get this seq right. We did this 1000 times.
          ~mutate(.x, resample = .y)) -> tidybootCrime

Now, in a case like this, your bootstrapped regression parameter (i.e. coefficient or intercept) is the mean of the estimates across resamples and your bootstrapped standard error is the standard deviation of those estimates. That part may not be obvious: it’s not the mean of the standard errors; it’s the standard deviation of the coefficient estimates.

tidybootCrime %>%
  group_by(term) %>%
  summarize(boot_est = mean(estimate),
            boot_se = sd(estimate)) -> bootsums

bootsums
## # A tibble: 6 x 3
##   term        boot_est boot_se
## * <chr>          <dbl>   <dbl>
## 1 (Intercept) -1377.    798.  
## 2 highschool      8.29    8.93
## 3 metro           7.27    1.59
## 4 poverty        25.3    11.7 
## 5 single         93.5    34.3 
## 6 white          -6.54    5.39

You can compare the results of this bootstrapped summary to your “main” model. FWIW, you can compare the bootstrapped coefficients as well, but most practical interest is in what bootstrapping does to the standard errors relative to your OLS model. So, it might make more sense to just compare the standard errors.

broom::tidy(M5) %>%
  mutate(cat = "OLS") %>%
  bind_rows(., bootsums %>% rename(estimate = boot_est, `std.error` = boot_se) %>% mutate(cat = "Bootstrapped")) %>%
  arrange(term) %>%
  mutate(statistic = ifelse(is.na(statistic), estimate/std.error, statistic)) %>%
  select(-p.value)
## # A tibble: 12 x 5
##    term        estimate std.error statistic cat         
##    <chr>          <dbl>     <dbl>     <dbl> <chr>       
##  1 (Intercept) -1796.      669.      -2.69  OLS         
##  2 (Intercept) -1377.      798.      -1.72  Bootstrapped
##  3 highschool      8.65      7.83     1.10  OLS         
##  4 highschool      8.29      8.93     0.929 Bootstrapped
##  5 metro           7.61      1.30     5.87  OLS         
##  6 metro           7.27      1.59     4.58  Bootstrapped
##  7 poverty        26.2      11.1      2.37  OLS         
##  8 poverty        25.3      11.7      2.17  Bootstrapped
##  9 single        109.       20.4      5.38  OLS         
## 10 single         93.5      34.3      2.72  Bootstrapped
## 11 white          -4.48      2.78    -1.61  OLS         
## 12 white          -6.54      5.39    -1.21  Bootstrapped

In this case, bootstrapping leads to much more diffuse standard errors for the white variable and, to a lesser extent, the single-parent household variable. The latter is not affected from a null hypothesis test standpoint.

The upside of this approach is that it’s super flexible and you can do a lot of cool things with it. Downsides: it can be a bit time-consuming with larger data sets, and data sets that would violate the independence of errors require additional considerations (e.g. a block or cluster bootstrap).

No matter, nothing here is necessarily a “fix.” When you have heteroskedasticity, refit the model a few ways and see what happens to your standard errors. Be prepared to report this in an appendix.