Abstract
This is a lab script for POST 8000, a graduate-level quantitative methods for public policy class that I teach at Clemson University. It will not be the most sophisticated R-related write-up of mine (check my blog for those), but it should be useful for discussion around the associated R script for the week's 'lab' session.

knitr::opts_chunk$set(warning=FALSE, message=FALSE)
Some of what I write here is taken from my website, so please read the related posts there. Be advised I might plagiarize myself here.
Packages will be fairly minimal for this lab session. {tidyverse} will be doing most things. {stevedata} has our data. {lmtest} has some diagnostics for us. Let's load those now. The {knitr} chunk above will make the code output less verbose.
library(tidyverse)
library(stevedata)
library(lmtest)
TV16 comes from the 2016 Cooperative Congressional Election Study (CCES) data. It allows us to model the individual-level correlates of the self-reported Trump vote in 2016 using data from all 50 states, stratified on state population and other demographics. Importantly, there are more Californians than South Carolinians, so data of this sort should (and here: does) reflect that.

?TV16
TV16
## # A tibble: 64,600 x 21
## uid state votetrump age female collegeed racef famincr ideo pid7na
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 New … 1 47 1 0 White NA 3 5
## 2 2 Loui… 1 22 1 0 White 6 3 4
## 3 3 Miss… NA 52 1 0 Black 4 5 1
## 4 4 Alab… NA 28 1 0 Black 1 4 4
## 5 5 Colo… 0 34 1 1 White 7 2 2
## 6 6 Alab… NA 53 1 0 Mixed 1 NA 2
## 7 7 Texas 1 54 0 0 White 3 5 6
## 8 8 Penn… NA 25 1 0 White 1 3 4
## 9 9 Geor… 1 53 0 0 White 4 4 7
## 10 10 Penn… 1 59 1 0 White 7 3 7
## # … with 64,590 more rows, and 11 more variables: bornagain <dbl>,
## # religimp <dbl>, churchatd <dbl>, prayerfreq <dbl>, angryracism <dbl>,
## # whiteadv <dbl>, fearraces <dbl>, racerare <dbl>, lrelig <dbl>,
## # lcograc <dbl>, lemprac <dbl>
We'll start simple, looking at just white voters in what was a crucial swing state in that election: Pennsylvania.
TV16 %>%
  filter(state == "Pennsylvania" & racef == "White") -> Penn
Let’s try a simple linear model for this. Herein, let’s estimate a white person’s vote for Trump (i.e. 1 = they did, 0 = they did not) as a function of age, gender, education (if they have a college diploma), household income, partisanship (D to R), ideology (L to R), and whether they’re a born-again Christian. Let’s get a summary too.
M1 <- lm(votetrump ~ age + female + collegeed + famincr +
           pid7na + ideo + bornagain,
         data = Penn, na.action=na.exclude)
summary(M1)
##
## Call:
## lm(formula = votetrump ~ age + female + collegeed + famincr +
## pid7na + ideo + bornagain, data = Penn, na.action = na.exclude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1200 -0.1848 0.0051 0.1942 1.0275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2640879 0.0432034 -6.113 1.20e-09 ***
## age 0.0010567 0.0005351 1.975 0.0484 *
## female -0.0250871 0.0164162 -1.528 0.1266
## collegeed -0.1058383 0.0183746 -5.760 9.86e-09 ***
## famincr -0.0023557 0.0029267 -0.805 0.4210
## pid7na 0.1179025 0.0046564 25.321 < 2e-16 ***
## ideo 0.1024457 0.0101437 10.099 < 2e-16 ***
## bornagain 0.0263359 0.0200398 1.314 0.1890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3377 on 1813 degrees of freedom
## (1080 observations deleted due to missingness)
## Multiple R-squared: 0.5453, Adjusted R-squared: 0.5435
## F-statistic: 310.6 on 7 and 1813 DF, p-value: < 2.2e-16
The summary of this model suggests there are "statistically significant" effects of age (+), college education (-), D to R partisanship (+, which, no shit), and L to C ideology (+, which, again, no shit).
What you (well: I) did here is euphemistically called the "linear probability model." Its treatment often comes wrapped in lots of exposition and justification, but the model itself reduces to the imposition of an ordinary least squares (OLS) model that regresses a binary dependent variable on a set of covariates. The interpretation is that the regression coefficients communicate the change in the probability that y = 1, holding the other regressors constant. Thus, looking at the ideology coefficient, a one-unit change in the ideology variable increases the probability of an individual-level Trump vote among these white Pennsylvanians by about .102.
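If it helps to see that interpretation mechanically, here is a minimal sketch. The two respondent profiles below are hypothetical (I made up the particular values of age, income, and the rest purely for illustration); they differ only by one unit of ideology, so the difference in their fitted values is exactly the ideology coefficient.

# Two hypothetical, otherwise-identical profiles that differ by one unit of ideology.
# The covariate values below are made up for illustration.
newdat <- tibble(age = 50, female = 0, collegeed = 0, famincr = 6,
                 pid7na = 4, ideo = c(3, 4), bornagain = 0)

diff(predict(M1, newdata = newdat)) # ~.102, the ideo coefficient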
You might have been able to gather from the lecture by this point that I hate this interpretation for reasons I hope to elaborate throughout this lab script. No matter: remember that we can assess the veracity of the OLS model through various diagnostic procedures.
Let’s check the linearity assumption.
plot(M1, which=1)
Okay, fitted-residual plots should not look like this. We’re already off to a not-good start.
Let’s test for potential auto-correlation.
bgtest(M1)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: M1
## LM test = 0.63752, df = 1, p-value = 0.4246
dwtest(M1)
##
## Durbin-Watson test
##
## data: M1
## DW = 1.9605, p-value = 0.1972
## alternative hypothesis: true autocorrelation is greater than 0
We're golden here, pony boy. We did select on just white people in Pennsylvania. The moment we start lumping in people from other states or people from other races is when we're probably going to get some important spatial heterogeneity here. No matter, we've satisfied the independence assumption with this small, focused sample.
Okay, let’s check for heteroskedasticity with the Breusch-Pagan test.
bptest(M1)
##
## studentized Breusch-Pagan test
##
## data: M1
## BP = 19.268, df = 7, p-value = 0.007389
Yeah, no shit you have heteroskedastic errors. You only have two possible responses! Under these conditions, you should expect that your errors are no longer normally distributed. The Shapiro-Wilk test, the Kolmogorov-Smirnov test, and the Q-Q plot will all tell you this.
shapiro.test(resid(M1))
##
## Shapiro-Wilk normality test
##
## data: resid(M1)
## W = 0.98705, p-value = 9.872e-12
ks.test(resid(M1), y=pnorm)
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(M1)
## D = 0.24783, p-value < 2.2e-16
## alternative hypothesis: two-sided
plot(M1, which=2)
To be clear, the normality assumption is about the errors and not the dependent variable, even as the normality assumption of the errors does kind of imply that the conditional distribution of the dependent variable is normal.
No matter, I want you to take stock of what happened here: the fitted-residual plot is a mess, the errors are not normally distributed, and the errors are heteroskedastic. That’s on you (well: me) for running an OLS model on a binary dependent variable and asking it to make sense.
An econometrician may object that this is still fine. After all, the normality assumption is a strong one and the implication for violating it is not as severe as the other ones. Further, heteroskedasticity is more about the variance of the estimator and not the estimate itself. Further, the logistic curve is kind of linear in the middle. Under those conditions, you can still think of the coefficients as communicating changes in probability.
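For what it's worth, that econometrician would also typically pair the LPM with heteroskedasticity-consistent ("robust") standard errors. Here is a minimal sketch of what that looks like; it assumes the {sandwich} package, which is not part of this lab's minimal package list.

library(sandwich) # not loaded above; provides vcovHC()

coeftest(M1, vcov = vcovHC(M1, type = "HC1"))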
My retort to that is that it’s not clear to me you necessarily should when the fitted values from the model will break the bounds of probability. Observe:
Penn %>%
  mutate(yhat = predict(M1)) %>%
  ggplot(.,aes(yhat, votetrump)) +
  geom_point(alpha=0.2, color="black") +
  geom_vline(xintercept = 1, linetype="dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  annotate("text", label="Impossible\npredictions\n(yhat > 1)",
           x = 1.05, y=.5, hjust="left") +
  annotate("text", label="Impossible\npredictions\n(yhat < 0)",
           x = -.05, y=.5, hjust="right")
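You can also just count how many of the fitted values fall outside the bounds of probability. A minimal sketch (the column names here are mine, just for illustration):

Penn %>%
  mutate(yhat = predict(M1)) %>%
  summarize(n_impossible = sum(yhat < 0 | yhat > 1, na.rm = TRUE),
            prop_impossible = mean(yhat < 0 | yhat > 1, na.rm = TRUE))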
I'm willing to be flexible and open to interpretation about linear models on things like four-item DVs (definitely don't), five-item Likerts (enh, also don't do that…), seven-item ordinal variables (I'm listening…), or ordinal items on a larger scale like 10 points (about where I'd be).

Under those circumstances, check your y-hats. Anything outside the bounds of the scale should caution you against using OLS given the nature of the dependent variable you have. When all y-hats are within that scale, you might spin a ball of yarn to me about how the discrete nature of the DV is just our only measurement of some underlying continuum. I'm willing to hear those out, under those circumstances. However, the answer, for me, is always NO when it comes to linear models on binary DVs. Instead, use a model designed for the distribution of the dependent variable. You can use a probit model, though I prefer the logistic regression and will discuss it here.
You can fit a logistic regression with the glm() function. Specify the family you want, which will be binomial(link="logit").
M2 <- glm(votetrump ~ age + female + collegeed + famincr +
            pid7na + ideo + bornagain,
          data = Penn, na.action=na.exclude,
          family = binomial(link="logit"))
summary(M2)
##
## Call:
## glm(formula = votetrump ~ age + female + collegeed + famincr +
## pid7na + ideo + bornagain, family = binomial(link = "logit"),
## data = Penn, na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9931 -0.4634 0.1438 0.4714 2.7298
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.601772 0.447840 -12.508 < 2e-16 ***
## age 0.009862 0.004897 2.014 0.0440 *
## female -0.169867 0.148032 -1.148 0.2512
## collegeed -0.929730 0.169372 -5.489 4.04e-08 ***
## famincr -0.025144 0.026438 -0.951 0.3416
## pid7na 0.705843 0.040902 17.257 < 2e-16 ***
## ideo 0.930848 0.097958 9.502 < 2e-16 ***
## bornagain 0.310911 0.181053 1.717 0.0859 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2522.4 on 1820 degrees of freedom
## Residual deviance: 1274.4 on 1813 degrees of freedom
## (1080 observations deleted due to missingness)
## AIC: 1290.4
##
## Number of Fisher Scoring iterations: 5
Let’s compare the two models. We’ll suppress the intercepts to focus on the important stuff.
broom::tidy(M1) %>%
  mutate(model = "LPM") %>%
  bind_rows(., broom::tidy(M2) %>% mutate(model = "Logit")) %>%
  arrange(term) %>% filter(term != "(Intercept)")
## # A tibble: 14 x 6
## term estimate std.error statistic p.value model
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 age 0.00106 0.000535 1.97 4.84e- 2 LPM
## 2 age 0.00986 0.00490 2.01 4.40e- 2 Logit
## 3 bornagain 0.0263 0.0200 1.31 1.89e- 1 LPM
## 4 bornagain 0.311 0.181 1.72 8.59e- 2 Logit
## 5 collegeed -0.106 0.0184 -5.76 9.86e- 9 LPM
## 6 collegeed -0.930 0.169 -5.49 4.04e- 8 Logit
## 7 famincr -0.00236 0.00293 -0.805 4.21e- 1 LPM
## 8 famincr -0.0251 0.0264 -0.951 3.42e- 1 Logit
## 9 female -0.0251 0.0164 -1.53 1.27e- 1 LPM
## 10 female -0.170 0.148 -1.15 2.51e- 1 Logit
## 11 ideo 0.102 0.0101 10.1 2.28e- 23 LPM
## 12 ideo 0.931 0.0980 9.50 2.05e- 21 Logit
## 13 pid7na 0.118 0.00466 25.3 2.29e-121 LPM
## 14 pid7na 0.706 0.0409 17.3 9.95e- 67 Logit
That almost everything significant in M1 is also significant in M2 is beside the point. That also glosses over an interesting change in the discernibility of the bornagain variable against a counterclaim of zero effect. Peeking ahead, the "divide by 4" rule for logistic regression suggests the linear probability model may be understating the actual effect of some variables, like ideology and partisanship. Ultimately: hunting for statistical significance is no justification for taking the linear model over one better suited to modeling a binary dependent variable.

I think the biggest hangup for beginners and some practitioners is that OLS communicates what you want to know in a more straightforward way: a coefficient communicates an estimated change in the value of y for a one-unit change in x. However, the logistic regression returns coefficients as something called "logits" (i.e. natural logged odds). How might you get more usable information from this model?
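Before walking through the options, here is a quick refresher on what a logit is, using base R (the particular logit values are arbitrary, just for illustration):

# A logit of 0 is even odds (1) and a probability of .5.
# Positive logits imply p > .5; negative logits imply p < .5.
logits <- c(-2, 0, 2)
exp(logits)            # odds
plogis(logits)         # probabilities
qlogis(plogis(logits)) # and back to the logits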
I'll save the "divide by 4" rule for second, but I want to unpack the logistic regression model on its own terms here. Let's start with the first thing you can do.

The intuition here is simple: you have a logit and you want a probability, so take the inverse logit to get the probability you want. My only misgiving with this is that it's not so simple as monkeying with the coefficient. You'll need the intercept as well, along with anything else in your model. Thus, you should resist the urge to do this manually with more complicated models that have multiple covariates. You will make errors. I'll show you how to do this in the "making most of regression" lecture.
Nevertheless, let’s do a simple exercise here with just one covariate: the college education variable. FWIW, I selected just this one covariate because it too is binary so the calculation is simple. Further, the coefficient is not too far removed from what it is in the fully specified model.
M3 <- glm(votetrump ~ collegeed, data=Penn, na.action = na.exclude,
          family = binomial(link="logit"))
summary(M3)
##
## Call:
## glm(formula = votetrump ~ collegeed, family = binomial(link = "logit"),
## data = Penn, na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.338 -1.338 1.025 1.025 1.373
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.37029 0.05451 6.793 1.1e-11 ***
## collegeed -0.81812 0.09339 -8.760 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2940.3 on 2123 degrees of freedom
## Residual deviance: 2861.8 on 2122 degrees of freedom
## (777 observations deleted due to missingness)
## AIC: 2865.8
##
## Number of Fisher Scoring iterations: 4
Let’s tidy up our model now, extracting the information we want.
broom::tidy(M3) -> tidyM3
interceptM3 <- tidyM3[1, 2] %>% pull()
coefM3 <- tidyM3[2, 2] %>% pull()
^ Btw, this is why I'm telling you to resist the urge to do this manually for full models, ESPECIALLY when you don't have naturally occurring zeroes in most of the parameters.
Now, let's calculate it manually. What is the estimated difference in probability of voting for Trump for a white Pennsylvanian with a college education versus a white Pennsylvanian without a college education? The plogis() function comes to the rescue here.
plogis((interceptM3 + coefM3)) - plogis(interceptM3)
## [1] -0.2016522
Interpretation: having a college education decreases the probability of voting for Trump among Pennsylvania whites by about 20 percentage points. One caveat up front: resist the temptation to just take the inverse logit of the coefficient by itself.
plogis(coefM3)
## [1] 0.3061629
That would be wrong.
Consider what I just offered a tentative estimate, too. Truly, especially for GLMs, quantities of interest are what you need to more fully communicate the effects you want. We'll get to that in a few weeks.
I love parlor tricks in statistics and I think this is a fantastic one. To the best of my knowledge, we have Gelman and Hill (2006, 82) to thank for this. Understanding it requires knowing a little bit about the logistic curve. You've seen this before; you just might have called it the "S" curve.
scurve <- function(x){
  y <- exp(x) / (1 + exp(x))
  return(y)
}
tibble(x = seq(-6, 6)) %>%
ggplot(.,aes(x)) +
stat_function(fun = scurve)
Observe that the logistic curve is steepest in the middle/center, where the linear predictor equals 0. That means the slope of the curve (i.e. the derivative of the logistic function) is maximized at that point, where it takes the value Be^0/(1 + e^0)^2. That, incidentally, resolves to B/4. Thus, B/4 (or: your regression coefficient, divided by 4) is the estimated maximum difference in p(y = 1) for a one-unit change in x. It's an upper bound of the predictive difference, albeit one inapplicable to the constant.
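Here is a quick numerical sanity check on that claim, using the collegeed logit from M3. The derivative of the inverse logit at the center of the curve is e^0/(1 + e^0)^2 = 1/4, so scaling it by the coefficient is the same as dividing the coefficient by 4.

# slope of the logistic curve at its center, scaled by the coefficient...
coefM3 * exp(0)/(1 + exp(0))^2
# ...which is just the coefficient divided by 4
coefM3/4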
For what it's worth, I think this is an amazing rule of thumb.
plogis((interceptM3 + coefM3)) - plogis(interceptM3)
## [1] -0.2016522
coefM3/4
## [1] -0.20453
It's not perfect, but it's really good. We can apply this to the full model (M2) as well.
broom::tidy(M2) %>%
  select(1:3) %>%
  mutate(db4 = estimate/4) %>%
  slice(-1) # does not apply to constant/intercept
## # A tibble: 7 x 4
## term estimate std.error db4
## <chr> <dbl> <dbl> <dbl>
## 1 age 0.00986 0.00490 0.00247
## 2 female -0.170 0.148 -0.0425
## 3 collegeed -0.930 0.169 -0.232
## 4 famincr -0.0251 0.0264 -0.00629
## 5 pid7na 0.706 0.0409 0.176
## 6 ideo 0.931 0.0980 0.233
## 7 bornagain 0.311 0.181 0.0777
This is obviously not a place to stop in terms of communicating your logistic regression coefficients. However, I think it’s a great place to start. Truthfully, when I evaluate my own logistic regression models, this is typically the second thing I’m looking at (after evaluating statistical significance).
Odds ratios may also be useful ways of summarizing a logistic regression. I encourage you to get flexible with logits, but odds ratios have an advantage over probabilities because they can scale up infinitely. Probabilities are bounded on both sides (at 0 and 1). Odds and odds ratios are only bounded below (at 0). Odds/odds ratios less than 1 communicate negative effects.

FWIW, I wince every time I see logistic regression models communicated with odds ratios. Namely, if you're arguing for a negative effect, the odds ratio kind of obscures that because negative effects get squeezed between 0 and 1. I prefer to communicate coefficients on their original scale and encourage you to get comfortable with the same.
Yet, here’s how you could do it. Just exponentiate what the logistic regression model gives you. Let’s also add confidence intervals for funsies.
broom::tidy(M2) %>%
  bind_cols(., as_tibble(confint(M2))) %>%
  mutate_at(vars(2:ncol(.)), exp)
## # A tibble: 8 x 7
## term estimate std.error statistic p.value `2.5 %` `97.5 %`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.00369 1.56 3.70e-6 1 0.00150 0.00871
## 2 age 1.01 1.00 7.49e+0 1.04 1.00 1.02
## 3 female 0.844 1.16 3.17e-1 1.29 0.631 1.13
## 4 collegeed 0.395 1.18 4.13e-3 1.00 0.282 0.549
## 5 famincr 0.975 1.03 3.86e-1 1.41 0.926 1.03
## 6 pid7na 2.03 1.04 3.12e+7 1 1.87 2.20
## 7 ideo 2.54 1.10 1.34e+4 1 2.10 3.08
## 8 bornagain 1.36 1.20 5.57e+0 1.09 0.958 1.95
The point is well-taken that my code is clumsily exponentiating the standard errors, test statistics, and p-values as well. You should not do that. I'm only doing this to show you can exponentiate the logit as an odds ratio, if that is more flexible and intuitive for you. For me, it's not; I've always found logits more intuitive than odds ratios. I encourage getting more comfortable with logits too.
There are various tools for model checking the logistic regression. The application differs from OLS but the approaches should seem familiar.
First, you can get a binned residual plot. With OLS, you compare the fitted values and residuals with a scatterplot. With logistic regression, you get something similar, but recall the residuals are discrete because the dependent variable is discrete.
To start, let's get the fitted values and residuals, both as type "response" (i.e. probabilities of a 1).
Penn %>%
  mutate(fitted = predict(M2, type="response"),
         resid = residuals(M2, type = "response")) -> Penn
Gelman and Hill have a rule of thumb for determining the number of bins. For larger data sets (n >= 100, which we have here), the number of bins is the (rounded down, i.e. "floored") square root of the number of observations in the model. Then, across that number of bins, we look for the corresponding observation locations in the sorted fitted values. The breaks that emerge will coincide with those fitted values.
nbins <- floor(sqrt(nobs(M2)))
breaks.index <- floor(length(M2$fitted.values) * (1:(nbins - 1)) / nbins)
breaks <- unique(c(-Inf, sort(M2$fitted.values)[breaks.index], Inf))
In the interest of full disclosure: I need to think about how I want to make this more flexible. This is a Steve problem for future classes, but this will converge on what arm::binnedplot() will give you. There's also parameters::binned_residuals(), though that has additional package requirements (i.e. {parameters}, {performance}, {easystats}, and {see}, among, I'm sure, others). My goal here is to get you to install as few packages as possible, even if the code that emerges can be a bit convoluted.
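If you do have {arm} installed, the commented-out one-liner below should give roughly the plot we build by hand in what follows; it is offered only as a pointer, not something this lab depends on.

# requires {arm}, which is not loaded in this lab
# arm::binnedplot(predict(M2, type = "response"),
#                 residuals(M2, type = "response"))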
Okay then, let’s create some bins.
Penn %>% select(fitted, resid) %>% na.omit %>%
  mutate(bin = as.numeric(cut(M2$fitted.values, breaks))) %>%
  group_by(bin) %>%
  summarize(min = min(fitted, na.rm=T),
            max = max(fitted, na.rm=T),
            n = n(),
            meanfit = mean(fitted, na.rm=T),
            meanresid = mean(resid, na.rm=T),
            sdresid = sd(resid, na.rm=T),
            se = 1.96*sdresid/sqrt(n),
            inbounds = ifelse(between(meanresid, -se, se), 1, 0)) -> Bins
Let’s check to see where we’re out of bounds. I’ve always been told the ideal interpretation here is that you want 95% of your bins “in bounds.” We won’t have that here, it seems. Let’s see where:
Bins %>%
  filter(inbounds == 0)
## # A tibble: 10 x 9
## bin min max n meanfit meanresid sdresid se inbounds
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.00627 0.0100 43 0.00796 -0.00796 0.00104 0.000312 0
## 2 2 0.0100 0.0187 43 0.0158 -0.0158 0.00278 0.000830 0
## 3 3 0.0188 0.0229 45 0.0209 -0.0209 0.00116 0.000340 0
## 4 8 0.0567 0.0658 43 0.0613 -0.0613 0.00296 0.000885 0
## 5 20 0.411 0.486 44 0.444 -0.194 0.435 0.129 0
## 6 23 0.609 0.671 44 0.640 -0.185 0.502 0.148 0
## 7 25 0.724 0.757 43 0.737 0.170 0.294 0.0878 0
## 8 38 0.964 0.969 43 0.966 0.0336 0.00131 0.000391 0
## 9 39 0.969 0.975 43 0.972 0.0278 0.00181 0.000542 0
## 10 41 0.981 0.988 43 0.984 0.0155 0.00258 0.000770 0
It does look like our model isn't a great fit at the tail ends of the distribution, where the probabilities are touching 0 or 1. We have a few out-of-bounds bins in the middle as well. FWIW, don't fret the bins that close to the bounds of 0 and 1. Not ideal, but not terribly problematic.
Importantly, you could plot them as well. This will be about what arm::binnedplot() is doing.
Bins %>%
  ggplot(.,aes(meanfit, meanresid)) + geom_point() +
  geom_smooth() +
  geom_ribbon(aes(ymin = -Inf, ymax = -se), alpha = .1 , fill = "red") +
  geom_ribbon(aes(ymin = se, ymax = Inf), alpha = .1 , fill = "red") +
  geom_line(aes(y = se), colour = "grey70") +
  geom_line(aes(y = -se), colour = "grey70") +
  geom_hline(yintercept = 0, linetype = "dashed")
When you see a plot like this, check for a pattern. There really isn't much of one, though there does seem to be some bowing in the middle. I don't see a huge problem here, though I do see a model that could use some further explanation and consideration. Basically, if you're worried your model might not be the best fit given what you see from this plot, think more about your data and your model specification. I did not do much of that here. You should.
You can evaluate a logistic regression model by checking for predictive accuracy. You will see this done a lot in “machine learning” applications. Herein, the model is “checked” not for fit, per se, but for “validation.”
First, notice how our fitted values are predicted probabilities of observing a 1? Let's create two summaries to see how well our model is understanding the Trump vote. In the first case, naive is just a naive error rate that comes from knowing the mode/median. We know there are more 1s than 0s in the data. Indeed, the mean is about .52 (which, for binary variables, communicates the proportion of 1s). If we just picked 1 for all respondents, knowing nothing else, we'd be right about 52% of the time. We'd be wrong about 48% of the time.

In the second case, errorrate is the proportion of cases where we predicted a Trump vote, given the model, but were wrong (or the times we predicted the Pennsylvania white wouldn't vote for Trump, but they did). Do note: our cutoff for making a guess from the model output here is .5. Machine learning people I know say you can (and perhaps should) adjust that threshold when validating a model. Here, we'll just use .5.
Penn %>%
  select(votetrump, fitted) %>%
  summarize(naive = 1 - mean(votetrump, na.rm=T),
            errorrate = mean((fitted > .5 & votetrump == 0) | (fitted < .5 & votetrump == 1), na.rm=T))
## # A tibble: 1 x 2
## naive errorrate
## <dbl> <dbl>
## 1 0.478 0.150
As you evaluate this kind of output, keep the following in mind. You should want your error rate to be smaller than .5, much smaller even, and certainly smaller than the naive error rate. If your error rate were worse than that, you could simply set all the betas to 0, guess the modal category, and get a better-fitting model. Here, I think we're doing pretty damn good, all things considered. We're correctly predicting about 85% of the cases.
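If you want to see how sensitive that error rate is to the .5 cutoff, here is a minimal sketch that recomputes it at a few alternative thresholds (the .4 and .6 values are arbitrary, just for illustration):

# error rates at a few cutoffs; .5 reproduces the number above
sapply(c(.4, .5, .6), function(cutoff) {
  with(Penn, mean((fitted > cutoff & votetrump == 0) |
                    (fitted < cutoff & votetrump == 1), na.rm = TRUE))
})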
Recall: deviance is a measure of error and replaces R2 as a measure of model fit in a GLM like a logistic regression. It’s analogous to the residual standard deviation. Lower deviance is better: it means your fitted values better “fit” to the observed values. Much like R-square, though, it decreases the more ingredients you throw in the proverbial broth. You could overfit a model, which you shouldn’t.
One way of thinking about this: if you added a variable that was 100% random noise, the deviance would decrease by about 1, on average. If you added a variable that was informative/useful, the deviance would decrease by more than 1. This is additive: adding k predictors to a model should decrease the deviance by k (or more).
Let’s illustrate this. First, let’s add a completely uninformative variable.
set.seed(8675309) # Jenny, I got your number...
Penn %>%
  mutate(noise = rnorm(n())) -> Penn

M4 <- update(M2, . ~ . + noise)
summary(M4)
##
## Call:
## glm(formula = votetrump ~ age + female + collegeed + famincr +
## pid7na + ideo + bornagain + noise, family = binomial(link = "logit"),
## data = Penn, na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9851 -0.4645 0.1443 0.4663 2.7295
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.602965 0.447901 -12.509 < 2e-16 ***
## age 0.009866 0.004898 2.014 0.044 *
## female -0.169831 0.148028 -1.147 0.251
## collegeed -0.932223 0.169803 -5.490 4.02e-08 ***
## famincr -0.024928 0.026460 -0.942 0.346
## pid7na 0.705630 0.040907 17.249 < 2e-16 ***
## ideo 0.931063 0.097985 9.502 < 2e-16 ***
## bornagain 0.311951 0.181092 1.723 0.085 .
## noise 0.014943 0.070290 0.213 0.832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2522.4 on 1820 degrees of freedom
## Residual deviance: 1274.4 on 1812 degrees of freedom
## (1080 observations deleted due to missingness)
## AIC: 1292.4
##
## Number of Fisher Scoring iterations: 5
^ Notice: the noise variable adds nothing of value to the model. How much did we reduce the deviance in the model by doing this?
M2$deviance - M4$deviance
## [1] 0.04520731
The difference is less than 1. It’s an almost zero reduction in the deviance. This is unsurprising: the variable is noise.
Let’s do something a bit more substantive. What if we elected to not model partisanship? We know from the literature that partisan identity is the largest driver of a partisan vote in the United States. Self-identified Republicans vote for Republican candidates. The stronger the attachment, the greater the likelihood of a partisan vote.
So let's set this question up as an empirical one: how much worse does the model do if we exclude partisanship? Let's see. For one, because I've been playing fast and loose with passing over missing data, let's standardize the data sets to drop NAs. The ANOVA and likelihood ratio test will complain if I don't do this.
Penn %>% select(votetrump, age, female, collegeed, famincr, ideo, pid7na, bornagain) %>%
  na.omit -> Penn
# Re-run M2
M2 <- glm(votetrump ~ age + female + collegeed + famincr +
            pid7na + ideo + bornagain,
          data = Penn, na.action=na.exclude,
          family = binomial(link="logit"))
Now, here’s a simple update of the model to take out partisanship.
M5 <- update(M2, . ~ . - pid7na)
summary(M5)
##
## Call:
## glm(formula = votetrump ~ age + female + collegeed + famincr +
## ideo + bornagain, family = binomial(link = "logit"), data = Penn,
## na.action = na.exclude)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8640 -0.6572 0.2089 0.6327 2.6728
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.974603 0.389850 -12.760 < 2e-16 ***
## age -0.001498 0.004138 -0.362 0.7174
## female -0.252876 0.125632 -2.013 0.0441 *
## collegeed -0.856604 0.142582 -6.008 1.88e-09 ***
## famincr 0.014652 0.022594 0.648 0.5167
## ideo 1.722047 0.088898 19.371 < 2e-16 ***
## bornagain 0.387208 0.154830 2.501 0.0124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2522.4 on 1820 degrees of freedom
## Residual deviance: 1654.2 on 1814 degrees of freedom
## AIC: 1668.2
##
## Number of Fisher Scoring iterations: 5
Let’s compare the deviances.
M5$deviance - M2$deviance
## [1] 379.7871
This is quite the change in deviance! How might an ANOVA summarize this?
anova(M5, M2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: votetrump ~ age + female + collegeed + famincr + ideo + bornagain
## Model 2: votetrump ~ age + female + collegeed + famincr + pid7na + ideo +
## bornagain
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1814 1654.2
## 2 1813 1274.4 1 379.79 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It’s suggesting what you should know by now. Partisanship matters a whole lot in understanding partisan votes in elections, especially for president. Your model will perform worse if you don’t include a partisanship measure. That’s what the ANOVA is telling you here.
A likelihood ratio test will tell you the same thing, btw, with the exact same statistic. This comes by way of the {lmtest} package.
lrtest(M5, M2)
## Likelihood ratio test
##
## Model 1: votetrump ~ age + female + collegeed + famincr + ideo + bornagain
## Model 2: votetrump ~ age + female + collegeed + famincr + pid7na + ideo +
## bornagain
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 7 -827.12
## 2 8 -637.22 1 379.79 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The major thing to note here is that this is a likelihood ratio test, whereas I had previously been discussing the deviance. Incidentally, multiplying the log likelihood of a GLM like this by -2 returns the deviance. If you ever see, anywhere in your travels, a statistic with a shorthand of "-2LL", that's what that is. It's also communicating the deviance.
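You can verify that relationship for this model directly; this is just a check, not anything new.

-2*as.numeric(logLik(M2))
M2$deviance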
FWIW, few people (in my experience) ask to see an ANOVA. You can also pitch model comparison here as an error-rate comparison.
Penn %>%
  mutate(fittedM2 = fitted(M2, type="response")) %>%
  mutate(fittedM5 = fitted(M5, type="response")) %>%
  select(votetrump, fittedM2, fittedM5) %>%
  summarize(meandv = 1 - mean(votetrump, na.rm=T),
            errorrateM2 = mean((fittedM2 > .5 & votetrump == 0) | (fittedM2 < .5 & votetrump == 1), na.rm=T),
            errorrateM5 = mean((fittedM5 > .5 & votetrump == 0) | (fittedM5 < .5 & votetrump == 1), na.rm=T))
## # A tibble: 1 x 3
## meandv errorrateM2 errorrateM5
## <dbl> <dbl> <dbl>
## 1 0.483 0.150 0.221
Implications: M2 is a better fit than M5 deviance-wise, which we established in the likelihood ratio test and the ANOVA. In terms of predictive accuracy, M2 is preferable to M5 as well. The error rate in the full model (with partisanship) is .15. The error rate without partisanship increases to .221.