Abstract
This is a lab script for POST 8000, a graduate-level quantitative methods for public policy class that I teach at Clemson University. It will not be the most sophisticated R-related write-up of mine—check my blog for those—but it should be useful for discussion around the associated R script for the week’s ‘lab’ session.I’m pretty sure we’ve yet to install the {ordinal}
package, which I think is the best R package for handling ordinal models of all walks. The {MASS}
package has a polr()
function, and lots of other goodies, but a few things in it conflict with my preferred workflow. Plus, I think {ordinal}
just has more goodies for ordinal models. I’ve already installed it, but let’s wrap it in a simple function that will install it on your end if you’ve yet to install it. As always, {tidyverse}
has our main workflow functions and {stevedata}
has our data.
<- function(packages) {
if_not_install <- packages[!(packages %in% installed.packages()[,"Package"])]
new_pack if(length(new_pack)) install.packages(new_pack)
}
if_not_install(c("ordinal", "tidyverse", "stevedata"))
Let’s load the packages now.
library(tidyverse, quietly = TRUE)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stevedata)
library(ordinal) # Notice the function conflict here. The price of doing business
##
## Attaching package: 'ordinal'
## The following object is masked from 'package:dplyr':
##
## slice
Let’s revisit an old data frame from earlier in the semester. You’ve seen these before. If you don’t remember, they’re attitudes recorded in the 2018 General Social Survey about attitudes toward various spending programs in the United States. The bulk of the spending items are coded such that -1 = the respondent thinks we’re spending too much on this particular topic, 0 = the respondent thinks the U.S. is spending “about (the) right” amount, and 1 = the responding thinks the country is spending too little on the topic. Conceptually, I think of these items as communicating attitudes about support for more spending on an ordered categorical scale. Higher values = a respondent implicitly thinking the U.S. should spend more on these topics.
gss_spending
## # A tibble: 2,348 x 33
## year id age sex educ degree race rincom16 partyid polviews xnorcsiz
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2018 1 43 0 14 2 1 NA 5 6 6
## 2 2018 2 74 1 10 1 1 NA 2 NA 6
## 3 2018 3 42 0 16 3 1 22 4 5 6
## 4 2018 4 63 1 16 3 1 23 2 4 6
## 5 2018 5 71 0 18 4 2 NA 6 7 6
## 6 2018 6 67 1 16 3 1 NA 2 3 6
## 7 2018 7 59 1 13 1 2 12 0 4 2
## 8 2018 8 43 0 12 1 1 17 5 5 2
## 9 2018 9 62 1 8 0 1 2 3 4 2
## 10 2018 10 55 0 12 1 1 22 1 NA 4
## # … with 2,338 more rows, and 22 more variables: news <dbl>, wrkstat <dbl>,
## # natspac <dbl>, natenvir <dbl>, natheal <dbl>, natcity <dbl>,
## # natcrime <dbl>, natdrug <dbl>, nateduc <dbl>, natrace <dbl>, natarms <dbl>,
## # nataid <dbl>, natfare <dbl>, natroad <dbl>, natsoc <dbl>, natmass <dbl>,
## # natpark <dbl>, natchld <dbl>, natsci <dbl>, natenrgy <dbl>, sumnat <dbl>,
## # sumnatsoc <dbl>
?gss_spending
Last year’s script, in the interest of full disclosure, did this kind of stream of consciousness. So, I already know what kind of weirdness I can expect from these models. No matter, let’s plow forward. First, I want to do some recoding here. Let’s create a simple dummy variable that recodes the degree
variable into a new one for those with a four-year college diploma. Let’s make sure we drop the other party supporters from the partisanship (pid7
) variable. Finally, and this is important, let’s declare some of these spending prompts to be ordered-categorical variables with the ordered()
function that comes in base R. We’ll focus on two prompts—attitudes toward spending on welfare (natfare
) and attitudes about spending on social security (natsoc
)—and 1) coerce both to be ordered-categorical variables and 2) add them together and coerce that to be an ordered-categorical variable. Basically, the {ordinal}
functions require a dependent variable that is explicitly declared beforehand as an ordered factor. If you don’t do this, the errors you get from the {ordinal}
function won’t quite point you in this direction.
%>%
gss_spending mutate(collegeed = ifelse(degree >= 3, 1, 0),
pid7 = ifelse(partyid == 7, NA, partyid),
natfaresoc = natsoc + natfare,
natfare_f = ordered(natfare),
natsoc_f = ordered(natsoc),
natfaresoc_f = ordered(natfaresoc)) -> gss_spending
Let’s assume we want to model attitudes toward welfare spending among white people as a function of these things: age, sex (whether respondent is a woman), college education, income, partisanship (D to R), and ideology (L to C). You’d do this with the clm()
function in the {ordinal}
package even as the syntax you see looks like every other regression model you’d estimate in R.
<- clm(natfare_f ~ age + sex + collegeed + rincom16 +
M1 + polviews, data=subset(gss_spending, race == 1))
pid7
summary(M1)
## formula: natfare_f ~ age + sex + collegeed + rincom16 + pid7 + polviews
## data: subset(gss_spending, race == 1)
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 880 -893.41 1802.82 4(0) 6.98e-10 1.3e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## age 0.001929 0.004665 0.414 0.679205
## sex -0.393465 0.133582 -2.946 0.003224 **
## collegeed -0.152845 0.143269 -1.067 0.286047
## rincom16 0.001445 0.011849 0.122 0.902927
## pid7 -0.152440 0.046076 -3.308 0.000938 ***
## polviews -0.269269 0.061800 -4.357 1.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## -1|0 -2.8745 0.3345 -8.593
## 0|1 -1.4221 0.3230 -4.403
## (813 observations deleted due to missingness)
Remember: interpreting coefficients in your “first step” is functionally identical to what you’d do from an OLS model. That is, you’re looking for sign of coefficients and statistical significance. Here would be the preliminary takeaways, none of which are particularly surprising beyond the gender effect. In these data, the statistically significant effects are for women, ideology, and partisanship and all three are negative. Women are less likely than me to think about spending more on welfare. The ideology and partisanship effects are unsurprising if you have at least a cursory understanding on American politics.
I tend to use very general language on coefficient interpretation for ordinal models, but if you want something more exact, here it is. Observe the coefficient for polviews
is ~-.269, which, you’ll recall, is a logit. Thus, the natural logged odds of observing a 1 versus a 0 or -1 decreases by about -.269 for a unit increase in the polviews
variable. Related: the natural logged odds of observing a 0 versus a -1 decreases by about -.269 for a unit increase in the polviews
variable.
I’m generally loathe to talk about these things. They’re not typically parameters of interest for how you’re probably using an ordinal model. However, you’ll want to provide them anyway. These thresholds or “cut points” are natural logged odds between two variables. So, in this case: the “coefficient” reading -1|0 is the natural logged odds of being a -1 versus a 0 or 1. The “coefficient” reading 0|1 is the natural logged odds of being a -1 or 0 versus a 1. The “|” is kind of misleading, especially if you’re used to it as a strict logical operator. In this case, the “|” is like a cumulative cut point, or a way of saying it is.
Let’s talk a bit about what’s happening here. We call ordinal logistic regression an extension of (binary) logistic regression because:
However, we are assuming the lines are in parallel to each other, separated by the thresholds. So, in this case, think of this model as kind of like two logistic regressions, each with identical betas. logit(p(y == -1)) = -2.87 + B*X
and logit(p(y <= 0)) = -1.4221 + B*X
.
You should, at least in spirit, care about the proportional odds assumption that the slopes are the same at every level. There are any number of ways of testing this and I really wish there was a Brant test add-on for the {ordinal}
package. There isn’t (i.e. it’s there for the polr()
function in {MASS}
, which I eschewed here).
Instead, you can do a nominal test, which is the {ordinal}
package’s way of saying “likelihood ratio test.” Think of this as a test of the hypothesis that relaxing the proportional odds (PO) assumption of parallel lines across all levels of the response provides a better model fit. If the p < .05, you reject the hypothesis that relaxing the PO assumption does not improve model fit. In other words, one or more of the covariates may have non-constant effects at all levels.
nominal_test(M1)
## Tests of nominal effects
##
## formula: natfare_f ~ age + sex + collegeed + rincom16 + pid7 + polviews
## Df logLik AIC LRT Pr(>Chi)
## <none> -893.41 1802.8
## age 1 -891.27 1800.5 4.2757 0.03866 *
## sex 1 -891.37 1800.8 4.0733 0.04356 *
## collegeed 1 -892.02 1802.0 2.7897 0.09487 .
## rincom16 1 -893.05 1804.1 0.7190 0.39647
## pid7 1 -892.87 1803.7 1.0793 0.29886
## polviews 1 -891.58 1801.2 3.6647 0.05558 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can interpret the above output in a few ways:
clm()
function. Here, we’ll do it just for age and sex.<- clm(natfare_f ~ collegeed + rincom16 + pid7 + polviews, nominal = ~ age + sex, data=subset(gss_spending, race == 1))
M2 summary(M2) # Notice there's no single coefficient for age and sex. It's in the intercepts/thresholds.
## formula: natfare_f ~ collegeed + rincom16 + pid7 + polviews
## nominal: ~age + sex
## data: subset(gss_spending, race == 1)
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 880 -888.92 1797.83 4(0) 7.46e-07 1.9e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## collegeed -0.1446243 0.1431628 -1.010 0.312395
## rincom16 0.0008882 0.0118489 0.075 0.940248
## pid7 -0.1537127 0.0459901 -3.342 0.000831 ***
## polviews -0.2671823 0.0618067 -4.323 1.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## -1|0.(Intercept) -3.352961 0.383177 -8.750
## 0|1.(Intercept) -1.145077 0.338823 -3.380
## -1|0.age 0.005662 0.005756 0.984
## 0|1.age -0.006709 0.005156 -1.301
## -1|0.sex 0.605863 0.165808 3.654
## 0|1.sex 0.267103 0.146271 1.826
## (813 observations deleted due to missingness)
nominal_test(M2)
## Tests of nominal effects
##
## formula: natfare_f ~ collegeed + rincom16 + pid7 + polviews
## nominal: ~age + sex
## Df logLik AIC LRT Pr(>Chi)
## <none> -888.92 1797.8
## collegeed 1 -886.79 1795.6 4.2600 0.03902 *
## rincom16 1 -888.66 1799.3 0.5109 0.47475
## pid7 1 -888.46 1798.9 0.9208 0.33726
## polviews 1 -887.38 1796.8 3.0642 0.08003 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now, however, the nominal test is complaining about the college education variable. At this point, I’m probably just going to throw up my hands and say “multinomial model” for this variable. But, as I mentioned at the top of the script, I wrote this kind of stream of consciousness and I suspect it’s the few response categories we have that’s causing this issue. So, let’s take out that natfare_f
variable and add in the prompt that sums both natfare
and natsoc
together. This creates a five-item variable where -2 = those who think we’re spending too much on both welfare and social security and 2 = those who think we’re spending too little on both. -1, 0, and 1 are also both possible.
Here’d be the breakdown for those.
%>%
gss_spending filter(race == 1) %>%
distinct(natfaresoc, natfare, natsoc) %>%
%>%
na.omit arrange(natfaresoc)
## # A tibble: 9 x 3
## natfare natsoc natfaresoc
## <dbl> <dbl> <dbl>
## 1 -1 -1 -2
## 2 0 -1 -1
## 3 -1 0 -1
## 4 1 -1 0
## 5 -1 1 0
## 6 0 0 0
## 7 0 1 1
## 8 1 0 1
## 9 1 1 2
Now let’s try this again with this new dependent variable that amounts to an index sentiment on whether the respondent thinks the U.S. needs to spend more on social welfare programs, here gauged by prompts on welfare and social security.
# Let's try this again
<- clm(natfaresoc_f ~ age + sex + collegeed + rincom16 + pid7 + polviews, data=subset(gss_spending, race == 1))
M3 summary(M3)
## formula: natfaresoc_f ~ age + sex + collegeed + rincom16 + pid7 + polviews
## data: subset(gss_spending, race == 1)
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 843 -1118.93 2257.87 6(0) 9.88e-13 2.9e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## age 0.004992 0.004541 1.099 0.271632
## sex -0.032170 0.130962 -0.246 0.805960
## collegeed -0.578248 0.140388 -4.119 3.81e-05 ***
## rincom16 0.004499 0.011536 0.390 0.696563
## pid7 -0.156050 0.044665 -3.494 0.000476 ***
## polviews -0.230926 0.059912 -3.854 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## -2|-1 -5.3988 0.4050 -13.330
## -1|0 -3.5578 0.3410 -10.433
## 0|1 -1.7119 0.3199 -5.351
## 1|2 -0.4677 0.3153 -1.483
## (850 observations deleted due to missingness)
We do see there’s no longer a significant gender difference, but the college education variable emerges as negative and statistically significant. The ideology and partisanship variables are the same, basically. More values in the dependent variable mean there are more thresholds through we must sift. However, we’re here for the nominal test. Did we pass?
nominal_test(M3)
## Tests of nominal effects
##
## formula: natfaresoc_f ~ age + sex + collegeed + rincom16 + pid7 + polviews
## Df logLik AIC LRT Pr(>Chi)
## <none> -1118.9 2257.9
## age 3 -1117.9 2261.7 2.1365 0.5446
## sex 3 -1116.7 2259.3 4.5521 0.2077
## collegeed 3 -1116.4 2258.8 5.0680 0.1669
## rincom16 3 -1118.3 2262.6 1.2305 0.7457
## pid7 3 -1118.5 2262.9 0.9151 0.8218
## polviews 3 -1116.0 2258.0 5.8904 0.1171
Much betta.
🔥 #take coming up: I’m of the mentality you should always run an ordinal logistic regression if that’s the DV you’re handed. I will throw something at you if you try running an OLS on a five-item Likert because that’s just not the data you have. But I kind of hate them, and I would forgive you for hating them too, because communicating them is a chore. OLS has a straightforward interpretation. Binary DVs are really straightforward as well. However, the PO assumption can be restrictive and there are a lot of moving pieces from the model output. Your audience may not have the appetite for it.
In other words, be prepared to communicate your statistical model graphically and impose your will on a somewhat unruly model accordingly.
In the {ordinal}
package, you can do this with the predict()
function and think about using it with hypothetical data. For example, let’s create a simple data frame that has all our right-hand side values at their typical values. But, we’ll allow partisanship to vary across three types. These will be the strong Democrats (pid7 == 0
), the pure independents who say they don’t lean one way or the other (pid7 == 3
), and the strong Republicans (pid7 == 6
).
<- tibble(age = median(gss_spending$age, na.rm=T),
newdat collegeed = 0,
sex = 0,
pid7 = c(0, 3, 6),
polviews = median(gss_spending$polviews, na.rm=T),
rincom16 = median(gss_spending$rincom16, na.rm=T))
# who dis newdat
## # A tibble: 3 x 6
## age collegeed sex pid7 polviews rincom16
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 48 0 0 0 4 17
## 2 48 0 0 3 4 17
## 3 48 0 0 6 4 17
Thus, think of three types of people: a typical-aged man of average income, without a college diploma, and who says they’re ideologically moderate. They’re identical, but for their partisanship. One is a strong Democrat, another an independent, and the last a Republican. We want to know what the effect of increasing partisanship “looks like” for these three people across the handful of different responses recorded in the dependent variable. For simplicity’s sake, we’re going to focus on that first model that looked at just attitudes about welfare, even acknowledging the model wasn’t a super great fit for the data.
You’ve been warned: this code is convoluted as hell. It’s why I prefer Bayes for ordinal models, but Bayes is in two weeks.
# Oh god, here we go...
predict(M1, newdata = newdat, se.fit=T) %>% # get predictions with standard errors.
# This is a list of two matrices
# Let's coerce it to two data frames while also begrudging that I have to do this.
map(~as.data.frame(.)) %>% # god purrr is awesome
# There's a hiden rowname in here. It's going to somewhat coincide with the values of pid7
# Let's extract it
map(~rownames_to_column(.)) %>%
# Now let's make these two data frames into one data frame.
# Importantly, obj is going to tell me whether it's a prediction or a standard error around the prediction
map2_df(names(.), ~mutate(.x,obj=.y)) %>%
# alrightie... okay. See that rowname variable? I know that's the pid7 values of 0, 3, and 6.
# However, the clm predict doesn't save those. Let's tell them for what they are.
rename(pid7 = rowname) %>%
# It also delightfully thinks it's a character. So, let's humor it and overwrite it.
mutate(pid7 = rep(c("Strong Democrat", "Independent", "Strong Republican"), 2),
# Make it a factor in order it appears. You'll thank me later for this.
pid7 = forcats::fct_inorder(pid7)) %>%
# okay, tidyr::gather() is going to have to do some heavy lifting here.
gather(var, val, -pid7, -obj) %>%
# Importantly, I needed this longer because I want my -1, 0, and 1s (as responses) to be "long."
# so, now this made it "longer" while still giving me a glimpse as to what's my fit and what's my se.fit
# See that's in the obj column? Let's group_split and bind_cols to get them next to each other
group_split(obj) %>%
bind_cols() %>%
# voila! I have everything I need now
# however, I'll need to rename things and focus on just what I want
rename(pid7 = `pid7...1`,
natfare = `var...3`,
fit = `val...4`,
se = `val...8`) %>%
select(pid7, natfare, fit, se) %>%
# Now, let's have some fun and create a column called upr and lwr creating bounds around the estimate
mutate(upr = fit + 1.96*se,
lwr = fit - 1.96*se) %>%
ggplot(.,aes(pid7, fit, ymax=upr, ymin=lwr)) +
geom_pointrange() +
# Oh god help me I never do anything the easy way...
facet_wrap(~natfare, labeller=labeller(natfare = c("-1" = "Spends Too Much",
"0" = "Spending About Right",
"1" = "Spending Too Little"))) +
labs(title = "Attitudes Toward Spending on Welfare, by Partisanship",
x = "Partisanship", y = "Predicted Probability of the Response (with 95% Intervals)",
caption = "Source: General Social Survey, 2018. Note: for pedagogical use in my grad methods class. Stay out of my mentions.",
subtitle = "Increasing GOP partisanship increases the likelihood of the spend too much or spend about right response, but decreases the likelihood of the\nspend too little response. You knew this.")
## New names:
## * pid7 -> pid7...1
## * obj -> obj...2
## * var -> var...3
## * val -> val...4
## * pid7 -> pid7...5
## * ...
^ Consider this a preview for the quantities of interest week, that’s coming up next. Basically: regression modeling is story-telling as well, in a way. You, the story-teller, just have more work to do with ordinal models, even as the ordinal model may faithfully capture the underlying distribution of the DV.
I want to give you an “out”, of a kind. The truth is OLS models are a better fit on ordered-categorical data than they are on dummy variables. What follows will touch on some of the readings you had this week (and even earlier in the semester) on whether you can treat your ordinal DV as continuous. Here’s my rule of thumb:
^ On the above point in the distribution of responses on a granular ordinal scale. Remember the bribe-taking prompt from the World Values Survey? This was the justifiability of taking a bribe on a 1-10 scale. It has 10 responses, but almost all of them are at 1. In other words, don’t treat that as interval below:
%>%
wvs_justifbribe group_by(f117) %>%
count() %>%
%>%
na.omit ggplot(.,aes(as.factor(f117), n)) +
geom_bar(stat="identity", alpha=0.8, color="black") +
scale_x_discrete(labels=c("Never Justifiable", "2", "3", "4",
"5", "6", "7", "8", "9", "Always Justifiable")) +
scale_y_continuous(labels = scales::comma) +
geom_text(aes(label=n), vjust=-.5, colour="black",
position=position_dodge(.9), size=4) +
labs(y = "Number of Observations in Particular Response",
x = "",
title = "The Justifiability of Taking a Bribe in the World Values Survey, 1981-2016",
caption = "Data: World Values Survey (1981-2016), via ?wvs_justifbribe in {stevedata}",
subtitle = "There are just 10 different responses in this variable with a huge right skew.")
You may not even want to think of it as ordinal. With noisy as hell data like this, as I mentioned in that session, you’ll probably just want to embrace the noisiness and estimate it as a binary DV of 1 versus not 1.
What about our dependent variable from model 3?
summary(M3)
## formula: natfaresoc_f ~ age + sex + collegeed + rincom16 + pid7 + polviews
## data: subset(gss_spending, race == 1)
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 843 -1118.93 2257.87 6(0) 9.88e-13 2.9e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## age 0.004992 0.004541 1.099 0.271632
## sex -0.032170 0.130962 -0.246 0.805960
## collegeed -0.578248 0.140388 -4.119 3.81e-05 ***
## rincom16 0.004499 0.011536 0.390 0.696563
## pid7 -0.156050 0.044665 -3.494 0.000476 ***
## polviews -0.230926 0.059912 -3.854 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## -2|-1 -5.3988 0.4050 -13.330
## -1|0 -3.5578 0.3410 -10.433
## 0|1 -1.7119 0.3199 -5.351
## 1|2 -0.4677 0.3153 -1.483
## (850 observations deleted due to missingness)
summary(M4 <- lm(natfaresoc ~ age + sex + collegeed + rincom16 + pid7 + polviews, data=subset(gss_spending, race == 1)))
##
## Call:
## lm(formula = natfaresoc ~ age + sex + collegeed + rincom16 +
## pid7 + polviews, data = subset(gss_spending, race == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2448 -0.7195 0.0350 0.8696 1.9523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5169444 0.1710779 8.867 < 2e-16 ***
## age 0.0024783 0.0024478 1.012 0.311621
## sex -0.0000742 0.0713339 -0.001 0.999170
## collegeed -0.3179332 0.0766935 -4.146 3.74e-05 ***
## rincom16 0.0025978 0.0062492 0.416 0.677737
## pid7 -0.0840277 0.0241406 -3.481 0.000526 ***
## polviews -0.1318277 0.0319444 -4.127 4.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9997 on 836 degrees of freedom
## (850 observations deleted due to missingness)
## Multiple R-squared: 0.1108, Adjusted R-squared: 0.1045
## F-statistic: 17.37 on 6 and 836 DF, p-value: < 2.2e-16
::tidy(M3) %>%
broomfilter(coef.type != "intercept") %>%
select(-coef.type) %>%
mutate(model = "Ordinal Logistic Regression") %>%
bind_rows(., broom::tidy(M4) %>% mutate(model = "OLS")) %>%
arrange(term)
## # A tibble: 13 x 6
## term estimate std.error statistic p.value model
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 1.52 0.171 8.87 4.48e-18 OLS
## 2 age 0.00499 0.00454 1.10 2.72e- 1 Ordinal Logistic Regress…
## 3 age 0.00248 0.00245 1.01 3.12e- 1 OLS
## 4 collegeed -0.578 0.140 -4.12 3.81e- 5 Ordinal Logistic Regress…
## 5 collegeed -0.318 0.0767 -4.15 3.74e- 5 OLS
## 6 pid7 -0.156 0.0447 -3.49 4.76e- 4 Ordinal Logistic Regress…
## 7 pid7 -0.0840 0.0241 -3.48 5.26e- 4 OLS
## 8 polviews -0.231 0.0599 -3.85 1.16e- 4 Ordinal Logistic Regress…
## 9 polviews -0.132 0.0319 -4.13 4.05e- 5 OLS
## 10 rincom16 0.00450 0.0115 0.390 6.97e- 1 Ordinal Logistic Regress…
## 11 rincom16 0.00260 0.00625 0.416 6.78e- 1 OLS
## 12 sex -0.0322 0.131 -0.246 8.06e- 1 Ordinal Logistic Regress…
## 13 sex -0.0000742 0.0713 -0.00104 9.99e- 1 OLS
^ off, technically wrong, but not the worst I’ve ever seen. In fact, those t/z statistics look very similar even as the underlying coefficients are being communicated on different scales. I’d still say to jettison OLS for the ordinal logistic regression here. Do it for the reasons I hinted at above. If you have just five responses in the DV, I’m probably going to want to know about the extremes and the middle. There are a lot of moving pieces in an ordinal model, but you can focus on just those responses that almost naturally present themselves in this setup. Those who advocating slapping an OLS sticker on all types insist it does well enough being BLUE. My retort to that is 1) I’m not convinced it’s doing that and 2) with so few response categories, OLS is going to struggle in an obvious way providing reasonable fitted values. Ordinal logistic regression is tailored for communicating probabilities (albeit with some work) for finite values. OLS can’t do that.
What about something bigger, like the sumnatsoc
variable in the gss_spending
data? Whereas Model 3 adds just the two prompts together, this sums all responses toward various “social” prompts about the environment, health, dealing with drug addiction, education, improving racial equality, welfare, roads, mass transit, parks, social security, and child care.
Here’s what this variable would look like, all 22 possible responses of it. There’s a bit of a left tail for the anti-government-doing-anything folk, but this has a nice juicy center for a variable with just 22 different responses.
%>%
gss_spending filter(race == 1) %>%
group_by(sumnatsoc) %>%
tally() %>%
ggplot(.,aes(ordered(sumnatsoc), n)) +
geom_bar(stat="identity")
Now, let’s compare the ordinal model with the OLS model.
<- clm(ordered(sumnatsoc) ~ age + sex + collegeed + rincom16 + pid7 + polviews, data=subset(gss_spending, race == 1)) M5
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
## Consider formula(paste(x, collapse = " ")) instead.
<- lm(sumnatsoc ~ age + sex + collegeed + rincom16 + pid7 + polviews, data=subset(gss_spending, race == 1))
M6
::tidy(M5) %>%
broomfilter(coef.type != "intercept") %>%
select(-coef.type) %>%
mutate(model = "Ordinal Logistic Regression") %>%
bind_rows(., broom::tidy(M6) %>% mutate(model = "OLS")) %>%
arrange(term)
## # A tibble: 13 x 6
## term estimate std.error statistic p.value model
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 9.90 0.471 21.0 3.56e-80 OLS
## 2 age -0.00888 0.00413 -2.15 3.18e- 2 Ordinal Logistic Regressi…
## 3 age -0.0209 0.00674 -3.10 2.03e- 3 OLS
## 4 collegeed -0.0504 0.131 -0.386 7.00e- 1 Ordinal Logistic Regressi…
## 5 collegeed -0.106 0.211 -0.501 6.16e- 1 OLS
## 6 pid7 -0.336 0.0428 -7.85 4.05e-15 Ordinal Logistic Regressi…
## 7 pid7 -0.485 0.0672 -7.21 1.22e-12 OLS
## 8 polviews -0.351 0.0564 -6.22 4.94e-10 Ordinal Logistic Regressi…
## 9 polviews -0.589 0.0894 -6.59 7.59e-11 OLS
## 10 rincom16 -0.000650 0.0106 -0.0613 9.51e- 1 Ordinal Logistic Regressi…
## 11 rincom16 0.00316 0.0172 0.184 8.54e- 1 OLS
## 12 sex 0.151 0.120 1.25 2.10e- 1 Ordinal Logistic Regressi…
## 13 sex 0.268 0.197 1.36 1.74e- 1 OLS
Similar performance. No one is going to yell too much at you for doing an OLS on a technically ordinal item that has like 22 different values. But, maybe consider some kind of robust standard error correction.