Abstract
This is a lab script for POST 8000, a graduate-level quantitative methods for public policy class that I teach at Clemson University. It will not be the most sophisticated R-related write-up of mine (check my blog for those), but it should be useful for discussion around the associated R script for the week’s ‘lab’ session. Some of what I write here is taken from my website, so be advised I might plagiarize myself here.
You should’ve already installed two of the R packages for this lab session: {tidyverse} will be for all things workflow and {stevedata} will be for the toy data sets. Let’s load those libraries now; the order doesn’t really matter here. I think I had mistakenly assumed that {broom} was in {tidyverse}, but it does not appear to be. No matter. It’s a wonderful package for a lot of things, probably one of the best value-added packages in all of R. Install it if you have not.
The following thing below is something I wrote for the IPSA-FLACSO program in Mexico City in 2019—and, sidebar, I miss the hell out of that amazing city and FLACSO should invite me back for things. Basically, this thing looks for whether you have downloaded/installed a package from CRAN and, if you don’t have it, it installs it.
pack <- c("broom")
new_pack <- pack[!(pack %in% installed.packages()[,"Package"])]
if(length(new_pack)) install.packages(new_pack)
Let’s load what we’ll be using here.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(stevedata)
library(broom)
Let’s start with an illustration of basic correlation. There are any number of ways of doing a rudimentary correlation in R, but it’s worth noting that the cor() function comes standard in R. However, like the mean(), median(), and sd() functions, you’ll need to be mindful of NAs in your data. If you have them, you’ll want to add use = "complete.obs" as an extra parameter in cor().
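To see what the NA issue looks like, here’s a quick sketch with two made-up vectors (a and b below are just for this illustration):

a <- c(1, 2, 3, 4, NA)   # made-up vector with a missing value
b <- c(2, 4, 5, 4, 1)    # made-up vector, fully observed
cor(a, b)                       # returns NA because of the missing value
cor(a, b, use = "complete.obs") # drops the incomplete pair, then correlates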
Let’s do something simple. Let’s correlate the sumnatsoc measure (which increases as the respondent believes we’re spending too little on various social programs) with the respondent’s ideology. As always, check ?gss_spending for more clarity about the measure.
cor(gss_spending$polviews, gss_spending$sumnatsoc, use="complete.obs")
## [1] -0.3510539
This will tell you the Pearson’s r (correlation coefficient) between both is -.35, which is a decently strong negative correlation. Be mindful of a few things with correlation. While there is such a thing as a correlation test for significance (which I’ll talk a little bit about for instrumental variables later in the semester), most assessments of correlation beyond -1, 0, or 1 are informal. Notice the awkwardness of a “decently strong negative” correlation. To be clear, the correlation is negative, but the strength of it and how you summarize it is a “use your head” issue here.
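As a quick aside, if you want that formal significance test now, it comes standard in R as cor.test(). Here’s a minimal sketch, piping the result through {broom}’s tidy() for a cleaner print; I’m filtering out the NAs explicitly first just to be careful.

# A minimal sketch of the formal correlation test mentioned above.
gss_spending %>%
  filter(!is.na(polviews), !is.na(sumnatsoc)) %>%
  with(cor.test(polviews, sumnatsoc)) %>%
  broom::tidy()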
Further, and recall, correlation is ambivalent about what is x and what is y. You can flip the two and get the exact same Pearson’s r back.
cor(gss_spending$sumnatsoc, gss_spending$polviews, use="complete.obs")
## [1] -0.3510539
But, if you’re going to just explore without overwriting anything, you can add it to a pipe workflow. Observe:
gss_spending %>%
  summarize(r = cor(polviews, sumnatsoc, use="complete.obs"))
## # A tibble: 1 x 1
## r
## <dbl>
## 1 -0.351
I end up doing it this way since it’s less code. {tidyverse} for the win.
Correlation is super useful as preliminary exploratory data analysis, and for checking for multicollinearity concerns. But be mindful. Correlation may mislead you. Let’s start with Anscombe’s quartets.
Francis J. Anscombe’s (1973) famous quartet is a go-to illustration for how simple associational analyses, even linear regression, can mislead the researcher if s/he is not taking care to actually look at the underlying data. Here, Anscombe created four data sets, each with 11 observations and two variables (x and y). All four quartets have the same mean for x and y, the same variance for x and y, the same regression line (y-intercept and slope for x), the same residual sum of squares, and the same correlation. However, they look quite different.
Let’s look at the data first.
?quartets
quartets
## # A tibble: 44 x 3
## x y group
## <dbl> <dbl> <chr>
## 1 10 8.04 Quartet 1
## 2 8 6.95 Quartet 1
## 3 13 7.58 Quartet 1
## 4 9 8.81 Quartet 1
## 5 11 8.33 Quartet 1
## 6 14 9.96 Quartet 1
## 7 6 7.24 Quartet 1
## 8 4 4.26 Quartet 1
## 9 12 10.8 Quartet 1
## 10 7 4.82 Quartet 1
## # … with 34 more rows
Notice these four quartets each have the same mean and standard deviation for x and y.
quartets %>%
  group_by(group) %>%
  summarize_if(is.numeric, list(mean = mean, sd = sd))
## # A tibble: 4 x 5
## group x_mean y_mean x_sd y_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Quartet 1 9 7.50 3.32 2.03
## 2 Quartet 2 9 7.50 3.32 2.03
## 3 Quartet 3 9 7.5 3.32 2.03
## 4 Quartet 4 9 7.50 3.32 2.03
The correlations are effectively the same for all four quartets.
quartets %>%
  group_by(group) %>%
  summarize(r = cor(x, y))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## group r
## <chr> <dbl>
## 1 Quartet 1 0.816
## 2 Quartet 2 0.816
## 3 Quartet 3 0.816
## 4 Quartet 4 0.817
The linear function of y ~ x is practically the same across all four quartets. This is why I asked you to download the {broom} package. I’ll annotate the code a little bit to make it clearer.
quartets %>%
  # nest the data under the particular group
  nest(data = -group) %>%
  # Then, using {purrr} `map()` magic, do a linear regression on each group.
  # Then, using {broom}'s `tidy()` magic to gank the results into a table.
  mutate(fit = map(data, ~lm(y ~ x, data=.)),
         tidied = map(fit, broom::tidy)) %>%
  # unnest the table to show us what we want.
  unnest(tidied) %>%
  # select just what we want to see.
  select(group, term:p.value)
## # A tibble: 8 x 6
## group term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Quartet 1 (Intercept) 3.00 1.12 2.67 0.0257
## 2 Quartet 1 x 0.500 0.118 4.24 0.00217
## 3 Quartet 2 (Intercept) 3.00 1.13 2.67 0.0258
## 4 Quartet 2 x 0.5 0.118 4.24 0.00218
## 5 Quartet 3 (Intercept) 3.00 1.12 2.67 0.0256
## 6 Quartet 3 x 0.500 0.118 4.24 0.00218
## 7 Quartet 4 (Intercept) 3.00 1.12 2.67 0.0256
## 8 Quartet 4 x 0.500 0.118 4.24 0.00216
However, each quartet looks quite different.
quartets %>%
  ggplot(.,aes(x,y)) +
  facet_wrap(~group) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Basically: look at your data and be mindful that you can be easily misled if you don’t do this.
If you want a more extreme example of the same basic thing, check out the Datasaurus data in {stevedata}.
?Datasaurus
Datasaurus
## # A tibble: 1,846 x 3
## dataset x y
## <chr> <dbl> <dbl>
## 1 dino 55.4 97.2
## 2 dino 51.5 96.0
## 3 dino 46.2 94.5
## 4 dino 42.8 91.4
## 5 dino 40.8 88.3
## 6 dino 38.7 84.9
## 7 dino 35.6 79.9
## 8 dino 33.1 77.6
## 9 dino 29.0 74.5
## 10 dino 26.2 71.4
## # … with 1,836 more rows
Basically, the people who wrote these data rightly complained that Anscombe wasn’t 100% transparent with how he created the data he did. Plus, you can have more fun with these data. Note that the descriptive stats here are basically identical.
Datasaurus %>%
  group_by(dataset) %>%
  summarize(x_bar = mean(x),
            x_sd = sd(x),
            y_bar = mean(y),
            y_sd = sd(y),
            cor = cor(x, y))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 13 x 6
## dataset x_bar x_sd y_bar y_sd cor
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 away 54.3 16.8 47.8 26.9 -0.0641
## 2 bullseye 54.3 16.8 47.8 26.9 -0.0686
## 3 circle 54.3 16.8 47.8 26.9 -0.0683
## 4 dino 54.3 16.8 47.8 26.9 -0.0645
## 5 dots 54.3 16.8 47.8 26.9 -0.0603
## 6 h_lines 54.3 16.8 47.8 26.9 -0.0617
## 7 high_lines 54.3 16.8 47.8 26.9 -0.0685
## 8 slant_down 54.3 16.8 47.8 26.9 -0.0690
## 9 slant_up 54.3 16.8 47.8 26.9 -0.0686
## 10 star 54.3 16.8 47.8 26.9 -0.0630
## 11 v_lines 54.3 16.8 47.8 26.9 -0.0694
## 12 wide_lines 54.3 16.8 47.8 26.9 -0.0666
## 13 x_shape 54.3 16.8 47.8 26.9 -0.0656
The regression line is basically the same.
Datasaurus %>%
  nest(data = -dataset) %>%
  mutate(fit = map(data, ~lm(y ~ x, data=.)),
         tidied = map(fit, broom::tidy)) %>%
  # unnest the table to show us what we want.
  unnest(tidied) %>%
  # select just what we want to see.
  select(dataset, term:p.value)
## # A tibble: 26 x 6
## dataset term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 dino (Intercept) 53.5 7.69 6.95 1.29e-10
## 2 dino x -0.104 0.136 -0.764 4.46e- 1
## 3 away (Intercept) 53.4 7.69 6.94 1.31e-10
## 4 away x -0.103 0.135 -0.760 4.48e- 1
## 5 h_lines (Intercept) 53.2 7.70 6.91 1.53e-10
## 6 h_lines x -0.0992 0.136 -0.732 4.66e- 1
## 7 v_lines (Intercept) 53.9 7.69 7.01 9.38e-11
## 8 v_lines x -0.112 0.135 -0.824 4.12e- 1
## 9 x_shape (Intercept) 53.6 7.69 6.97 1.17e-10
## 10 x_shape x -0.105 0.135 -0.778 4.38e- 1
## # … with 16 more rows
But the data look quite different.
Datasaurus %>%
  ggplot(.,aes(x, y)) + geom_point() +
  facet_wrap(~dataset)
Always look at your data. Roar.
Simpson’s paradox is a well-known problem of correlation in which a correlation analysis, almost always done in a bivariate context, may reveal a relationship that is reversed upon the introduction of some third variable. Every stylized case I’ve seen of Simpson’s paradox conceptualizes the third factor as some kind of grouping/category effect (certainly of interest to me as a mixed effects modeler), even if this may not be necessary to show a Simpson reversal.
One of the classic pedagogical cases of a Simpson’s paradox is from Deborah Lynne Guber’s (1999) study of public school expenditures and SAT performance in 1994-95 across all 50 states. There are other cases I’ve seen of a Simpson reversal, incidentally most involving education, but these data were readily available and are part of my {stevedata} package as the Guber99 data frame.
?Guber99
Guber99
## # A tibble: 50 x 8
## state expendpp ptratio tsalary perctakers verbal math total
## <chr> <dbl> <dbl> <dbl> <int> <int> <int> <int>
## 1 Alabama 4.40 17.2 31.1 8 491 538 1029
## 2 Alaska 8.96 17.6 48.0 47 445 489 934
## 3 Arizona 4.78 19.3 32.2 27 448 496 944
## 4 Arkansas 4.46 17.1 28.9 6 482 523 1005
## 5 California 4.99 24 41.1 45 417 485 902
## 6 Colorado 5.44 18.4 34.6 29 462 518 980
## 7 Connecticut 8.82 14.4 50.0 81 431 477 908
## 8 Delaware 7.03 16.6 39.1 68 429 468 897
## 9 Florida 5.72 19.1 32.6 48 420 469 889
## 10 Georgia 5.19 16.3 32.3 65 406 448 854
## # … with 40 more rows
In Guber’s (1999) case, she encounters a troubling negative correlation (among other negative correlations): states that spend more (per pupil) on their public school students have lower SAT scores than those that spend less. Observe:
Guber99 %>%
  summarize(r = cor(expendpp, total))
## # A tibble: 1 x 1
## r
## <dbl>
## 1 -0.381
But there’s a lurking variable here, so use your head on this one. Some quick-and-dirty quartiles by the percentage of the state taking the SAT will show a Simpson reversal.
Guber99 %>%
  # create quick-and-dirty quartiles for perc of state taking SAT
  mutate(group = as.factor(ntile(perctakers, 4))) %>%
  group_by(group) %>%
  summarize(r = cor(expendpp, total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## group r
## <fct> <dbl>
## 1 1 0.278
## 2 2 0.305
## 3 3 0.570
## 4 4 0.142
At every quartile of SAT test-taker percentage, there is a positive correlation between student expenditures and SAT performance statewide. The post on my website has a fancy pair of graphs to drive this home.
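I won’t reproduce those exact graphs here, but a rough sketch of that kind of plot would look something like this: the pooled regression line slopes down while the within-quartile lines slope up.

Guber99 %>%
  # same quick-and-dirty quartiles as above
  mutate(group = as.factor(ntile(perctakers, 4))) %>%
  ggplot(., aes(expendpp, total)) +
  geom_point(aes(color = group)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # pooled fit (negative slope)
  geom_smooth(aes(color = group), method = "lm", se = FALSE) # within-quartile fits (positive slopes)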
## Ecological Fallacy

We have Robinson (1950) to thank for showing that ecological correlations are not substitutes for individual correlations. These data are available as illiteracy30 in my {stevedata} package.
?illiteracy30
illiteracy30
## # A tibble: 49 x 11
## state pop pop_il nwhite nwhite_il fpwhite fpwhite_il fbwhite fbwhite_il
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Maine 6.43e5 17172 4.06e5 5745 136817 2872 98310 8393
## 2 New … 3.82e5 10231 1.88e5 1155 111717 1211 81496 7820
## 3 Verm… 2.92e5 6299 1.84e5 1921 65382 1340 41695 3005
## 4 Mass… 3.51e6 124158 1.08e6 3247 1339077 6405 1042889 111568
## 5 Rhod… 5.60e5 27536 1.55e5 741 227909 1991 168975 24124
## 6 Conn… 1.32e6 59874 4.12e5 1271 507112 2230 378586 55136
## 7 New … 1.05e7 388883 3.49e6 19712 3506583 14942 3150593 341345
## 8 New … 3.33e6 128022 1.22e6 6336 1109603 5236 833727 107192
## 9 Penn… 7.73e6 240323 4.18e6 26002 1972545 10515 1221729 187942
## 10 Ohio 5.43e6 123804 3.41e6 26588 1129731 5799 637535 74131
## # … with 39 more rows, and 2 more variables: black <dbl>, black_il <dbl>
Robinson (1950) collected data from the 1930 Census, which unsurprisingly showed a discrepancy in literacy rates (measured for persons age 10 and over) between the non-white and foreign-born populations and the white, native-born population in the U.S. FWIW, consider the time frame here. The public good of education has always been unequally allocated by race and immigrant status. Further, some basic background of which Europeans were migrating to the U.S. and why will underscore what you see here.
# for context: fbwhite = "foreign born white" (i.e. immigrant),
# nwhite = native-born white,
# fpwhite = white from "foreign/mixed parentage"
illiteracy30 %>%
  summarize_if(is.numeric, sum) %>%
  gather(var) %>%
  separate(var, c("category","literacy"),"_") %>%
  mutate(literacy = ifelse(!is.na(literacy), "Illiterate", "Total Population")) %>%
  group_by(category) %>%
  summarize(prop = min(value)/max(value))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 5 rows [1, 3, 5,
## 7, 9].
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## category prop
## <chr> <dbl>
## 1 black 0.163
## 2 fbwhite 0.0987
## 3 fpwhite 0.00559
## 4 nwhite 0.0183
## 5 pop 0.0434
That literacy rates are lower among migrants at this point is not a controversial thing to say. But, incidentally, the correlation between the foreign-born share of the population and the share of the population that is illiterate is negative at the state level. The only caveat to drawing a single line here is that I see a curvilinear relationship, even as that’s immaterial for illustrating the ecological fallacy.
illiteracy30 %>%
  mutate(foreignp = fbwhite/pop,
         illiterate = pop_il/pop,
         fbilliterate = fbwhite_il/fbwhite) %>%
  ggplot(.,aes(foreignp, illiterate)) + geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Why? Simple: immigrants were moving to places with lower illiteracy rates, which generally had more economic opportunity (e.g. NY, NJ). Basically, ecological correlations are not substitutes for individual correlations.
The central limit theorem says that, with an infinite number of random samples of size n drawn from a population, the distribution of sample means will be normally distributed, the mean of those sample means will converge on the true population mean, and the standard deviation of the sample means (i.e. the standard error) will equal the population standard deviation divided by the square root of n.
Let’s illustrate this with some thermometer rating data. This is Trump’s therm score from the ANES exploratory testing survey done April 10-18, 2020. I love thermometer data to teach central limit theorem because it shows the underlying principles, even in data that are noisy/ugly as hell. First, read more about the data and see it here. Just so we’re on the same page: higher values indicate “warmer” attitudes toward Trump (i.e. more favorable attitudes). Lower values indicate “colder” attitudes toward Trump (i.e. less favorable attitudes).
?therms
therms
## # A tibble: 3,080 x 2
## fttrump1 ftobama1
## <dbl> <dbl>
## 1 80 75
## 2 85 15
## 3 NA NA
## 4 80 50
## 5 0 100
## 6 35 5
## 7 30 90
## 8 60 50
## 9 0 85
## 10 1 85
## # … with 3,070 more rows
Observe how ugly as hell the distribution of this variable is.
therms %>%
  ggplot(.,aes(fttrump1)) + geom_bar()
## Warning: Removed 7 rows containing non-finite values (stat_count).
These data are ugly, right? Even as the mean and median are not too far removed from each other, the large standard deviation on a bounded 0-100 scale suggests something is weird about the data (that you just saw).
therms %>% select(fttrump1) %>%
  na.omit %>%
  summarize_all(list(mean = mean,
                     median = median,
                     sd = sd))
## # A tibble: 1 x 3
## mean median sd
## <dbl> <dbl> <dbl>
## 1 42.4 40 38.8
So let’s do something here. Let’s treat this therm score as the “population.” I.e., these 3,000 or so non-missing observations constitute the full universe of thermometer ratings for Donald Trump. Now, we want to see if we can approximate it by sampling from it.
therms %>%
  select(fttrump1) %>% na.omit %>%
  pull(fttrump1) -> Trump
Now check this shit out. First, let’s set a reproducible seed for sampling so you and I get the same samples back. PRO TIP: you should always set a reproducible seed for any simulation. It’s good practice for max reproducibility. We are going to take 1,000,000 samples of 10 observations from the data and report the means of them. Let’s be clear: no one should ever strive to have a random sample of 10 people. But we’re going for something here:
set.seed(8675309) # Jenny, I got your number...
trumpsamples <- tibble(
  samplemean = sapply(1:1000000,
                      function(i){ x <- mean(
                        sample(Trump, 10,
                               replace = FALSE))
                      }))
^ this might take a few seconds. You’re doing a million samples, after all. Let’s plot the distribution of our sample means. Here’s a histogram underneath a density function of the normal distribution with the parameters derived from the simulations.
trumpsamples %>%
  ggplot(.,aes(samplemean)) + geom_histogram(binwidth=.5, aes(y=..density..)) +
  stat_function(fun=dnorm,
                color="red",
                args=list(mean=mean(trumpsamples$samplemean),
                          sd=sd(trumpsamples$samplemean))) +
  xlab("Sample Mean") + ylab("Density")
Looks “normal”, right?
Before going too much further, let’s talk about a normal distribution. Underlying concept: data routinely cluster around some central tendency and deviations from it are less common. We owe the articulation/formalization of this concept to Carl Friedrich Gauss, who discovered this bell curve function. In Gauss’ formulation: f(x) = (1/sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2/(2*sigma^2)), where mu is the mean and sigma^2 is the variance. Some important properties of the normal density function/Gaussian distribution: it is perfectly symmetric around mu; because it describes a continuous variable, the probability of any one particular value is basically zero, but the area underneath the full distribution sums to 1; and a special case of it, the standard normal distribution, occurs when mu is 0 and sigma is 1.
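Just to convince yourself that this formula is what R’s dnorm() is computing, here’s a quick check at a few arbitrary values (xvals, mu, and sigma below are just placeholders for this illustration):

xvals <- c(-2, 0, 1.5)  # arbitrary values, just for the illustration
mu <- 0
sigma <- 1
(1/sqrt(2*pi*sigma^2)) * exp(-(xvals - mu)^2/(2*sigma^2))  # the formula by hand
dnorm(xvals, mean = mu, sd = sigma)                        # should match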
Anywho, how did we do?
mean(Trump)
## [1] 42.41913
mean(trumpsamples$samplemean)
## [1] 42.41954
Not bad. We were off by around 4/10000th of a point. A million is a lot, but not infinity after all. The principle that emerges is still clear.
Central limit theorem shows it’s ideal to take a shitload of samples to guess a population parameter. Sometimes, that’s not practical. You really can’t take infinity samples of the United States in a time-invariant way. So, there’s another way: get a good-sized random sample. If you understand random sampling error, you’ll know you can’t do much about the variation inherent in the population. In our case, these data are ugly as hell. What you can do, however, is reduce the sampling error by increasing the size of the sample.
Here’s how we’ll do it. We’re going to take 10 random samples of sizes 10, 100, 400, and 1000. Observe:
moresamples = tibble()

set.seed(8675309) # Jenny, I got your number...
for(j in c(10, 100, 400, 1000)) {
  hold_this <- tibble(
    x = j,
    y = sapply(1:10,
               function(i){ x <- mean(
                 sample(Trump, j,
                        replace = FALSE))
               }))
  moresamples <- bind_rows(moresamples, hold_this)
}
Now let’s look at the results. In the plot below, each dot coincides with one of 10 sample means of a given sample size (10, 100, 400, and 1000). The dashed horizontal line communicates the actual population mean as we understand it.
moresamples %>%
  ggplot(.,aes(x, y)) + geom_point(size = 3) +
  scale_x_continuous(breaks = c(10, 100, 400, 1000)) +
  xlab("Sample Size") + ylab("Sample Mean") +
  geom_hline(aes(yintercept=mean(Trump)), linetype="dashed")
Here’s how you should interpret the plot above. Notice that 10 samples of 10 gets you results all over the place. This is unsurprising. We have a few samples. The sample size is small. The variation in the population is huge. We’re all over the place. Things start to get better at 100 and 400 samples. Notice how increasing the sample size increases the accuracy of the sample mean. They’re starting to bunch around the actual population mean. The 10 samples of size 1,000 are all hovering closely around the actual population mean. PRO TIP: survey researchers have learned that a national poll “sweet spot” is about 1000 respondents. Most polls you see are thereabouts in size. It’s a “bang for your buck” thing. Fewer than 1000 observations and your random sampling error starts to get appreciably bigger. Any more than that and, generally, you might find yourself careening into Literary Digest Poll territory. To be clear: others do some large-N surveys quite well, but increasing the sample size just to increase it does incentivize some non-random sampling. Plus, it’s more expensive and the “payout” from a larger sample size may not be worth the cost. It’s why you get that sweet spot you routinely observe from opinion polls.
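To put some rough numbers on that “bang for your buck” point, here’s the standard error of the sample mean (sigma/sqrt(n)) at a few sample sizes, using the “population” standard deviation we happen to know in this exercise. Notice the diminishing returns after n = 1000 or so.

tibble(n = c(10, 100, 400, 1000, 2000, 5000)) %>%
  mutate(std_error = sd(Trump)/sqrt(n)) # random sampling error shrinks, but more slowly after ~1000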
I mentioned above that a special case of the normal distribution occurs when mu is zero and sigma is 1. It allows for an easier summary of the distribution and the areas underneath it. To reiterate: the probability of any one particular value is basically zero (since it’s continuous), but the area underneath the distribution constitutes the full domain and sums to 1. There are some set intervals around the mu that emerge in interesting ways. Observe where the cutoffs are for 68% of the distribution around the mean, 90%, 95%, and 99%.
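For reference, the cutoffs being shaded below come straight out of qnorm():

# z cutoffs that bound ~68%, 90%, 95%, and 99% of the distribution around the mean
qnorm(c(.84, .95, .975, .995))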
# For ease, just use normal_dist() in {stevemisc}
ggplot(data.frame(x = c(-4, 4)), aes(x)) +
stat_function(fun = dnorm,
xlim = c(qnorm(.05),abs(qnorm(.05))), size=0,
geom = "area", fill="#F66733", alpha=0.5) +
stat_function(fun = dnorm,
xlim = c(qnorm(.025),abs(qnorm(.025))), size=0,
geom = "area", fill="#F66733", alpha=0.4) +
stat_function(fun = dnorm,
xlim = c(qnorm(.005),abs(qnorm(.005))), size=0,
geom = "area", fill="#F66733", alpha=0.3) +
geom_segment(x=1, y=0, xend=1, yend=dnorm(1,0,1), color="white", linetype="dashed") +
geom_segment(x=-1, y=0, xend=-1, yend=dnorm(1,0,1), color="white", linetype="dashed") +
annotate(geom = "text", x = 0, y = 0.2,
label = "68%", size =4.5, color="white") +
geom_segment(x=-0.15, y=.2, xend=-.99, yend=.2, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
geom_segment(x=0.15, y=.2, xend=.99, yend=.2, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
annotate(geom = "text", x = 0, y = 0.1,
label = "90%", size =4.5, color="white") +
geom_segment(x=-0.15, y=.1, xend=-1.64, yend=.1, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
geom_segment(x=0.15, y=.1, xend=1.64, yend=.1, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
annotate(geom = "text", x = 0, y = 0.05,
label = "95%", size =4.5, color="white") +
geom_segment(x=-0.15, y=.05, xend=-1.95, yend=.05, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
geom_segment(x=0.15, y=.05, xend=1.95, yend=.05, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
annotate(geom = "text", x = 0, y = 0.01,
label = "99%", size =4.5, color="white") +
geom_segment(x=-0.15, y=.01, xend=-2.57, yend=.01, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
geom_segment(x=0.15, y=.01, xend=2.57, yend=.01, color="white",
arrow = arrow(length = unit(0.15, "cm"))) +
stat_function(fun = dnorm, color="#522D80", size=1.5) +
scale_x_continuous(breaks=c(-4, -2.58, -1.96, -1.645, -1, 0,
1, 1.645, 1.96, 2.58, 4)) +
ggtitle("The Area Underneath a Normal Distribution")
It’s why standardization is useful, among other things. Standardization (where the mean is zero and the sd = 1) is easy to calculate: subtract the variable’s mean from each value and divide that difference by the variable’s standard deviation.
trumpsamples %>%
  mutate(z = (samplemean - mean(samplemean))/sd(samplemean)) -> trumpsamples
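For what it’s worth, base R’s scale() does the same arithmetic. Here’s a quick sketch; z_alt is just a throwaway column name for the comparison, and as.vector() is there because scale() returns a one-column matrix.

trumpsamples %>%
  mutate(z_alt = as.vector(scale(samplemean))) # should match the z we just created by hand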
Let’s compare our standardized million sample means to the basic normal distribution.
trumpsamples %>%
  ggplot(., aes(z)) + geom_density() +
  stat_function(fun=dnorm,
                color="red") +
  xlab("Sample Mean (Standardized)") + ylab("Density") +
  ggtitle("Distribution of Sample Means from a Million Samples of Ten Respondents") +
  labs(caption="Data: ANES Exploratory Testing Survey (April 10-18, 2020).")
It also allows us to make some kinds of inferential statements. So let’s do this: we are going to collect 100 observations, randomly sampled, from this “population.” We’re playing god here, if you will. We know what the “population” mean is. We also know the “population” standard deviation. But, let’s see what we can do here:
set.seed(8675309) # Jenny, I got your number...
oursample <- tibble(x = sample(Trump, 100, replace = FALSE))
mean(oursample$x)
## [1] 43.41
Let’s get a summary of what we did.
oursample %>%
  summarize(sigma = sd(Trump), # sd of the "population"
            mu = mean(Trump), # mean of the "population"
            mean = mean(x), # mean of our sample
            sem = sigma/sqrt(nrow(.)), # standard error of the sample mean
            lwr = mean - 1.96*sem,
            upr = mean + 1.96*sem) -> oursamplesummaries
oursamplesummaries
## # A tibble: 1 x 6
## sigma mu mean sem lwr upr
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 38.8 42.4 43.4 3.88 35.8 51.0
Here’s what we did in that code: we collected the known mean and sigma (standard deviation) from the “population”; we collected the mean from our random sample; we calculated the standard error of the sample mean (sem), which equals sigma over the square root of the number of observations in the sample; and we calculated the 95% confidence interval around the sample mean, given what we learned about the normal distribution above. This is the interval in which 95% of all possible sample estimates will fall by chance. If we took 100 samples of n = 100, 95 of those random samples on average would have sample means between lwr and upr.

We’re not saying, for the moment, that’s where the true population mean is. We don’t necessarily know that. But even this has some nice properties. Question: how likely was our sample mean, given the actual population mean? You can answer this question by getting the z value, which communicates distance from a proposed mu. Here, that’s the actual mu, which we know. Then, you can calculate the probability of observing that z-value.
oursamplesummaries %>%
  mutate(z = (mean - mu)/sem,
         p = 1 - pnorm(abs(z)))
## # A tibble: 1 x 8
## sigma mu mean sem lwr upr z p
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 38.8 42.4 43.4 3.88 35.8 51.0 0.255 0.399
Here’s how you’d interpret this: if mu were actually ~42.4 (or whatever exact value it is), the probability that we observed our x-bar of 43.4 (rounded in the output you see) is around .399. That’s kind of a probable result. Our sample mean is obviously not the population mean, but it’s in orbit and a probabilistic draw from the population. What if we were to test against a claim that Trump’s thermometer rating is actually 73.88? Source: this is the vote share he got county-wide (Pickens, SC) in 2016. To be clear, vote share != thermometer rating, but let’s have fun with it.
oursamplesummaries %>%
  mutate(z = (mean - 73.88)/sem,
         p = 1 - pnorm(abs(z)))
## # A tibble: 1 x 8
## sigma mu mean sem lwr upr z p
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 38.8 42.4 43.4 3.88 35.8 51.0 -7.84 2.22e-15
Here’s how you’d interpret this: if the true population mean were 73.88 (which it clearly isn’t), the sample mean we got is almost 8 standard errors (!) away from it. The probability we observed it is something like .0000000000000002220446. We can reject the claim that 73.88 is the population mean with a fair bit of confidence. In fact, we know it’s not.
What about 50? Source: Trump would brag about 50% approval rating, when he can find a poll that shows it, when no other president would brag about that. You can do the same thing, with same caveat that approval rating != thermometer rating, but alas…
oursamplesummaries %>%
  mutate(z = (mean - 50)/sem,
         p = 1 - pnorm(abs(z)))
## # A tibble: 1 x 8
## sigma mu mean sem lwr upr z p
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 38.8 42.4 43.4 3.88 35.8 51.0 -1.70 0.0449
The sample statistic we got suggests that, if Trump’s thermometer rating is truly 50, the probability of us observing the sample mean we got is .0449. That’d be a fluke draw. So, we should reject the claim that Trump’s thermometer rating is 50 (in fact, we know it’s not) and suggest our sample mean is closer to what it actually is. Which is true. We’ve been playing god this entire time.
To this point, we’ve been assuming we know the population sigma, if not the mu, in calculating our standard error of the sample mean. This is a bit unrealistic, obviously. To know sigma is to also know mu. What about when you don’t know the population parameters at all? Student’s t-distribution is your answer. Student’s t-distribution looks like the normal distribution, but with fatter tails for fewer degrees of freedom.
So, let’s test against that hypothetical claim of 50 again, this time with Student’s t-distribution.
oursample %>%
  summarize(mean = mean(x),
            sem = sd(x)/sqrt(nrow(.)), # notice the numerator
            lwr = mean - 1.96*sem,
            upr = mean + 1.96*sem,
            t = (mean - 50)/sem,
            p = pt(t, nrow(.)))
## # A tibble: 1 x 6
## mean sem lwr upr t p
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43.4 3.87 35.8 51.0 -1.70 0.0460
Notice the results look very similar to what we just did? It’s because the t-distribution converges on the normal distribution with more degrees of freedom (loosely: more observations, after considering parameters to be estimated). More formally: degrees of freedom = number of obs. (n) - parameters to be estimated (k). Here, we just want a simple mean (k = 1).
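You can see that convergence in the critical values themselves. Here’s a quick sketch comparing the two-sided 95% cutoff from Student’s t-distribution to the normal distribution’s familiar 1.96 as the degrees of freedom grow.

tibble(df = c(5, 10, 30, 100, 1000)) %>%
  mutate(t_cutoff = qt(.975, df),   # two-sided 95% cutoff from Student's t
         z_cutoff = qnorm(.975))    # the normal distribution's ~1.96, for comparison

The densities tell the same story: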
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
stat_function(fun = dt, args = list(df = 1), color="#026CCBFF") +
stat_function(fun = dt, args = list(df = 10), color="#F51E02FF") +
stat_function(fun = dt, args = list(df = 1000), color="#05B102FF") +
stat_function(fun = dt, args = list(df = Inf), color="#FB9F53FF") +
stat_function(fun = dnorm, color="black")