Elsewhere on My Website

Some of what I write here is taken from my website. Please read these:

Be advised I might plagiarize myself here.

R Packages/Data for This Session

You should’ve already installed two of the R packages we need for this lab session: {tidyverse} will be for all things workflow and {stevedata} will be for the toy data sets. Let’s load those libraries now; the order doesn’t really matter here. I had thought {broom} was part of {tidyverse}, but it does not appear to be. No matter. It’s a wonderful package for a lot of things, probably one of the best value-added packages in all of R. Install it if you have not.

What follows is something I wrote for the IPSA-FLACSO program in Mexico City in 2019 (and, sidebar, I miss the hell out of that amazing city and FLACSO should invite me back for things). Basically, it looks for whether you have downloaded/installed a package from CRAN and, if you don’t have it, it installs it.

pack <- c("broom")
new_pack <- pack[!(pack %in% installed.packages()[,"Package"])]
if(length(new_pack)) install.packages(new_pack)

Let’s load what we’ll be using here.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(stevedata)
library(broom)

Correlation

Let’s start with an illustration of basic correlation. There are any number of ways of doing a rudimentary correlation analysis in R, but it’s worth noting that the cor() function comes standard. However, like the mean(), median(), and sd() functions, you’ll need to be mindful of NAs in your data. If you have them, you’ll want to add use = "complete.obs" as an extra parameter in cor().
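For a quick illustration of that NA behavior, here’s a minimal sketch with made-up vectors (nothing from {stevedata}):

# Made-up vectors purely for illustration; the NA in x is the culprit.
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 8, 10)

cor(x, y)                       # returns NA because of the missing value in x
cor(x, y, use = "complete.obs") # drops the incomplete pair and returns ~.98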

Let’s do something simple. Let’s correlate the sumnatsoc measure (which increases as the respondent believes we’re spending too little on various social programs) with the respondent’s ideology. As always, check ?gss_spending for more clarity about the measure.

cor(gss_spending$polviews, gss_spending$sumnatsoc, use="complete.obs")
## [1] -0.3510539

This will tell you the Pearson’s r (correlation coefficient) between the two is -.35, which is a decently strong negative correlation. Be mindful of a few things with correlation. While there is such a thing as a correlation test for significance (which I’ll talk a little bit about for instrumental variables later in the semester), most assessments of a correlation that isn’t exactly -1, 0, or 1 are informal. Notice the awkwardness of “decently strong negative” correlation. To be clear, the correlation is negative, but the strength of it and how you summarize it is a “use your head” issue here.

Further, and recall, correlation is ambivalent about what is x and what is y. You can flip the two and get the exact same Pearson’s r back.

cor(gss_spending$sumnatsoc, gss_spending$polviews, use="complete.obs")
## [1] -0.3510539

But, if you’re going to just explore without overwriting anything, you can add it to a pipe workflow. Observe:

gss_spending %>%
  summarize(r = cor(polviews, sumnatsoc, use="complete.obs"))
## # A tibble: 1 x 1
##        r
##    <dbl>
## 1 -0.351

I end up doing it this way since it’s less code. {tidyverse} for the win.

Correlation is super useful as preliminary exploratory data analysis, and for checking for multicollinearity concerns. But be mindful. Correlation may mislead you. Let’s start with Anscombe’s quartets.

Anscombe’s Quartets

Francis J. Anscombe’s (1973) famous quartet is a go-to illustration for how simple associational analyses, even linear regression, can mislead the researcher if s/he is not taking care to actually look at the underlying data. Here, Anscombe created four data sets, each with 11 observations and two variables (x and y). All four quartets have the same mean for x and y, the same variance for x and y, the same regression line (y-intercept and slope for x), the same residual sum of squares, and the same correlation. However, they look quite different.

Let’s look at the data first.

?quartets
quartets
## # A tibble: 44 x 3
##        x     y group    
##    <dbl> <dbl> <chr>    
##  1    10  8.04 Quartet 1
##  2     8  6.95 Quartet 1
##  3    13  7.58 Quartet 1
##  4     9  8.81 Quartet 1
##  5    11  8.33 Quartet 1
##  6    14  9.96 Quartet 1
##  7     6  7.24 Quartet 1
##  8     4  4.26 Quartet 1
##  9    12 10.8  Quartet 1
## 10     7  4.82 Quartet 1
## # … with 34 more rows

Notice these four quartets each have the same mean and standard deviation for x and y.

quartets %>%
  group_by(group) %>%
  summarize_if(is.numeric, list(mean = mean, sd = sd))
## # A tibble: 4 x 5
##   group     x_mean y_mean  x_sd  y_sd
##   <chr>      <dbl>  <dbl> <dbl> <dbl>
## 1 Quartet 1      9   7.50  3.32  2.03
## 2 Quartet 2      9   7.50  3.32  2.03
## 3 Quartet 3      9   7.5   3.32  2.03
## 4 Quartet 4      9   7.50  3.32  2.03

The correlations are effectively the same for all four quartets.

quartets %>%
  group_by(group) %>%
  summarize(r = cor(x, y))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   group         r
##   <chr>     <dbl>
## 1 Quartet 1 0.816
## 2 Quartet 2 0.816
## 3 Quartet 3 0.816
## 4 Quartet 4 0.817

The linear function of y ~ x is practically the same across all four quartets. This is where the {broom} package I asked you to install comes in. I’ll annotate the code a little bit to make it clearer.

quartets %>%
  # nest the data under the particular group
  nest(data = -group) %>%
  # Then, using {purrr} `map()` magic, do a linear regression on each group.
  # Then, using {broom}'s `tidy()` magic to gank the results into a table
  mutate(fit = map(data, ~lm(y ~ x, data=.)),
         tidied = map(fit, broom::tidy)) %>%
  # unnest the table to show us what we want.
  unnest(tidied) %>%
  # select just what we want to see.
  select(group, term:p.value)
## # A tibble: 8 x 6
##   group     term        estimate std.error statistic p.value
##   <chr>     <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 Quartet 1 (Intercept)    3.00      1.12       2.67 0.0257 
## 2 Quartet 1 x              0.500     0.118      4.24 0.00217
## 3 Quartet 2 (Intercept)    3.00      1.13       2.67 0.0258 
## 4 Quartet 2 x              0.5       0.118      4.24 0.00218
## 5 Quartet 3 (Intercept)    3.00      1.12       2.67 0.0256 
## 6 Quartet 3 x              0.500     0.118      4.24 0.00218
## 7 Quartet 4 (Intercept)    3.00      1.12       2.67 0.0256 
## 8 Quartet 4 x              0.500     0.118      4.24 0.00216

However, each quartet looks quite different.

quartets %>%
  ggplot(.,aes(x,y)) +
  facet_wrap(~group) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Basically: look at your data and be mindful that you can be easily misled if you don’t do this.

The Datasaurus

If you want a more extreme example of the same basic thing, check out the Datasaurus data in {stevedata}.

?Datasaurus
Datasaurus
## # A tibble: 1,846 x 3
##    dataset     x     y
##    <chr>   <dbl> <dbl>
##  1 dino     55.4  97.2
##  2 dino     51.5  96.0
##  3 dino     46.2  94.5
##  4 dino     42.8  91.4
##  5 dino     40.8  88.3
##  6 dino     38.7  84.9
##  7 dino     35.6  79.9
##  8 dino     33.1  77.6
##  9 dino     29.0  74.5
## 10 dino     26.2  71.4
## # … with 1,836 more rows

Basically, the people who created these data rightly complained that Anscombe wasn’t 100% transparent about how he created his data. Plus, you can have more fun with these data. Note that the descriptive stats here are basically identical.

Datasaurus %>%
  group_by(dataset) %>%
  summarize(x_bar = mean(x),
            x_sd = sd(x),
            y_bar = mean(y),
            y_sd = sd(y),
            cor = cor(x, y))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 13 x 6
##    dataset    x_bar  x_sd y_bar  y_sd     cor
##    <chr>      <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1 away        54.3  16.8  47.8  26.9 -0.0641
##  2 bullseye    54.3  16.8  47.8  26.9 -0.0686
##  3 circle      54.3  16.8  47.8  26.9 -0.0683
##  4 dino        54.3  16.8  47.8  26.9 -0.0645
##  5 dots        54.3  16.8  47.8  26.9 -0.0603
##  6 h_lines     54.3  16.8  47.8  26.9 -0.0617
##  7 high_lines  54.3  16.8  47.8  26.9 -0.0685
##  8 slant_down  54.3  16.8  47.8  26.9 -0.0690
##  9 slant_up    54.3  16.8  47.8  26.9 -0.0686
## 10 star        54.3  16.8  47.8  26.9 -0.0630
## 11 v_lines     54.3  16.8  47.8  26.9 -0.0694
## 12 wide_lines  54.3  16.8  47.8  26.9 -0.0666
## 13 x_shape     54.3  16.8  47.8  26.9 -0.0656

The regression line is basically the same.

Datasaurus %>%
  nest(data = -dataset) %>%
  mutate(fit = map(data, ~lm(y ~ x, data=.)),
         tidied = map(fit, broom::tidy)) %>%
  # unnest the table to show us what we want.
  unnest(tidied) %>%
  # select just what we want to see.
  select(dataset, term:p.value)
## # A tibble: 26 x 6
##    dataset term        estimate std.error statistic  p.value
##    <chr>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 dino    (Intercept)  53.5        7.69      6.95  1.29e-10
##  2 dino    x            -0.104      0.136    -0.764 4.46e- 1
##  3 away    (Intercept)  53.4        7.69      6.94  1.31e-10
##  4 away    x            -0.103      0.135    -0.760 4.48e- 1
##  5 h_lines (Intercept)  53.2        7.70      6.91  1.53e-10
##  6 h_lines x            -0.0992     0.136    -0.732 4.66e- 1
##  7 v_lines (Intercept)  53.9        7.69      7.01  9.38e-11
##  8 v_lines x            -0.112      0.135    -0.824 4.12e- 1
##  9 x_shape (Intercept)  53.6        7.69      6.97  1.17e-10
## 10 x_shape x            -0.105      0.135    -0.778 4.38e- 1
## # … with 16 more rows

But the data look quite different.

Datasaurus %>%
  ggplot(.,aes(x, y)) + geom_point() +
  facet_wrap(~dataset)

Always look at your data. Roar.

Simpson’s Paradox

Simpson’s paradox is a well-known problem of correlation in which a correlation analysis, almost always done in a bivariate context, may reveal a relationship that is reversed upon the introduction of some third variable. Every stylized case I’ve seen of Simpson’s paradox conceptualizes the third factor as some kind of grouping/category effect (certainly of interest to me as a mixed effects modeler), even if this may not be necessary to show a Simpson reversal.

One of the classic pedagogical cases of a Simpson’s paradox is from Deborah Lynne Guber’s (1999) study of public school expenditures and SAT performance in 1994-95 across all 50 states. There are other cases I’ve seen of a Simpson reversal, incidentally most involving education, but these data were readily available and are part of my {stevedata} package as the Guber99 data frame.

?Guber99
Guber99
## # A tibble: 50 x 8
##    state       expendpp ptratio tsalary perctakers verbal  math total
##    <chr>          <dbl>   <dbl>   <dbl>      <int>  <int> <int> <int>
##  1 Alabama         4.40    17.2    31.1          8    491   538  1029
##  2 Alaska          8.96    17.6    48.0         47    445   489   934
##  3 Arizona         4.78    19.3    32.2         27    448   496   944
##  4 Arkansas        4.46    17.1    28.9          6    482   523  1005
##  5 California      4.99    24      41.1         45    417   485   902
##  6 Colorado        5.44    18.4    34.6         29    462   518   980
##  7 Connecticut     8.82    14.4    50.0         81    431   477   908
##  8 Delaware        7.03    16.6    39.1         68    429   468   897
##  9 Florida         5.72    19.1    32.6         48    420   469   889
## 10 Georgia         5.19    16.3    32.3         65    406   448   854
## # … with 40 more rows

In Guber’s (1999) case, she encounters a troubling negative correlation (among other negative correlations): states that spend more (per pupil) on their public school students have lower SAT scores than those that spend less. Observe:

Guber99 %>%
  summarize(r = cor(expendpp, total))
## # A tibble: 1 x 1
##        r
##    <dbl>
## 1 -0.381

But there’s a lurking variable. Use your head on this one:

  1. The ACT is a test-taking alternative in a lot of these states.
  2. Spendier states may have a regression to the mean, of sorts: they want more students taking these tests/going to college.
  3. Meanwhile, only the most motivated, ambitious few in some of these other states are taking the SAT (to go out of state).

Some quick and dirty quartiles by the percentage of the state taking the SAT will show a Simpson reversal.

Guber99 %>%
  # create quick-and-dirty quartiles for perc of state taking SAT
  mutate(group = as.factor(ntile(perctakers, 4))) %>%
  group_by(group) %>%
  summarize(r = cor(expendpp, total))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##   group     r
##   <fct> <dbl>
## 1 1     0.278
## 2 2     0.305
## 3 3     0.570
## 4 4     0.142

At every quartile of SAT test-taker percentage, there is a positive correlation between student expenditures and SAT performance statewide. The post on my website has a fancy pair of graphs to drive this home.

Ecological Fallacy

We have Robinson (1950) to thank for showing that ecological correlations are not substitutes for individual correlations. These data are available as illiteracy30 in my {stevedata} package.

?illiteracy30
illiteracy30
## # A tibble: 49 x 11
##    state    pop pop_il nwhite nwhite_il fpwhite fpwhite_il fbwhite fbwhite_il
##    <chr>  <dbl>  <dbl>  <dbl>     <dbl>   <dbl>      <dbl>   <dbl>      <dbl>
##  1 Maine 6.43e5  17172 4.06e5      5745  136817       2872   98310       8393
##  2 New … 3.82e5  10231 1.88e5      1155  111717       1211   81496       7820
##  3 Verm… 2.92e5   6299 1.84e5      1921   65382       1340   41695       3005
##  4 Mass… 3.51e6 124158 1.08e6      3247 1339077       6405 1042889     111568
##  5 Rhod… 5.60e5  27536 1.55e5       741  227909       1991  168975      24124
##  6 Conn… 1.32e6  59874 4.12e5      1271  507112       2230  378586      55136
##  7 New … 1.05e7 388883 3.49e6     19712 3506583      14942 3150593     341345
##  8 New … 3.33e6 128022 1.22e6      6336 1109603       5236  833727     107192
##  9 Penn… 7.73e6 240323 4.18e6     26002 1972545      10515 1221729     187942
## 10 Ohio  5.43e6 123804 3.41e6     26588 1129731       5799  637535      74131
## # … with 39 more rows, and 2 more variables: black <dbl>, black_il <dbl>

Robinson (1950) collected data from the 1930 Census, which unsurprisingly showed a discrepancy in illiteracy rates between the non-white and foreign-born populations and the white, native-born population in the U.S. FWIW, consider the time frame here. The public good of education has always been unequally allocated by race and immigrant status. Further, some basic background on which Europeans were migrating to the U.S. and why will underscore what you see here.

# for context: fbwhite = "foreign born white" (i.e. immigrant),
# nwhite = native-born white, 
# fpwhite = white from "foreign/mixed parentage"
illiteracy30 %>%
  summarize_if(is.numeric, sum) %>%
  gather(var) %>%
  separate(var, c("category","literacy"),"_") %>%
  mutate(literacy = ifelse(!is.na(literacy), "Illiterate", "Total Population")) %>%
  group_by(category) %>%
  summarize(prop = min(value)/max(value))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 5 rows [1, 3, 5,
## 7, 9].
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   category    prop
##   <chr>      <dbl>
## 1 black    0.163  
## 2 fbwhite  0.0987 
## 3 fpwhite  0.00559
## 4 nwhite   0.0183 
## 5 pop      0.0434

That literacy rates are lower among migrants at this point in time is not a controversial thing to say. But, incidentally, the state-level correlation between the foreign-born share of the population and the percentage of the population that is illiterate is negative. The only caveat to drawing a single line here is that I see a curvilinear relationship, even if that’s immaterial for illustrating the ecological fallacy.

illiteracy30 %>%
  mutate(foreignp = fbwhite/pop,
         illiterate = pop_il/pop,
         fbilliterate = fbwhite_il/fbwhite) %>%
  ggplot(.,aes(foreignp, illiterate)) + geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Why? Simple: immigrants were moving to places with lower illiteracy rates, which generally had more economic opportunity (e.g. NY, NJ). Basically, ecological correlations are not substitutes for individual correlations.

Central Limit Theorem

Central limit theorem says:

  1. with an infinite number of samples of size n…
  2. from a population of N units…
  3. the sample means will be normally distributed.
  4. Further: the mean of the sample means would equal the population mean.
  5. Random sampling error would equal the standard error of the sample mean (written as a formula below).
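Writing that last point as a formula (standard notation, nothing specific to our data), the standard error of the sample mean is

$$\textrm{S.E.}(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$

where sigma is the population standard deviation and n is the sample size.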

Let’s illustrate this with some thermometer rating data. This is Trump’s therm score from the ANES exploratory testing survey done April 10-18, 2020. I love thermometer data to teach central limit theorem because it shows the underlying principles, even in data that are noisy/ugly as hell. First, read more about the data and see it here. Just so we’re on the same page: higher values indicate “warmer” attitudes toward Trump (i.e. more favorable attitudes). Lower values indicate “colder” attitudes toward Trump (i.e. less favorable attitudes).

?therms
therms
## # A tibble: 3,080 x 2
##    fttrump1 ftobama1
##       <dbl>    <dbl>
##  1       80       75
##  2       85       15
##  3       NA       NA
##  4       80       50
##  5        0      100
##  6       35        5
##  7       30       90
##  8       60       50
##  9        0       85
## 10        1       85
## # … with 3,070 more rows

Observe how ugly as hell the distribution of this variable is.

therms %>%
  ggplot(.,aes(fttrump1)) + geom_bar()
## Warning: Removed 7 rows containing non-finite values (stat_count).

These data are ugly, right? Even as the mean and median are not too far removed from each other, the large standard deviation on a bounded 0-100 scale suggests something is weird about the data (that you just saw).

therms %>% select(fttrump1) %>%
  na.omit %>%
  summarize_all(list(mean = mean,
                     median = median,
                     sd = sd))
## # A tibble: 1 x 3
##    mean median    sd
##   <dbl>  <dbl> <dbl>
## 1  42.4     40  38.8

So let’s do something here. Let’s treat this therm score as the “population.” I.e., these 3,000 or so non-missing observations constitute the full universe of thermometer ratings for Donald Trump. Now we want to see if we can approximate it by sampling from it.

therms %>%
  select(fttrump1) %>% na.omit %>%
  pull(fttrump1) -> Trump

Now check this shit out. First, let’s set a reproducible seed for sampling so you and I get the same samples back. PRO TIP: you should always set a reproducible seed for any simulation. It’s good practice for max reproducibility. We are going to take 1,000,000 samples of 10 observations from the data and report the means of them. Let’s be clear: no one should ever strive to have a random sample of 10 people. But we’re going for something here:

set.seed(8675309) # Jenny, I got your number...
trumpsamples <- tibble(
  samplemean = sapply(1:1000000, function(i) {
    mean(sample(Trump, 10, replace = FALSE))
  }))

^ this might take a few seconds. You’re doing a million samples, after all. Let’s plot the distribution of our sample means. Here’s a histogram underneath a density function of the normal distribution with the parameters derived from the simulations.

trumpsamples %>%
  ggplot(.,aes(samplemean)) + geom_histogram(binwidth=.5,aes(y=..density..)) +
  stat_function(fun=dnorm,
                color="red",
                args=list(mean=mean(trumpsamples$samplemean),
                          sd=sd(trumpsamples$samplemean))) +
  xlab("Sample Mean") + ylab("Density")

Looks “normal”, right?

Before going too much further, let’s talk about a normal distribution. Underlying concept: data routinely cluster around some central tendency and deviations from it are less common. We owe the articulation/formalization of this concept to Carl Friedrich Gauss, who discovered this bell curve function. In Gauss’ formulation:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where mu is the mean and sigma^2 is the variance. Some important properties of the normal density function/Gaussian distribution:

  1. The tails are asymptotic to zero. They approach it but don’t touch it.
  2. The kernel inside the exponent is a basic parabola. The negative sign flips it downward.
  3. It’s denoted as a function and not a probability because it is a continuous distribution. The probability of any one value is near zero.
  4. The distribution is perfectly symmetrical about mu. A value x units above mu is as likely as a value x units below it.
  5. x is unrestricted. It can be any value.
  6. mu defines the central tendency.
  7. sigma defines how short/wide the distribution is.
  8. A special case occurs when mu is zero and sigma is 1. The formula becomes a lot simpler (shown below). I’ll belabor that bit later.
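For the record, here is that special case (mu = 0, sigma = 1), the standard normal, written out:

$$f(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$$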

Anywho, how did we do?

mean(Trump)
## [1] 42.41913
mean(trumpsamples$samplemean)
## [1] 42.41954

Not bad. We were off by around 4/10000th of a point. A million is a lot, but not infinity after all. The principle that emerges is still clear.

Central limit theorem shows it’s ideal to take a shitload of samples to guess a population parameter from it. Sometimes, that’s not practical. You really can’t take infinity samples of the United States in a time-invariant way. So, there’s another way: get a good-sized random sample. If you understand random sampling error, you’ll know you can’t do much about the variation inherent in the population. In our case, these data are ugly as hell. What you can do, however, is reduce the sampling error by increasing the size of the sample.

Here’s how we’ll do it. We’re going to take 10 random samples each of sizes 10, 100, 400, and 1,000. Observe:

moresamples <- tibble()

set.seed(8675309) # Jenny, I got your number...
for(j in c(10, 100, 400, 1000)) {
  hold_this <- tibble(
    x = j,
    y = sapply(1:10, function(i) {
      mean(sample(Trump, j, replace = FALSE))
    }))
  moresamples <- bind_rows(moresamples, hold_this)
}

Now let’s look at the results. In the code below, each dot coincides with one of 10 sample means of a given sample size (10, 100, 400, and 1000). The dashed horizontal line communicates the actual population mean as we understand it.

moresamples %>%
  ggplot(.,aes(x, y)) + geom_point(size = 3) +
  scale_x_continuous(breaks = c(10, 100, 400, 1000)) +
  xlab("Sample Size") + ylab("Sample Mean") +
  geom_hline(aes(yintercept=mean(Trump)), linetype="dashed")

Here’s how you should interpret the plot above. Notice that 10 samples of 10 gets you results all over the place. This is unsurprising. We have a few samples. The sample size is small. The variation in the population is huge. We’re all over the place. Things start to get better at sample sizes of 100 and 400. Notice how increasing the sample size increases the accuracy of the sample mean. They’re starting to bunch around the actual population mean. The 10 samples of size 1,000 are all hovering closely around the actual population mean.

PRO TIP: survey researchers have learned that a national poll “sweet spot” is about 1,000 respondents. Most polls you see are thereabouts in size. It’s a “bang for your buck” thing. Fewer than 1,000 observations and your random sampling error starts to get appreciably bigger. Any more than that and, generally, you might find yourself careening into Literary Digest Poll territory. To be clear: others do some large-N surveys quite well, but increasing the sample size just to increase it does incentivize some non-random sampling. Plus, it’s more expensive and the “payout” from a larger sample size may not be worth the cost. It’s why you get that sweet spot you routinely observe from opinion polls.

I mentioned above that a special case of the normal distribution occurs when mu is zero and sigma is 1. It allows for an easier summary of the distribution and the areas underneath it. To reiterate: the probability of any one particular value is basically zero (since it’s continuous), but the area underneath the distribution constitutes the full domain and sums to 1. There are some set intervals around the mu that emerge in interesting ways. Observe where the cutoffs are for 68% of the distribution around the mean, 90%, 95%, and 99%.
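If you want to see where those cutoffs come from without eyeballing a graph, here’s a quick base R sketch (nothing here depends on our data):

# z-values that bound the middle 90%, 95%, and 99% of a standard normal
qnorm(c(.95, .975, .995))  # ~1.645, ~1.96, ~2.58
# ...and the area between -1 and 1 (the "68%" you'll see below)
pnorm(1) - pnorm(-1)       # ~.683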

# For ease, just use normal_dist() in {stevemisc} 
ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm,
                xlim = c(qnorm(.05),abs(qnorm(.05))), size=0,
                geom = "area", fill="#F66733", alpha=0.5) +
  stat_function(fun = dnorm,
                xlim = c(qnorm(.025),abs(qnorm(.025))), size=0,
                geom = "area", fill="#F66733", alpha=0.4) +
  stat_function(fun = dnorm,
                xlim = c(qnorm(.005),abs(qnorm(.005))), size=0,
                geom = "area", fill="#F66733", alpha=0.3) +
  geom_segment(x=1, y=0, xend=1, yend=dnorm(1,0,1), color="white", linetype="dashed") +
  geom_segment(x=-1, y=0, xend=-1, yend=dnorm(1,0,1), color="white", linetype="dashed") +
  annotate(geom = "text", x = 0, y = 0.2,
           label = "68%", size =4.5, color="white") +
  geom_segment(x=-0.15, y=.2, xend=-.99, yend=.2, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  geom_segment(x=0.15, y=.2, xend=.99, yend=.2, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  annotate(geom = "text", x = 0, y = 0.1,
           label = "90%", size =4.5, color="white") +
  geom_segment(x=-0.15, y=.1, xend=-1.64, yend=.1, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  geom_segment(x=0.15, y=.1, xend=1.64, yend=.1, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  annotate(geom = "text", x = 0, y = 0.05,
           label = "95%", size =4.5, color="white") +
  geom_segment(x=-0.15, y=.05, xend=-1.95, yend=.05, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  geom_segment(x=0.15, y=.05, xend=1.95, yend=.05, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  annotate(geom = "text", x = 0, y = 0.01,
           label = "99%", size =4.5, color="white") +
  geom_segment(x=-0.15, y=.01, xend=-2.57, yend=.01, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  geom_segment(x=0.15, y=.01, xend=2.57, yend=.01, color="white",
               arrow = arrow(length = unit(0.15, "cm"))) +
  stat_function(fun = dnorm, color="#522D80", size=1.5) +
  scale_x_continuous(breaks=c(-4, -2.58, -1.96, -1.645, -1, 0,
                              1, 1.645, 1.96, 2.58, 4)) +
  ggtitle("The Area Underneath a Normal Distribution")

It’s why standardization is useful, among other things. Standardization (where the mean is zero and the sd = 1) is easy to calculate.
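That is, in standard notation (here, the mean and standard deviation of our million sample means play the part of mu and sigma):

$$z = \frac{x - \mu}{\sigma}$$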

trumpsamples %>%
  mutate(z = (samplemean-mean(samplemean))/sd(samplemean)) -> trumpsamples

Let’s compare our standardized million sample means to the basic normal distribution.

trumpsamples %>%
ggplot(., aes(z)) + geom_density() +
  stat_function(fun=dnorm,
                color="red"
  ) +
  xlab("Sample Mean (Standardized)") + ylab("Density") +
  ggtitle("Distribution of Sample Means from a Million Samples of Ten Respondents") +
  labs(caption="Data: 2018 ANES pilot study.")

It also allows us to make some kinds of inferential statements. So let’s do this: We are going to collect 100 observations, randomly sampled, from this “population.” We’re playing god here, if you will. We know what the “population” mean is. We also know the “population” standard deviation. But, let’s see what we can do here:

set.seed(8675309) # Jenny, I got your number...
oursample <- tibble(x = sample(Trump, 100, replace = FALSE))
mean(oursample$x)
## [1] 43.41

Let’s get a summary of what we did.

oursample %>%
  summarize(sigma = sd(Trump), # sd of the "population"
            mu = mean(Trump), # mean of the "population"
            mean = mean(x), # mean of our sample
            sem = sigma/sqrt(nrow(.)), # standard  error of the sample mean
            lwr = mean - 1.96*sem,
            upr = mean + 1.96*sem) -> oursamplesummaries

oursamplesummaries
## # A tibble: 1 x 6
##   sigma    mu  mean   sem   lwr   upr
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  38.8  42.4  43.4  3.88  35.8  51.0

Here’s what we did in that code:

  1. We collected the known mean and sigma (standard deviation) from the “population”.
  2. We collected the mean from our random sample.
  3. We calculated the standard error of the sample mean (sem), which equals sigma over the square root of the number of observations in the sample.
  4. We calculated the 95% confidence interval around the sample mean, given what we learned about the normal distribution above.

That interval is the one in which 95% of all possible sample estimates will fall by chance. If we took 100 samples of n = 100, 95 of those random samples on average would have sample means between lwr and upr. We’re not saying, for the moment, that’s where the true population mean is. We don’t necessarily know that. But even this has some nice properties. Question: how likely was our sample mean, given the actual population mean? You can answer this question by getting the z value, which communicates distance from a proposed mu. Here, that’s the actual mu, which we know. Then, you can calculate the probability of observing that z-value.
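In notation (standard stuff, just restating what the next code chunk computes), that z value is

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$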

oursamplesummaries %>%
  mutate(z = (mean - mu)/sem,
         p = 1-pnorm(abs(z)))
## # A tibble: 1 x 8
##   sigma    mu  mean   sem   lwr   upr     z     p
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  38.8  42.4  43.4  3.88  35.8  51.0 0.255 0.399

Here’s how you’d interpret this: if mu were actually ~42.4 (or whatever exact value it is), the probability that we observed our x-bar of 43.4 (rounded in the output you see) is around .399. That’s kind of a probable result. Our sample mean is obviously not the population mean, but it’s in orbit and a probabilistic draw from the population. What if we were to test against a claim that Trump’s thermometer rating is actually 73.88? Source: this is the vote share he got county-wide (Pickens, SC) in 2016. To be clear, vote share != thermometer rating, but let’s have fun with it.

oursamplesummaries %>%
  mutate(z = (mean - 73.88)/sem,
         p = 1-pnorm(abs(z)))
## # A tibble: 1 x 8
##   sigma    mu  mean   sem   lwr   upr     z        p
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1  38.8  42.4  43.4  3.88  35.8  51.0 -7.84 2.22e-15

Here’s how you’d interpret this: if the true population mean were 73.88 (which it clearly isn’t), the sample mean we got is almost 8 standard errors (!) away from it. The probability we observed it is something like .0000000000000002220446. We can reject the claim that 73.88 is the population mean with a fair bit of confidence. In fact, we know it’s not.

What about 50? Source: Trump would brag about a 50% approval rating, when he can find a poll that shows it, even though no other president would brag about that. You can do the same thing, with the same caveat that approval rating != thermometer rating, but alas…

oursamplesummaries %>%
  mutate(z = (mean - 50)/sem,
         p = 1-pnorm(abs(z)))
## # A tibble: 1 x 8
##   sigma    mu  mean   sem   lwr   upr     z      p
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1  38.8  42.4  43.4  3.88  35.8  51.0 -1.70 0.0449

The sample statistic we got suggests that, if Trump’s thermometer rating is truly 50, the probability of us observing the sample mean we got is .0449. That’d be a fluke draw. So, we should reject the claim that Trump’s thermometer rating is 50 (in fact, we know it’s not) and suggest our sample mean is closer to what it actually is. Which is true. We’ve been playing god this entire time.

To this point, we’ve been assuming we know the population sigma, if not the mu, in calculating our standard error of the sample mean. This is a bit unrealistic, obviously. To know sigma is to also know mu. What about when you don’t know the population parameters at all? Student’s t-distribution is your answer. It looks like the normal distribution, but with fatter tails at fewer degrees of freedom.
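Formally (standard notation, nothing unique to our data), the test statistic has the same construction as the z above, just with the sample standard deviation s standing in for the population sigma:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$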

So, let’s test against that hypothetical claim of 50 again, this time without pretending we know sigma.

oursample %>%
  summarize(mean = mean(x),
            sem = sd(x)/sqrt(nrow(.)), # notice the numerator
            lwr = mean - 1.96*sem,
            upr = mean + 1.96*sem,
            t = (mean - 50)/sem,
            p = pt(t, nrow(.)))
## # A tibble: 1 x 6
##    mean   sem   lwr   upr     t      p
##   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1  43.4  3.87  35.8  51.0 -1.70 0.0460

Notice the results look very similar to what we just did? It’s because the t-distribution converges on the normal distribution with more degrees of freedom (loosely: more observations, after considering parameters to be estimated). More formally: degrees of freedom = number of obs. (n) - parameters to be estimated (k). Here, we just want a simple mean (k = 1).
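For a quick numeric check of that convergence (base R, nothing specific to our data), compare the 95% critical values:

# t critical value with 99 degrees of freedom (our n = 100, k = 1) vs. the normal
qt(.975, df = 99)    # ~1.984
qnorm(.975)          # ~1.96
# with more degrees of freedom, the difference effectively disappears
qt(.975, df = 1000)  # ~1.962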

ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dt, args = list(df = 1), color="#026CCBFF") +
  stat_function(fun = dt, args = list(df = 10), color="#F51E02FF") +
  stat_function(fun = dt, args = list(df = 1000), color="#05B102FF") +
  stat_function(fun = dt, args = list(df = Inf), color="#FB9F53FF") +
  stat_function(fun = dnorm, color="black")