Configure RStudio

When you’re opening R for the very first time, it’ll be useful to just get a general sense of what’s happening. I have a beginner’s guide that I wrote in 2014 (where did the time go!). Notice that I built it around RStudio, which you should download as well. RStudio Desktop is free. Don’t pay for a “pro” version. You’re not running a server. You won’t need it.

When you download and install RStudio on top of R, you should customize it just a tiny bit to make the most of the graphical user interface. To do what I recommend doing, select “Tools” in the menu and scroll to “Global Options” (which should be at the bottom). In the pop-up, select “Pane Layout.” Rearrange the panes so that “Source” is in the top left, “Console” is in the top right, and the files/plots/packages pane is in the bottom right. Then apply the changes.

You don’t have to do this, but I think you should since it better economizes space in RStudio. The remaining pane (environment/history, Git, etc.) holds stuff you can either learn to live without (e.g. what’s in the environment) or will only situationally need at an advanced level (e.g. Git information). Minimize that pane outright. When you’re in RStudio, much of what you’ll be doing leans on the script window and the console window. You’ll occasionally use the file browser and plot panes as well.

If you have not done so already, open a new script (Ctrl-Shift-N in Windows/Linux or Cmd-Shift-N in Mac).

Get Acclimated in R

Now that you’ve done that, let’s get a general sense of where you are in an R session.

Current Working Directory

First, let’s start with identifying the current working directory. You should know where you are and this happens to be where I am, given the location of this script.

getwd()
## [1] "/home/steve/Dropbox/teaching/post8000/lab-scripts"

Of note: by default, R’s working directory is the system’s “home” directory. This is somewhat straightforward in Unix-derivative systems, where there is an outright “home” directory. Assume your username is “steve”, then, in Linux, your home directory will be “/home/steve”. In Mac, I think it’s something like “/Users/steve”. Windows users will invariably have something clumsy like “C:/Users/steve/Documents”. Notice the forward slashes. R, like everything else in the world, uses forward slashes. The backslashes owe to Windows’ derivation from DOS.
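Relatedly, you can change the working directory with setwd(). Here is a minimal sketch that uses R's temporary directory just so the example runs anywhere; in practice you'd supply your own project folder (e.g. something like "/home/steve/my-project").

```r
# Change the working directory and confirm the move.
# tempdir() is just a folder guaranteed to exist on any system.
old_wd <- getwd()   # remember where we started
setwd(tempdir())    # change the working directory
getwd()             # confirm where you are now
setwd(old_wd)       # go back to where we started
```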

Create “Objects”

Next, let’s create some “objects.” R is primarily an “object-oriented” programming language. In as many words, inputs create outputs that may be assigned to objects in the workspace. You can go nuts here. Of note: I’ve seen R programmers use =, ->, and <- interchangeably for object assignment, but I’ve seen instances where = doesn’t work as intended for object assignment. -> is an option and I use it for assignment for some complex objects in a “pipe” (more on that later). For simple cases (and for beginners), lean on <-.

a <- 3
b <- 4 
A <- 7
a + b
## [1] 7
A + b
## [1] 11
# what objects did we create?
# Notice we did not save a + b or A + b to an object
# Also notice how a pound sign creates a comment? Kinda cool, right? Make comments to yourself.
ls()
## [1] "a" "A" "b"

Some caution, though. First, don’t create objects with really complex names. To call them back requires getting every character right in the console or script. Why inconvenience yourself? Second, R comes with some default objects that are kinda important and can seriously ruin things downstream. I don’t know off the top of my head all the default objects in R, but there are some important ones like pi, TRUE, and FALSE that you DO NOT want to overwrite. You can, however, assign some built-in objects to new objects.

this_Is_a_long_AND_WEIRD_objEct_name_and_yOu_shoUld_not_do_this <- 5
pi # notice there are a few built-in functions/objects
## [1] 3.141593
d <- pi # you can assign one built-in object to a new object.
# pi <- 3.14 # don't do this....

If you do something dumb (like overwrite TRUE with something), all hope is not lost. However, your session is. Restart R and you’ll reclaim whatever built-in objects you overwrote.
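Short of restarting, you can also remove your copy of the object with rm(). Your assignment only "masks" the built-in version in your workspace; deleting your copy makes the built-in visible again. A quick sketch with pi (which, again, you should not actually overwrite):

```r
pi <- 3.14   # oops: your copy now masks the built-in pi
pi           # 3.14
rm(pi)       # delete your copy from the workspace...
pi           # ...and the built-in is back: 3.141593
```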

Install/Load Libraries

R depends on user-created libraries for much of its functionality. This class will lean on just a few R libraries. The first, {tidyverse}, is our workhorse for workflow. It’ll also be the longest to install because it comes with lots of dependencies to maximize its functionality. The second, {devtools}, is my go-to interface for downloading development packages off Github (and I’ll eventually ask you to download/load my toy R package, {stevemisc}). My hunch is installation of this package will probably give some of you Mac users some headaches. The answer to these headaches is probably “update Xcode”. The third package, {stevedata}, includes a whole host of toy data I’ve created over time and is available on CRAN. If you have yet to install these packages (and you almost certainly have not if you’re opening R for the first time), install them as follows. Note that I’m just commenting out this command so it doesn’t run when I compile this script on my end.

# Take out the comment...
# install.packages(c("tidyverse","devtools", "stevedata"))

Once they’re installed, load the libraries with the library() command. Of note: you only need to install a package once, but you’ll need to load the library for each R session. A caveat, though: I may ask you to upgrade {stevedata} at some point in the semester (and probably through Github) if I add a new data set for the lab session.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(devtools)
## Loading required package: usethis
library(stevedata)

Load Data

The next thing you’ll want to do is load data into R (and assign it to an object). You can do this any number of ways. For what it’s worth, the toy data I’ll be using throughout the semester will (ideally) all be in {stevedata}. The data I’ll be using down the script is built into {stevedata}. However, you can load data from the hard drive or even the internet. Some commands you’ll want to learn:

  • haven::read_dta(): for loading Stata .dta files
  • haven::read_spss(): for loading SPSS binaries
  • read_csv(): for loading comma-separated values (CSV) files
  • readxl::read_excel(): for loading MS Excel spreadsheets.
  • read_tsv(): for tab-separated values (TSV) files
  • readRDS(): for R serialized data frames, which are awesome for file compression/speed

These wrappers are also flexible with files on the internet. For example, the following will work. Just remember to assign them to an object.

# Note: hypothetical data
Apply <- haven::read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
# County unemployment
Cunemp <- read_tsv("https://download.bls.gov/pub/time.series/la/la.data.64.County") 
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   series_id = col_character(),
##   year = col_double(),
##   period = col_character(),
##   value = col_double(),
##   footnote_codes = col_character()
## )
## Warning: 960 parsing failures.
##     row   col expected actual                                                            file
## 1821664 value a double      - 'https://download.bls.gov/pub/time.series/la/la.data.64.County'
## 1821665 value a double      - 'https://download.bls.gov/pub/time.series/la/la.data.64.County'
## 1821666 value a double      - 'https://download.bls.gov/pub/time.series/la/la.data.64.County'
## 1821667 value a double      - 'https://download.bls.gov/pub/time.series/la/la.data.64.County'
## 1821668 value a double      - 'https://download.bls.gov/pub/time.series/la/la.data.64.County'
## ....... ..... ........ ...... ...............................................................
## See problems(...) for more details.
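On the readRDS() point: its counterpart for writing is saveRDS(). Here's a hedged sketch of the round-trip, using a temporary file and a built-in data set (mtcars) so the example runs anywhere.

```r
# Write a data frame as a serialized R object, then read it back.
tmp <- tempfile(fileext = ".rds")
saveRDS(mtcars, tmp)    # write to disk
dat <- readRDS(tmp)     # read it back (assign it to an object!)
identical(dat, mtcars)  # TRUE: a faithful round-trip
```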

Learn Some Important R/“Tidy” Functions

Some R packages, like my {stevedata} package, have built-in data. For example, my pwt_sample data is just a sample of 21 rich countries with information about population size (in millions), human capital per person, real GDP at constant prices (million 2011 USD), and the labor share of income for years spanning 1950 to 2017. Let’s make sure the data are loaded in the environment (they are by default) with the data() command and see the variable names in these data.

data(pwt_sample)
names(pwt_sample)
## [1] "country" "isocode" "year"    "pop"     "hc"      "rgdpna"  "labsh"

You can also type help(pwt_sample) in the R console to learn more about these data.

I want to dedicate the bulk of this section to learning some core functions that are part of the {tidyverse}. My introduction here will inevitably be incomplete because there’s only so much I can teach within the limited time I have. That said, I’m going to focus on the following functions available in the {tidyverse} that totally rethink base R. These are the “pipe” (%>%), glimpse() and summary(), select(), group_by(), summarize(), mutate(), and filter().

The Pipe (%>%)

I want to start with the pipe because I think of it as the most important function in the {tidyverse}. The pipe—represented as %>%—allows you to chain together a series of functions. The pipe is especially useful if you’re recoding data and you want to make sure you got everything the way you wanted (and correct) before assigning the data to another object. You can chain together a lot of {tidyverse} commands with pipes, but we’ll keep our introduction here rather minimal because I want to use it to teach about some other things.
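As a quick illustration before moving on: the pipe takes whatever is on its left and feeds it to the function on its right, so you read a chain left to right instead of inside out. The toy vector below is just for demonstration; the two approaches return the same thing.

```r
library(tidyverse)

x <- c(1, 4, 9, 16, 25)
# Nested functions read inside-out...
sqrt(mean(x))
# ...while the pipe reads left to right.
x %>% mean() %>% sqrt()
```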

glimpse() and summary()

glimpse() and summary() will get you basic descriptions of your data. Personally, I find summary() more informative than glimpse(), though glimpse() is useful if your data have a lot of variables and you just want to peek into the data without spamming the R console with output.

Notice, here, the introduction of the pipe (%>%). In the commands below, pwt_sample %>% glimpse() is equivalent to glimpse(pwt_sample), but I like to lean more on pipes than perhaps others would. My workflow starts with (data) objects, applies various functions to them, and assigns them to objects. I think you’ll get a lot of mileage thinking that same way too.

pwt_sample %>% glimpse() # notice the pipe
## Rows: 1,428
## Columns: 7
## $ country <chr> "Australia", "Australia", "Australia", "Australia", "Australi…
## $ isocode <chr> "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS"…
## $ year    <dbl> 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1…
## $ pop     <dbl> 8.386674, 8.633449, 8.816668, 8.985786, 9.194855, 9.411000, 9…
## $ hc      <dbl> 2.667302, 2.674344, 2.681403, 2.688482, 2.695580, 2.702696, 2…
## $ rgdpna  <dbl> 119510.4, 122550.0, 117533.8, 130284.5, 140700.2, 146249.9, 1…
## $ labsh   <dbl> 0.6804925, 0.6804925, 0.6804925, 0.6804925, 0.6804925, 0.6804…
pwt_sample %>% summary()
##    country            isocode               year           pop          
##  Length:1428        Length:1428        Min.   :1950   Min.   :  0.1432  
##  Class :character   Class :character   1st Qu.:1967   1st Qu.:  7.3530  
##  Mode  :character   Mode  :character   Median :1984   Median : 11.2006  
##                                        Mean   :1984   Mean   : 36.8008  
##                                        3rd Qu.:2000   3rd Qu.: 52.7539  
##                                        Max.   :2017   Max.   :324.4595  
##                                                       NA's   :2         
##        hc            rgdpna             labsh       
##  Min.   :1.242   Min.   :    1098   Min.   :0.3286  
##  1st Qu.:2.440   1st Qu.:  137609   1st Qu.:0.5761  
##  Median :2.809   Median :  302889   Median :0.6313  
##  Mean   :2.784   Mean   : 1044426   Mean   :0.6137  
##  3rd Qu.:3.165   3rd Qu.: 1021393   3rd Qu.:0.6565  
##  Max.   :3.758   Max.   :17711024   Max.   :0.7701  
##  NA's   :2       NA's   :2          NA's   :2

select()

select() is useful for basic (but important) data management. You can use it to grab (or omit) columns from data. For example, let’s say I wanted to grab all the columns in the data. I could do that with the following command.

pwt_sample %>% select(everything())  # grab everything
## # A tibble: 1,428 x 7
##    country   isocode  year   pop    hc  rgdpna labsh
##    <chr>     <chr>   <dbl> <dbl> <dbl>   <dbl> <dbl>
##  1 Australia AUS      1950  8.39  2.67 119510. 0.680
##  2 Australia AUS      1951  8.63  2.67 122550. 0.680
##  3 Australia AUS      1952  8.82  2.68 117534. 0.680
##  4 Australia AUS      1953  8.99  2.69 130285. 0.680
##  5 Australia AUS      1954  9.19  2.70 140700. 0.680
##  6 Australia AUS      1955  9.41  2.70 146250. 0.680
##  7 Australia AUS      1956  9.64  2.71 146586. 0.680
##  8 Australia AUS      1957  9.85  2.72 149796. 0.680
##  9 Australia AUS      1958 10.1   2.73 159957. 0.680
## 10 Australia AUS      1959 10.3   2.74 169756. 0.680
## # … with 1,418 more rows

Here’s what it looks like if I wanted everything except the labor share of income variable.

pwt_sample %>% select(-labsh) # grab everything, but drop the labsh variable.
## # A tibble: 1,428 x 6
##    country   isocode  year   pop    hc  rgdpna
##    <chr>     <chr>   <dbl> <dbl> <dbl>   <dbl>
##  1 Australia AUS      1950  8.39  2.67 119510.
##  2 Australia AUS      1951  8.63  2.67 122550.
##  3 Australia AUS      1952  8.82  2.68 117534.
##  4 Australia AUS      1953  8.99  2.69 130285.
##  5 Australia AUS      1954  9.19  2.70 140700.
##  6 Australia AUS      1955  9.41  2.70 146250.
##  7 Australia AUS      1956  9.64  2.71 146586.
##  8 Australia AUS      1957  9.85  2.72 149796.
##  9 Australia AUS      1958 10.1   2.73 159957.
## 10 Australia AUS      1959 10.3   2.74 169756.
## # … with 1,418 more rows

Here’s a more typical case. Assume you’re working with a large data object and you just want a handful of things. In this case, we have all these economic data on these 21 countries (ed. we really don’t, but roll with it), but we just want the GDP data along with the important identifying information for country and year. Here’s how we’d do that in the select() function, again with some assistance from the pipe.

pwt_sample %>% select(country, year, rgdpna) # grab just these three columns.
## # A tibble: 1,428 x 3
##    country    year  rgdpna
##    <chr>     <dbl>   <dbl>
##  1 Australia  1950 119510.
##  2 Australia  1951 122550.
##  3 Australia  1952 117534.
##  4 Australia  1953 130285.
##  5 Australia  1954 140700.
##  6 Australia  1955 146250.
##  7 Australia  1956 146586.
##  8 Australia  1957 149796.
##  9 Australia  1958 159957.
## 10 Australia  1959 169756.
## # … with 1,418 more rows

group_by()

I think the pipe is probably the most important function in the {tidyverse} even as a critical reader might note that the pipe is 1) a port from another package ({magrittr}) and 2) now part of base R with slightly different syntax. Thus, the critical reader (and probably me, depending on my mood) may note that group_by() is probably the most important component of the {tidyverse}. Basically, group_by() allows you to “split” the data into various subsets, “apply” various functions to them, and “combine” them into one output. You might see that “split-apply-combine” terminology as you learn more about the {tidyverse} and its development.
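That base R pipe, for reference, is the |> operator introduced in R 4.1.0. For simple chains the two read the same way, though %>% is what you'll see throughout this script.

```r
library(tidyverse)

# {magrittr}'s pipe, exported by the {tidyverse}
mtcars %>% nrow()
# base R's native pipe (requires R >= 4.1.0)
mtcars |> nrow()
# both return 32
```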

Here, let’s do a simple group_by() exercise, while also introducing you to another function: slice(). We’re going to group by country in pwt_sample and “slice” the first observation for each group/country. Notice how we can chain these together with a pipe operator.

# Notice we can chain some pipes together
pwt_sample %>%
  # group by country
  group_by(country) %>%
  # Get me the first observation, by group.
  slice(1)
## # A tibble: 21 x 7
## # Groups:   country [21]
##    country   isocode  year   pop    hc  rgdpna  labsh
##    <chr>     <chr>   <dbl> <dbl> <dbl>   <dbl>  <dbl>
##  1 Australia AUS      1950  8.39  2.67 119510.  0.680
##  2 Austria   AUT      1950  6.98  2.55  47147.  0.637
##  3 Belgium   BEL      1950  8.63  2.20  76035.  0.651
##  4 Canada    CAN      1950 13.8   2.48 179072.  0.768
##  5 Chile     CHL      1950 NA    NA        NA  NA    
##  6 Denmark   DNK      1950  4.27  2.84  51441.  0.645
##  7 Finland   FIN      1950  4.01  2.12  27678.  0.669
##  8 France    FRA      1950 42.6   2.18 333156.  0.685
##  9 Germany   DEU      1950 68.7   2.43 442402.  0.672
## 10 Greece    GRC      1950 NA    NA        NA  NA    
## # … with 11 more rows

If you don’t group_by() the country first, slice(., 1) will just return the first observation in the data set.

pwt_sample %>%
  # Get me the first observation for each country
  slice(1) # womp womp. Forgot to group_by()
## # A tibble: 1 x 7
##   country   isocode  year   pop    hc  rgdpna labsh
##   <chr>     <chr>   <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 Australia AUS      1950  8.39  2.67 119510. 0.680

I offer one caveat here. If you’re applying a group-specific function (that you need just once), it’s generally advisable to ungroup() as the next function in your pipe chain. As you build together chains/pipes, the intermediate output you get will advise you of any “groups” you’ve declared in your data.
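Here's a minimal sketch of that advice with pwt_sample: a group-specific calculation, followed by an immediate ungroup() so downstream functions treat the data as one block again. The maxgdp variable here is purely illustrative.

```r
library(tidyverse)
library(stevedata)

pwt_sample %>%
  group_by(country) %>%
  # a group-specific function we need just this once
  mutate(maxgdp = max(rgdpna, na.rm = TRUE)) %>%
  # drop the grouping before doing anything else
  ungroup()
```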

summarize()

summarize() creates condensed summaries of your data, for whatever it is that you want. Here, for example, is a kind of dumb way of seeing how many observations are in the data. nrow(pwt_sample) works just as well, but alas…

pwt_sample %>%
  # How many observations are in the data?
  summarize(n = n())
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1428

More importantly, summarize() works wonderfully with group_by(). For example, for each country (group_by(country)), let’s get the maximum GDP observed in the data.

pwt_sample %>%
  group_by(country) %>%
  # Give me the max real GDP observed in the data.
  summarize(maxgdp = max(rgdpna, na.rm=T))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 21 x 2
##    country     maxgdp
##    <chr>        <dbl>
##  1 Australia 1215688 
##  2 Austria    380620.
##  3 Belgium    453158.
##  4 Canada    1647159.
##  5 Chile      399417.
##  6 Denmark    274272.
##  7 Finland    217679.
##  8 France    2565994.
##  9 Germany   3805884 
## 10 Greece     325783.
## # … with 11 more rows

One downside (or feature, depending on your perspective) to summarize() is that it condenses data and discards stuff that’s not necessary for creating the condensed output. In the case above, notice we didn’t ask for what year we observed the maximum GDP for a given country. We just asked for the maximum. If you wanted something that would also tell you what year that particular observation was, you’ll probably want a slice() command in lieu of summarize(). Observe:

pwt_sample %>%
  group_by(country) %>%
  slice(which(rgdpna == max(rgdpna, na.rm=T)))
## # A tibble: 21 x 7
## # Groups:   country [21]
##    country   isocode  year   pop    hc   rgdpna labsh
##    <chr>     <chr>   <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 Australia AUS      2017 24.5   3.52 1215688  0.586
##  2 Austria   AUT      2017  8.74  3.36  380620. 0.573
##  3 Belgium   BEL      2017 11.4   3.14  453158. 0.610
##  4 Canada    CAN      2017 36.6   3.71 1647159. 0.651
##  5 Chile     CHL      2017 18.1   3.11  399417. 0.440
##  6 Denmark   DNK      2017  5.73  3.56  274272. 0.613
##  7 Finland   FIN      2008  5.32  3.29  217679. 0.575
##  8 France    FRA      2017 67.2   3.19 2565994. 0.632
##  9 Germany   DEU      2017 82.1   3.67 3805884  0.618
## 10 Greece    GRC      2007 11.4   2.88  325783. 0.542
## # … with 11 more rows

This is a convoluted way of thinking about summarize(), but you’ll probably find yourself using it a lot.

mutate()

mutate() is probably the most important {tidyverse} function for data management/recoding. It will allow you to create new columns while retaining the original dimensions of the data. Consider it the sister function to summarize(). But, where summarize() discards, mutate() retains.

Let’s do something simple with mutate(). For example, the rgdpna column is real GDP in million 2011 USD. What if we wanted to convert that from millions to billions? This is simple with mutate(). Helpfully, the resulting data retain both the original/raw variable and the new/recoded variable. This is great for reproducibility in your data management.

pwt_sample %>%
  # Convert rgdpna from real GDP in millions to real GDP in billions
  mutate(rgdpnab = rgdpna/1000)
## # A tibble: 1,428 x 8
##    country   isocode  year   pop    hc  rgdpna labsh rgdpnab
##    <chr>     <chr>   <dbl> <dbl> <dbl>   <dbl> <dbl>   <dbl>
##  1 Australia AUS      1950  8.39  2.67 119510. 0.680    120.
##  2 Australia AUS      1951  8.63  2.67 122550. 0.680    123.
##  3 Australia AUS      1952  8.82  2.68 117534. 0.680    118.
##  4 Australia AUS      1953  8.99  2.69 130285. 0.680    130.
##  5 Australia AUS      1954  9.19  2.70 140700. 0.680    141.
##  6 Australia AUS      1955  9.41  2.70 146250. 0.680    146.
##  7 Australia AUS      1956  9.64  2.71 146586. 0.680    147.
##  8 Australia AUS      1957  9.85  2.72 149796. 0.680    150.
##  9 Australia AUS      1958 10.1   2.73 159957. 0.680    160.
## 10 Australia AUS      1959 10.3   2.74 169756. 0.680    170.
## # … with 1,418 more rows
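Since mutate() is your recoding workhorse, it's worth noting it pairs naturally with ifelse() for simple dummy variables. A hypothetical example (the post1990 variable is mine, not part of the data): a dummy for observations after 1990.

```r
library(tidyverse)
library(stevedata)

pwt_sample %>%
  # hypothetical recode: 1 if the year is after 1990, 0 otherwise
  mutate(post1990 = ifelse(year > 1990, 1, 0))
```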

Notice that mutate() also works beautifully with group_by(). For example, let’s assume, for whatever reason, we wanted a new variable that divided the real GDP variable over the observed maximum real GDP for a given country. I don’t know why we’d want that, but we could do it with group_by().

pwt_sample %>%
  group_by(country) %>%
  # divide rgdpna over the country's max, for some reason.
  mutate(rgdpnaprop = rgdpna/max(rgdpna, na.rm=T))
## # A tibble: 1,428 x 8
## # Groups:   country [21]
##    country   isocode  year   pop    hc  rgdpna labsh rgdpnaprop
##    <chr>     <chr>   <dbl> <dbl> <dbl>   <dbl> <dbl>      <dbl>
##  1 Australia AUS      1950  8.39  2.67 119510. 0.680     0.0983
##  2 Australia AUS      1951  8.63  2.67 122550. 0.680     0.101 
##  3 Australia AUS      1952  8.82  2.68 117534. 0.680     0.0967
##  4 Australia AUS      1953  8.99  2.69 130285. 0.680     0.107 
##  5 Australia AUS      1954  9.19  2.70 140700. 0.680     0.116 
##  6 Australia AUS      1955  9.41  2.70 146250. 0.680     0.120 
##  7 Australia AUS      1956  9.64  2.71 146586. 0.680     0.121 
##  8 Australia AUS      1957  9.85  2.72 149796. 0.680     0.123 
##  9 Australia AUS      1958 10.1   2.73 159957. 0.680     0.132 
## 10 Australia AUS      1959 10.3   2.74 169756. 0.680     0.140 
## # … with 1,418 more rows

filter()

filter() is a great diagnostic tool for subsetting your data to look at particular observations. Notice one little thing, especially if you’re new to programming: double equal signs (==) make logical comparisons, whereas a single equal sign (=) is for object assignment or column creation. If you’re using filter(), you’re probably wanting to find cases where something equals something (==), is greater than something (>), is greater than or equal to something (>=), is less than something (<), or is less than or equal to something (<=).

Here, let’s grab just the American observations by filtering to where isocode == “USA”.

pwt_sample %>%
  # give me just the USA observations
  filter(isocode == "USA")
## # A tibble: 68 x 7
##    country                  isocode  year   pop    hc   rgdpna labsh
##    <chr>                    <chr>   <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 United States of America USA      1950  156.  2.58 2246944. 0.628
##  2 United States of America USA      1951  158.  2.60 2428017  0.634
##  3 United States of America USA      1952  161.  2.61 2526887. 0.645
##  4 United States of America USA      1953  164.  2.62 2645510. 0.644
##  5 United States of America USA      1954  167.  2.63 2630592. 0.637
##  6 United States of America USA      1955  170.  2.65 2817940  0.627
##  7 United States of America USA      1956  173.  2.66 2878023  0.640
##  8 United States of America USA      1957  176.  2.68 2938621. 0.639
##  9 United States of America USA      1958  179.  2.69 2917016. 0.636
## 10 United States of America USA      1959  182.  2.71 3118356. 0.629
## # … with 58 more rows

We could also use filter() to select observations from the most recent year.

pwt_sample %>%
  # give me the observations from the most recent year.
  filter(year == max(year))
## # A tibble: 21 x 7
##    country     isocode  year   pop    hc   rgdpna labsh
##    <chr>       <chr>   <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 Australia   AUS      2017 24.5   3.52 1215688  0.586
##  2 Austria     AUT      2017  8.74  3.36  380620. 0.573
##  3 Belgium     BEL      2017 11.4   3.14  453158. 0.610
##  4 Canada      CAN      2017 36.6   3.71 1647159. 0.651
##  5 Switzerland CHE      2017  8.48  3.69  527023. 0.650
##  6 Chile       CHL      2017 18.1   3.11  399417. 0.440
##  7 Germany     DEU      2017 82.1   3.67 3805884  0.618
##  8 Denmark     DNK      2017  5.73  3.56  274272. 0.613
##  9 Spain       ESP      2017 46.4   2.94 1557162. 0.574
## 10 Finland     FIN      2017  5.52  3.47  216303. 0.576
## # … with 11 more rows
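You can also combine conditions in a single filter() call; separating conditions with a comma works like a logical "and" (&). A hypothetical example: the American observations from 2010 onward.

```r
library(tidyverse)
library(stevedata)

pwt_sample %>%
  # USA observations in 2010 and later
  filter(isocode == "USA", year >= 2010)
```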

Don’t Forget to Assign!

When you’re done applying functions/doing whatever to your data, don’t forget to assign what you’ve done to an object. For simple cases, and for beginners, I recommend thinking “left-handed” and using <- for object assignment (as we did above). When you’re doing stuff in the pipe, my “left-handed” thinking prioritizes the starting data in the pipe chain. Thus, I tend to use -> for object assignment at the end of the pipe.

Consider a simple example below. I’m starting with the original data (pwt_sample). I’m using a simple pipe to create a new variable (within mutate()) that standardizes the real GDP variable from millions to billions. Afterward, I’m assigning it to a new object (Data) with ->.

pwt_sample %>%
  # convert real GDP to billions
  mutate(rgdpnab = rgdpna/1000) -> Data