In this lab you can use the interactive console to explore but please record your commands here. Remember anything you type here can be “sent” to the console with Cmd-Enter (OS-X) or Ctrl-Enter (Windows/Linux) (But only inside the code chunks designated with the {r} areas).

Part 1

Now we will work with the county_pop data from the same case study.

# don't forget to load the packages that you will need!
library(tidyverse)

Import the data from “https://jhudatascience.org/intro_to_r/data/county_pop.csv” using read_csv and assign the data to an object called county_pop. Check that it worked by seeing if you have the county_pop data.

county_pop <- read_csv("https://jhudatascience.org/intro_to_r/data/county_pop.csv")

## Rows: 28265 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): BUYER_COUNTY, BUYER_STATE, countyfips, county_name, NAME, variable
## dbl (4): STATE, COUNTY, year, population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(county_pop)

## # A tibble: 6 × 10
##   BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME     variable
##   <chr>        <chr>       <chr>      <dbl>  <dbl> <chr>       <chr>    <chr>   
## 1 AUTAUGA      AL          01001          1      1 Autauga     Autauga… B01003_…
## 2 BALDWIN      AL          01003          1      3 Baldwin     Baldwin… B01003_…
## 3 BARBOUR      AL          01005          1      5 Barbour     Barbour… B01003_…
## 4 BIBB         AL          01007          1      7 Bibb        Bibb Co… B01003_…
## 5 BLOUNT       AL          01009          1      9 Blount      Blount … B01003_…
## 6 BULLOCK      AL          01011          1     11 Bullock     Bullock… B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>

1.1

What class is county_pop?

class(county_pop)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

1.2

How many observations (rows) and variables (columns) are in the dataset - try the dim() function?

dim(county_pop)

## [1] 28265    10

nrow(county_pop)

## [1] 28265

ncol(county_pop)

## [1] 10

1.3

Next, rename the column BUYER_STATE to be State_name (hint - use rename() and watch out for the order of the new and old names!).

county_pop <- rename(county_pop, State_name = BUYER_STATE)
head(county_pop)

## # A tibble: 6 × 10
##   BUYER_COUNTY State_name countyfips STATE COUNTY county_name NAME      variable
##   <chr>        <chr>      <chr>      <dbl>  <dbl> <chr>       <chr>     <chr>   
## 1 AUTAUGA      AL         01001          1      1 Autauga     Autauga … B01003_…
## 2 BALDWIN      AL         01003          1      3 Baldwin     Baldwin … B01003_…
## 3 BARBOUR      AL         01005          1      5 Barbour     Barbour … B01003_…
## 4 BIBB         AL         01007          1      7 Bibb        Bibb Cou… B01003_…
## 5 BLOUNT       AL         01009          1      9 Blount      Blount C… B01003_…
## 6 BULLOCK      AL         01011          1     11 Bullock     Bullock … B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>

1.4

Convert the column names of county_pop to be all lower case. Use rename_with(), and the tolower command.

county_pop <- rename_with(county_pop, tolower)
head(county_pop)

## # A tibble: 6 × 10
##   buyer_county state_name countyfips state county county_name name      variable
##   <chr>        <chr>      <chr>      <dbl>  <dbl> <chr>       <chr>     <chr>   
## 1 AUTAUGA      AL         01001          1      1 Autauga     Autauga … B01003_…
## 2 BALDWIN      AL         01003          1      3 Baldwin     Baldwin … B01003_…
## 3 BARBOUR      AL         01005          1      5 Barbour     Barbour … B01003_…
## 4 BIBB         AL         01007          1      7 Bibb        Bibb Cou… B01003_…
## 5 BLOUNT       AL         01009          1      9 Blount      Blount C… B01003_…
## 6 BULLOCK      AL         01011          1     11 Bullock     Bullock … B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>

dim(county_pop)

## [1] 28265    10

Practice on Your Own!

P.1

How can you show the first 3 rows and the last 3 rows of county_pop (in two lines of code)?

head(county_pop, 3)

## # A tibble: 3 × 10
##   buyer_county state_name countyfips state county county_name name      variable
##   <chr>        <chr>      <chr>      <dbl>  <dbl> <chr>       <chr>     <chr>   
## 1 AUTAUGA      AL         01001          1      1 Autauga     Autauga … B01003_…
## 2 BALDWIN      AL         01003          1      3 Baldwin     Baldwin … B01003_…
## 3 BARBOUR      AL         01005          1      5 Barbour     Barbour … B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>

tail(county_pop, 3)

## # A tibble: 3 × 10
##   buyer_county  state_name countyfips state county county_name   name   variable
##   <chr>         <chr>      <chr>      <dbl>  <dbl> <chr>         <chr>  <chr>   
## 1 SKAGWAY       AK         02230          2    230 Skagway       Skagw… B01003_…
## 2 HOONAH ANGOON AK         02105          2    105 Hoonah Angoon Hoona… B01003_…
## 3 PETERSBURG    AK         02195          2    195 Petersburg    Peter… B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>

Part 2

2.1

Create a subset of the county_pop that only contains the columns: county_name, year, and population and assign this object to pop_sub - what are the dimensions of this dataset?

pop_sub <- select(county_pop, county_name, year, population)
dim(pop_sub)

## [1] 28265     3

2.2

Start with county_pop again instead of the dataset you just made. Subset the data to only include the population column and the columns that end with “_name”. Hint: use select() and ends_with(). Assign this subset of the data to be pop2. Again take a look at the data and check the dimensions.

pop2 <- select(county_pop, population, ends_with("_name"))
pop2

## # A tibble: 28,265 × 3
##    population state_name county_name
##         <dbl> <chr>      <chr>      
##  1      51328 AL         Autauga    
##  2     168121 AL         Baldwin    
##  3      27861 AL         Barbour    
##  4      22099 AL         Bibb       
##  5      55485 AL         Blount     
##  6      10776 AL         Bullock    
##  7      20815 AL         Butler     
##  8     115388 AL         Calhoun    
##  9      34945 AL         Chambers   
## 10      25466 AL         Cherokee   
## # ℹ 28,255 more rows

dim(pop2)

## [1] 28265     3

2.3

Pull the variable county_name from pop2. How does this differ form selecting it? Use head() to take a look at both options.

head(pull(pop2, county_name))

## [1] "Autauga" "Baldwin" "Barbour" "Bibb"    "Blount"  "Bullock"

head(select(pop2, county_name))

## # A tibble: 6 × 1
##   county_name
##   <chr>      
## 1 Autauga    
## 2 Baldwin    
## 3 Barbour    
## 4 Bibb       
## 5 Blount     
## 6 Bullock

2.4

Subset the rows of county_pop that have more than 100000 for population - how many rows are there? Use filter().

county_pop_sub <- filter(county_pop, population > 100000)
nrow(county_pop_sub)

## [1] 5159

2.5

Subset the rows of county_pop that have a population of more than 100000 and have a year value less than 2010 - how many are there?

filter(county_pop,  population > 100000  & year < 2010) # all of these options work

## # A tibble: 2,260 × 10
##    buyer_county state_name countyfips state county county_name name     variable
##    <chr>        <chr>      <chr>      <dbl>  <dbl> <chr>       <chr>    <chr>   
##  1 BALDWIN      AL         01003          1      3 Baldwin     Baldwin… B01003_…
##  2 CALHOUN      AL         01015          1     15 Calhoun     Calhoun… B01003_…
##  3 ETOWAH       AL         01055          1     55 Etowah      Etowah … B01003_…
##  4 JEFFERSON    AL         01073          1     73 Jefferson   Jeffers… B01003_…
##  5 LEE          AL         01081          1     81 Lee         Lee Cou… B01003_…
##  6 MADISON      AL         01089          1     89 Madison     Madison… B01003_…
##  7 MOBILE       AL         01097          1     97 Mobile      Mobile … B01003_…
##  8 MONTGOMERY   AL         01101          1    101 Montgomery  Montgom… B01003_…
##  9 MORGAN       AL         01103          1    103 Morgan      Morgan … B01003_…
## 10 SHELBY       AL         01117          1    117 Shelby      Shelby … B01003_…
## # ℹ 2,250 more rows
## # ℹ 2 more variables: year <dbl>, population <dbl>

nrow(filter(county_pop,  population > 100000 & year < 2010 ))

## [1] 2260

nrow(filter(county_pop,  population > 100000, year < 2010))

## [1] 2260

2.6

Subset the rows of county_pop that have a population of more than 100000 and a year value of less than or equal to 2010 - how many are there?

filter(county_pop, population > 100000 & year <= 2010) # all of these options work

## # A tibble: 2,834 × 10
##    buyer_county state_name countyfips state county county_name name     variable
##    <chr>        <chr>      <chr>      <dbl>  <dbl> <chr>       <chr>    <chr>   
##  1 BALDWIN      AL         01003          1      3 Baldwin     Baldwin… B01003_…
##  2 CALHOUN      AL         01015          1     15 Calhoun     Calhoun… B01003_…
##  3 ETOWAH       AL         01055          1     55 Etowah      Etowah … B01003_…
##  4 JEFFERSON    AL         01073          1     73 Jefferson   Jeffers… B01003_…
##  5 LEE          AL         01081          1     81 Lee         Lee Cou… B01003_…
##  6 MADISON      AL         01089          1     89 Madison     Madison… B01003_…
##  7 MOBILE       AL         01097          1     97 Mobile      Mobile … B01003_…
##  8 MONTGOMERY   AL         01101          1    101 Montgomery  Montgom… B01003_…
##  9 MORGAN       AL         01103          1    103 Morgan      Morgan … B01003_…
## 10 SHELBY       AL         01117          1    117 Shelby      Shelby … B01003_…
## # ℹ 2,824 more rows
## # ℹ 2 more variables: year <dbl>, population <dbl>

nrow(filter(county_pop, population > 100000 & year <= 2010))

## [1] 2834

nrow(filter(county_pop, population > 100000, year <= 2010))

## [1] 2834

2.7

Why do the answers for 2.5 and 2.6 differ?

# They differ because there are rows with values that are exactly 2010.

Practice on Your Own!

P.2

Subset the rows of county_pop for rows that have state_name of NM, or less than 10000 population. How many rows have both?

filter(county_pop, state_name == "NM" | population < 10000)

## # A tibble: 6,526 × 10
##    buyer_county    state_name countyfips state county county_name name  variable
##    <chr>           <chr>      <chr>      <dbl>  <dbl> <chr>       <chr> <chr>   
##  1 GREENE          AL         01063          1     63 Greene      Gree… B01003_…
##  2 ALEUTIANS EAST  AK         02013          2     13 Aleutians … Aleu… B01003_…
##  3 ALEUTIANS WEST  AK         02016          2     16 Aleutians … Aleu… B01003_…
##  4 BRISTOL BAY     AK         02060          2     60 Bristol Bay Bris… B01003_…
##  5 DENALI          AK         02068          2     68 Denali      Dena… B01003_…
##  6 DILLINGHAM      AK         02070          2     70 Dillingham  Dill… B01003_…
##  7 HAINES          AK         02100          2    100 Haines      Hain… B01003_…
##  8 LAKE AND PENIN… AK         02164          2    164 Lake and P… Lake… B01003_…
##  9 NOME            AK         02180          2    180 Nome        Nome… B01003_…
## 10 NORTH SLOPE     AK         02185          2    185 North Slope Nort… B01003_…
## # ℹ 6,516 more rows
## # ℹ 2 more variables: year <dbl>, population <dbl>

nrow(filter(county_pop, state_name == "NM" & population < 10000))

## [1] 72

P.3

Select the variables that contain the letter “a” from cars_mt_copy.

select(county_pop, contains("a"))

## # A tibble: 28,265 × 7
##    state_name state county_name name                   variable  year population
##    <chr>      <dbl> <chr>       <chr>                  <chr>    <dbl>      <dbl>
##  1 AL             1 Autauga     Autauga County, Alaba… B01003_…  2006      51328
##  2 AL             1 Baldwin     Baldwin County, Alaba… B01003_…  2006     168121
##  3 AL             1 Barbour     Barbour County, Alaba… B01003_…  2006      27861
##  4 AL             1 Bibb        Bibb County, Alabama   B01003_…  2006      22099
##  5 AL             1 Blount      Blount County, Alabama B01003_…  2006      55485
##  6 AL             1 Bullock     Bullock County, Alaba… B01003_…  2006      10776
##  7 AL             1 Butler      Butler County, Alabama B01003_…  2006      20815
##  8 AL             1 Calhoun     Calhoun County, Alaba… B01003_…  2006     115388
##  9 AL             1 Chambers    Chambers County, Alab… B01003_…  2006      34945
## 10 AL             1 Cherokee    Cherokee County, Alab… B01003_…  2006      25466
## # ℹ 28,255 more rows

Part 3

3.1

Create a subset called county_2012 from county_pop that only contains the rows for the year 2012 and only the columns: county_name, state_name, population. The column year should not be included in county_2012. What are the dimensions of this dataset? Don’t use pipes (%>%) and instead do this in two steps creating the county_2012 object with filter and updating it with select.

county_2012 <- filter(county_pop, year == 2012)
county_2012  <- select(county_2012, county_name, state_name, population)
county_2012 # can get dimensions just from viewing the tibble

## # A tibble: 3,140 × 3
##    county_name state_name population
##    <chr>       <chr>           <dbl>
##  1 Autauga     AL              54590
##  2 Baldwin     AL             183226
##  3 Barbour     AL              27469
##  4 Bibb        AL              22769
##  5 Blount      AL              57466
##  6 Bullock     AL              10779
##  7 Butler      AL              20730
##  8 Calhoun     AL             117834
##  9 Chambers    AL              34228
## 10 Cherokee    AL              25917
## # ℹ 3,130 more rows

dim(county_2012) # alternatively can use dim() function

## [1] 3140    3

3.2

Try the same thing again with pipes %>%.

county_2012 <- county_pop %>%
  filter(year == 2012) %>%
  select(county_name, state_name, population)
county_2012

## # A tibble: 3,140 × 3
##    county_name state_name population
##    <chr>       <chr>           <dbl>
##  1 Autauga     AL              54590
##  2 Baldwin     AL             183226
##  3 Barbour     AL              27469
##  4 Bibb        AL              22769
##  5 Blount      AL              57466
##  6 Bullock     AL              10779
##  7 Butler      AL              20730
##  8 Calhoun     AL             117834
##  9 Chambers    AL              34228
## 10 Cherokee    AL              25917
## # ℹ 3,130 more rows

dim(county_2012)

## [1] 3140    3

What happens if you do the steps in a different order? Why does this not work?

#county_2012 <- county_pop %>%
#  select(county_name, state_name, population) %>%
#  filter(year == 2012)
# you get an error because there is now year variable to filter from

3.3

Use arrange() to order the rows of county_2012 by population in increasing order. What is county with the smallest population?

county_2012 <- arrange(county_2012, population)
county_2012

## # A tibble: 3,140 × 3
##    county_name state_name population
##    <chr>       <chr>           <dbl>
##  1 Loving      TX                 85
##  2 King        TX                227
##  3 McPherson   NE                348
##  4 Kenedy      TX                451
##  5 Arthur      NE                490
##  6 Loup        NE                534
##  7 Petroleum   MT                566
##  8 Blaine      NE                584
##  9 Grant       NE                608
## 10 Harding     NM                612
## # ℹ 3,130 more rows

arrange(county_2012, population) %>%
  head(n = 1) %>%
  pull(county_name)

## [1] "Loving"

3.4

Create a new variable in county_2012 called pop1000, which is equal to population divided by 1000, using mutate()(don’t forget to reassign county_2012). Use pipes %>%.

# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)

county_2012 <- county_2012 %>% mutate(pop1000 = population/1000)

Practice on Your Own!

P.4

Move the state_name column to be before county_name in the county_2012 dataset. Use relocate().

county_2012 <- county_2012 %>% relocate(state_name, .before = county_name)
county_2012

## # A tibble: 3,140 × 4
##    state_name county_name population pop1000
##    <chr>      <chr>            <dbl>   <dbl>
##  1 TX         Loving              85   0.085
##  2 TX         King               227   0.227
##  3 NE         McPherson          348   0.348
##  4 TX         Kenedy             451   0.451
##  5 NE         Arthur             490   0.49 
##  6 NE         Loup               534   0.534
##  7 MT         Petroleum          566   0.566
##  8 NE         Blaine             584   0.584
##  9 NE         Grant              608   0.608
## 10 NM         Harding            612   0.612
## # ℹ 3,130 more rows

P.5

How can you find the value of population for Chambers AL without just looking at the data manually and instead use functions we learned today?

county_2012 %>%
  filter(county_name == "Chambers", state_name == "AL") %>%
  select(population)

## # A tibble: 1 × 1
##   population
##        <dbl>
## 1      34228

Subsetting Data in R - Key

Part 1

1.1

1.2

1.3

1.4

Practice on Your Own!

P.1

Part 2

2.1

2.2

2.3

2.4

2.5

2.6

2.7

Practice on Your Own!

P.2

P.3

Part 3

3.1

3.2

3.3

3.4

Practice on Your Own!

P.4

P.5