In this lab you can use the interactive console to explore but please record your commands here. Remember anything you type here can be “sent” to the console with Cmd-Enter (OS-X) or Ctrl-Enter (Windows/Linux) (But only inside the code chunks designated with the {r} areas).

Part 1

First let’s get the OCSdata package if we haven’t already. We added an argument here so you don’t have to select a mirror.

install.packages("OCSdata", repos='http://cran.us.r-project.org')
## Installing package into '/usr/local/lib/R/site-library'
## (as 'lib' is unspecified)
# don't forget to load the packages that you will need!
library(dplyr)
library(tidyverse)
library(OCSdata)

Now let’s load the opioid related datasets by running this chunk.

OCSdata::load_imported_data("ocs-bp-opioid-rural-urban")

Check that it worked by seeing if you have the county_pop data.

county_pop
## # A tibble: 28,265 × 10
##    BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME    variable
##    <chr>        <chr>       <chr>      <int>  <int> <chr>       <chr>   <chr>   
##  1 AUTAUGA      AL          01001          1      1 Autauga     Autaug… B01003_…
##  2 BALDWIN      AL          01003          1      3 Baldwin     Baldwi… B01003_…
##  3 BARBOUR      AL          01005          1      5 Barbour     Barbou… B01003_…
##  4 BIBB         AL          01007          1      7 Bibb        Bibb C… B01003_…
##  5 BLOUNT       AL          01009          1      9 Blount      Blount… B01003_…
##  6 BULLOCK      AL          01011          1     11 Bullock     Bulloc… B01003_…
##  7 BUTLER       AL          01013          1     13 Butler      Butler… B01003_…
##  8 CALHOUN      AL          01015          1     15 Calhoun     Calhou… B01003_…
##  9 CHAMBERS     AL          01017          1     17 Chambers    Chambe… B01003_…
## 10 CHEROKEE     AL          01019          1     19 Cherokee    Cherok… B01003_…
## # ℹ 28,255 more rows
## # ℹ 2 more variables: year <int>, population <int>

1.1

What class is county_pop?

class(county_pop)
## [1] "tbl_df"     "tbl"        "data.frame"

1.2

How many observations (rows) and variables (columns) are in the dataset - try the dim() function?

dim(county_pop)
## [1] 28265    10
nrow(county_pop)
## [1] 28265
ncol(county_pop)
## [1] 10

1.3

Next, rename the column BUYER_STATE to be State_name (hint - use rename() and watch out for the order of the new and old names!).

county_pop <- rename(county_pop, State_name = BUYER_STATE)
head(county_pop)
## # A tibble: 6 × 10
##   BUYER_COUNTY State_name countyfips STATE COUNTY county_name NAME      variable
##   <chr>        <chr>      <chr>      <int>  <int> <chr>       <chr>     <chr>   
## 1 AUTAUGA      AL         01001          1      1 Autauga     Autauga … B01003_…
## 2 BALDWIN      AL         01003          1      3 Baldwin     Baldwin … B01003_…
## 3 BARBOUR      AL         01005          1      5 Barbour     Barbour … B01003_…
## 4 BIBB         AL         01007          1      7 Bibb        Bibb Cou… B01003_…
## 5 BLOUNT       AL         01009          1      9 Blount      Blount C… B01003_…
## 6 BULLOCK      AL         01011          1     11 Bullock     Bullock … B01003_…
## # ℹ 2 more variables: year <int>, population <int>

1.4

Convert the column names of county_pop to be all lower case. Use rename_with(), and the tolower command.

county_pop <- rename_with(county_pop, tolower)
head(county_pop)
## # A tibble: 6 × 10
##   buyer_county state_name countyfips state county county_name name      variable
##   <chr>        <chr>      <chr>      <int>  <int> <chr>       <chr>     <chr>   
## 1 AUTAUGA      AL         01001          1      1 Autauga     Autauga … B01003_…
## 2 BALDWIN      AL         01003          1      3 Baldwin     Baldwin … B01003_…
## 3 BARBOUR      AL         01005          1      5 Barbour     Barbour … B01003_…
## 4 BIBB         AL         01007          1      7 Bibb        Bibb Cou… B01003_…
## 5 BLOUNT       AL         01009          1      9 Blount      Blount C… B01003_…
## 6 BULLOCK      AL         01011          1     11 Bullock     Bullock … B01003_…
## # ℹ 2 more variables: year <int>, population <int>
dim(county_pop)
## [1] 28265    10

Practice on Your Own!

P.1

How can you print the first 3 rows and the last 3 rows of county_pop (in two lines of code)?

head(county_pop, 3)
## # A tibble: 3 × 10
##   buyer_county state_name countyfips state county county_name name      variable
##   <chr>        <chr>      <chr>      <int>  <int> <chr>       <chr>     <chr>   
## 1 AUTAUGA      AL         01001          1      1 Autauga     Autauga … B01003_…
## 2 BALDWIN      AL         01003          1      3 Baldwin     Baldwin … B01003_…
## 3 BARBOUR      AL         01005          1      5 Barbour     Barbour … B01003_…
## # ℹ 2 more variables: year <int>, population <int>
tail(county_pop, 3)
## # A tibble: 3 × 10
##   buyer_county  state_name countyfips state county county_name   name   variable
##   <chr>         <chr>      <chr>      <int>  <int> <chr>         <chr>  <chr>   
## 1 SKAGWAY       AK         02230          2    230 Skagway       Skagw… B01003_…
## 2 HOONAH ANGOON AK         02105          2    105 Hoonah Angoon Hoona… B01003_…
## 3 PETERSBURG    AK         02195          2    195 Petersburg    Peter… B01003_…
## # ℹ 2 more variables: year <int>, population <int>

Part 2

2.1

Create a subset of the county_pop that only contains the columns: county_name, year, and population and assign this object to pop_sub - what are the dimensions of this dataset?

pop_sub <- select(county_pop, county_name, year, population)
dim(pop_sub)
## [1] 28265     3

2.2

Start with county_pop again instead of the dataset you just made. Subset the data to only include the population column and the columns that end with “_name”. Hint: use select() and ends_with(). Assign this subset of the data to be pop2. Again take a look at the data and check the dimensions.

pop2 <- select(county_pop, population, ends_with("_name"))
pop2
## # A tibble: 28,265 × 3
##    population state_name county_name
##         <int> <chr>      <chr>      
##  1      51328 AL         Autauga    
##  2     168121 AL         Baldwin    
##  3      27861 AL         Barbour    
##  4      22099 AL         Bibb       
##  5      55485 AL         Blount     
##  6      10776 AL         Bullock    
##  7      20815 AL         Butler     
##  8     115388 AL         Calhoun    
##  9      34945 AL         Chambers   
## 10      25466 AL         Cherokee   
## # ℹ 28,255 more rows
dim(pop2)
## [1] 28265     3

2.3

Pull the variable county_name from pop2. How does this differ form selecting it? Use head() to take a look at both options.

head(pull(pop2, county_name))
## [1] "Autauga" "Baldwin" "Barbour" "Bibb"    "Blount"  "Bullock"
head(select(pop2, county_name))
## # A tibble: 6 × 1
##   county_name
##   <chr>      
## 1 Autauga    
## 2 Baldwin    
## 3 Barbour    
## 4 Bibb       
## 5 Blount     
## 6 Bullock

2.4

Subset the rows of county_pop that have more than 100000 for population - how many rows are there? Use filter().

county_pop_sub <- filter(county_pop, population > 100000)
nrow(county_pop_sub)
## [1] 5159

2.5

Subset the rows of county_pop that have a year value less than 2010 and more than 100000 population - how many are there?

filter(county_pop, year < 2010 & population > 100000) # all of these options work
## # A tibble: 2,260 × 10
##    buyer_county state_name countyfips state county county_name name     variable
##    <chr>        <chr>      <chr>      <int>  <int> <chr>       <chr>    <chr>   
##  1 BALDWIN      AL         01003          1      3 Baldwin     Baldwin… B01003_…
##  2 CALHOUN      AL         01015          1     15 Calhoun     Calhoun… B01003_…
##  3 ETOWAH       AL         01055          1     55 Etowah      Etowah … B01003_…
##  4 JEFFERSON    AL         01073          1     73 Jefferson   Jeffers… B01003_…
##  5 LEE          AL         01081          1     81 Lee         Lee Cou… B01003_…
##  6 MADISON      AL         01089          1     89 Madison     Madison… B01003_…
##  7 MOBILE       AL         01097          1     97 Mobile      Mobile … B01003_…
##  8 MONTGOMERY   AL         01101          1    101 Montgomery  Montgom… B01003_…
##  9 MORGAN       AL         01103          1    103 Morgan      Morgan … B01003_…
## 10 SHELBY       AL         01117          1    117 Shelby      Shelby … B01003_…
## # ℹ 2,250 more rows
## # ℹ 2 more variables: year <int>, population <int>
nrow(filter(county_pop, year < 2010 & population > 100000))
## [1] 2260
nrow(filter(county_pop, year < 2010, population > 100000))
## [1] 2260

2.6

Subset the rows of county_pop that have a year value of less than or equal to 2010 and more than 100000 population - how many are there?

filter(county_pop, year <= 2010 & population > 100000) # all of these options work
## # A tibble: 2,834 × 10
##    buyer_county state_name countyfips state county county_name name     variable
##    <chr>        <chr>      <chr>      <int>  <int> <chr>       <chr>    <chr>   
##  1 BALDWIN      AL         01003          1      3 Baldwin     Baldwin… B01003_…
##  2 CALHOUN      AL         01015          1     15 Calhoun     Calhoun… B01003_…
##  3 ETOWAH       AL         01055          1     55 Etowah      Etowah … B01003_…
##  4 JEFFERSON    AL         01073          1     73 Jefferson   Jeffers… B01003_…
##  5 LEE          AL         01081          1     81 Lee         Lee Cou… B01003_…
##  6 MADISON      AL         01089          1     89 Madison     Madison… B01003_…
##  7 MOBILE       AL         01097          1     97 Mobile      Mobile … B01003_…
##  8 MONTGOMERY   AL         01101          1    101 Montgomery  Montgom… B01003_…
##  9 MORGAN       AL         01103          1    103 Morgan      Morgan … B01003_…
## 10 SHELBY       AL         01117          1    117 Shelby      Shelby … B01003_…
## # ℹ 2,824 more rows
## # ℹ 2 more variables: year <int>, population <int>
nrow(filter(county_pop, year <= 2010 & population > 100000))
## [1] 2834
nrow(filter(county_pop, year <= 2010, population > 100000))
## [1] 2834

2.7

Why do the answers for 2.5 and 2.6 differ?

# They differ because there are rows with values that are exactly 2010.

Practice on Your Own!

P.2

Subset the rows of county_pop for rows that have state_name of NM, or less than 10000 population. How many rows have both?

filter(county_pop, state_name == "NM" | population < 10000)
## # A tibble: 6,526 × 10
##    buyer_county    state_name countyfips state county county_name name  variable
##    <chr>           <chr>      <chr>      <int>  <int> <chr>       <chr> <chr>   
##  1 GREENE          AL         01063          1     63 Greene      Gree… B01003_…
##  2 ALEUTIANS EAST  AK         02013          2     13 Aleutians … Aleu… B01003_…
##  3 ALEUTIANS WEST  AK         02016          2     16 Aleutians … Aleu… B01003_…
##  4 BRISTOL BAY     AK         02060          2     60 Bristol Bay Bris… B01003_…
##  5 DENALI          AK         02068          2     68 Denali      Dena… B01003_…
##  6 DILLINGHAM      AK         02070          2     70 Dillingham  Dill… B01003_…
##  7 HAINES          AK         02100          2    100 Haines      Hain… B01003_…
##  8 LAKE AND PENIN… AK         02164          2    164 Lake and P… Lake… B01003_…
##  9 NOME            AK         02180          2    180 Nome        Nome… B01003_…
## 10 NORTH SLOPE     AK         02185          2    185 North Slope Nort… B01003_…
## # ℹ 6,516 more rows
## # ℹ 2 more variables: year <int>, population <int>
nrow(filter(county_pop, state_name == "NM" & population < 10000))
## [1] 72

P.3

Select the variables that contain the letter “a” from cars_mt_copy.

select(county_pop, contains("a"))
## # A tibble: 28,265 × 7
##    state_name state county_name name                   variable  year population
##    <chr>      <int> <chr>       <chr>                  <chr>    <int>      <int>
##  1 AL             1 Autauga     Autauga County, Alaba… B01003_…  2006      51328
##  2 AL             1 Baldwin     Baldwin County, Alaba… B01003_…  2006     168121
##  3 AL             1 Barbour     Barbour County, Alaba… B01003_…  2006      27861
##  4 AL             1 Bibb        Bibb County, Alabama   B01003_…  2006      22099
##  5 AL             1 Blount      Blount County, Alabama B01003_…  2006      55485
##  6 AL             1 Bullock     Bullock County, Alaba… B01003_…  2006      10776
##  7 AL             1 Butler      Butler County, Alabama B01003_…  2006      20815
##  8 AL             1 Calhoun     Calhoun County, Alaba… B01003_…  2006     115388
##  9 AL             1 Chambers    Chambers County, Alab… B01003_…  2006      34945
## 10 AL             1 Cherokee    Cherokee County, Alab… B01003_…  2006      25466
## # ℹ 28,255 more rows

Part 3

3.1

Create a subset called county_2012 from county_pop that only contains the rows for the year 2012 and only the columns: county_name, state_name, population. year should not be included in county_sub. What are the dimensions of this dataset? Don’t use pipes (%>%) and instead do this in two steps creating the county_2012 object with filter and updating it with select.

county_2012 <- filter(county_pop, year == 2012)
county_2012  <- select(county_2012, county_name, state_name, population)
county_2012 # can get dimensions just from viewing the tibble
## # A tibble: 3,140 × 3
##    county_name state_name population
##    <chr>       <chr>           <int>
##  1 Autauga     AL              54590
##  2 Baldwin     AL             183226
##  3 Barbour     AL              27469
##  4 Bibb        AL              22769
##  5 Blount      AL              57466
##  6 Bullock     AL              10779
##  7 Butler      AL              20730
##  8 Calhoun     AL             117834
##  9 Chambers    AL              34228
## 10 Cherokee    AL              25917
## # ℹ 3,130 more rows
dim(county_2012) # alternatively can use dim() function
## [1] 3140    3

3.2

Try the same thing again with pipes %>%.

county_2012 <- county_pop %>%
  filter(year == 2012) %>%
  select(county_name, state_name, population)
county_2012
## # A tibble: 3,140 × 3
##    county_name state_name population
##    <chr>       <chr>           <int>
##  1 Autauga     AL              54590
##  2 Baldwin     AL             183226
##  3 Barbour     AL              27469
##  4 Bibb        AL              22769
##  5 Blount      AL              57466
##  6 Bullock     AL              10779
##  7 Butler      AL              20730
##  8 Calhoun     AL             117834
##  9 Chambers    AL              34228
## 10 Cherokee    AL              25917
## # ℹ 3,130 more rows
dim(county_2012)
## [1] 3140    3

What happens if you do the steps in a different order? Why does this not work?

#county_2012 <- county_pop %>%
#  select(county_name, state_name, population) %>%
#  filter(year == 2012)
# you get an error because there is now year variable to filter from

1.3

Re-order the rows of county_2012 by population in increasing order. Use arrange(). What is county with the smallest population?

county_2012 <- arrange(county_2012, population)
county_2012
## # A tibble: 3,140 × 3
##    county_name state_name population
##    <chr>       <chr>           <int>
##  1 Loving      TX                 85
##  2 King        TX                227
##  3 McPherson   NE                348
##  4 Kenedy      TX                451
##  5 Arthur      NE                490
##  6 Loup        NE                534
##  7 Petroleum   MT                566
##  8 Blaine      NE                584
##  9 Grant       NE                608
## 10 Harding     NM                612
## # ℹ 3,130 more rows
arrange(county_2012, population) %>%
  head(n = 1) %>%
  pull(county_name)
## [1] "Loving"

1.4

Create a new variable in county_2012 called pop1000, which is equal to population divided by 1000, using mutate()(don’t forget to reassign county_2012). Use pipes %>%.

# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
county_2012 <- county_2012 %>% mutate(pop1000 = population/1000)

Practice on Your Own!

P.4

Move the state_name column to be before county_name in the county_2012 dataset. Use relocate().

county_2012 <- county_2012 %>% relocate(state_name, .before = county_name)
county_2012
## # A tibble: 3,140 × 4
##    state_name county_name population pop1000
##    <chr>      <chr>            <int>   <dbl>
##  1 TX         Loving              85   0.085
##  2 TX         King               227   0.227
##  3 NE         McPherson          348   0.348
##  4 TX         Kenedy             451   0.451
##  5 NE         Arthur             490   0.49 
##  6 NE         Loup               534   0.534
##  7 MT         Petroleum          566   0.566
##  8 NE         Blaine             584   0.584
##  9 NE         Grant              608   0.608
## 10 NM         Harding            612   0.612
## # ℹ 3,130 more rows

P.5

How can you find the value of population for Chambers AL without just looking at the data manually and instead use functions we learned today?

county_2012 %>%
  filter(county_name == "Chambers", state_name == "AL") %>%
  select(population)
## # A tibble: 1 × 1
##   population
##        <int>
## 1      34228