In this lab you can use the interactive console to explore but please record your commands here. Remember anything you type here can be “sent” to the console with Cmd-Enter (OS-X) or Ctrl-Enter (Windows/Linux) (But only inside the code chunks designated with the {r}
areas).
Now we will work with the county_pop
data from the same case study.
# don't forget to load the packages that you will need!
library(tidyverse)
Import the data from “https://jhudatascience.org/intro_to_r/data/county_pop.csv” using read_csv
and assign the data to an object called county_pop
. Check that it worked by seeing if you have the county_pop
data.
county_pop <- read_csv("https://jhudatascience.org/intro_to_r/data/county_pop.csv")
## Rows: 28265 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): BUYER_COUNTY, BUYER_STATE, countyfips, county_name, NAME, variable
## dbl (4): STATE, COUNTY, year, population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(county_pop)
## # A tibble: 6 × 10
## BUYER_COUNTY BUYER_STATE countyfips STATE COUNTY county_name NAME variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 AUTAUGA AL 01001 1 1 Autauga Autauga… B01003_…
## 2 BALDWIN AL 01003 1 3 Baldwin Baldwin… B01003_…
## 3 BARBOUR AL 01005 1 5 Barbour Barbour… B01003_…
## 4 BIBB AL 01007 1 7 Bibb Bibb Co… B01003_…
## 5 BLOUNT AL 01009 1 9 Blount Blount … B01003_…
## 6 BULLOCK AL 01011 1 11 Bullock Bullock… B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>
What class is county_pop
?
class(county_pop)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
How many observations (rows) and variables (columns) are in the dataset - try the dim()
function?
dim(county_pop)
## [1] 28265 10
nrow(county_pop)
## [1] 28265
ncol(county_pop)
## [1] 10
Next, rename the column BUYER_STATE
to be State_name
(hint - use rename()
and watch out for the order of the new and old names!).
county_pop <- rename(county_pop, State_name = BUYER_STATE)
head(county_pop)
## # A tibble: 6 × 10
## BUYER_COUNTY State_name countyfips STATE COUNTY county_name NAME variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 AUTAUGA AL 01001 1 1 Autauga Autauga … B01003_…
## 2 BALDWIN AL 01003 1 3 Baldwin Baldwin … B01003_…
## 3 BARBOUR AL 01005 1 5 Barbour Barbour … B01003_…
## 4 BIBB AL 01007 1 7 Bibb Bibb Cou… B01003_…
## 5 BLOUNT AL 01009 1 9 Blount Blount C… B01003_…
## 6 BULLOCK AL 01011 1 11 Bullock Bullock … B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>
Convert the column names of county_pop
to be all lower case. Use rename_with()
, and the tolower
command.
county_pop <- rename_with(county_pop, tolower)
head(county_pop)
## # A tibble: 6 × 10
## buyer_county state_name countyfips state county county_name name variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 AUTAUGA AL 01001 1 1 Autauga Autauga … B01003_…
## 2 BALDWIN AL 01003 1 3 Baldwin Baldwin … B01003_…
## 3 BARBOUR AL 01005 1 5 Barbour Barbour … B01003_…
## 4 BIBB AL 01007 1 7 Bibb Bibb Cou… B01003_…
## 5 BLOUNT AL 01009 1 9 Blount Blount C… B01003_…
## 6 BULLOCK AL 01011 1 11 Bullock Bullock … B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>
dim(county_pop)
## [1] 28265 10
How can you show the first 3 rows and the last 3 rows of county_pop
(in two lines of code)?
head(county_pop, 3)
## # A tibble: 3 × 10
## buyer_county state_name countyfips state county county_name name variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 AUTAUGA AL 01001 1 1 Autauga Autauga … B01003_…
## 2 BALDWIN AL 01003 1 3 Baldwin Baldwin … B01003_…
## 3 BARBOUR AL 01005 1 5 Barbour Barbour … B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>
tail(county_pop, 3)
## # A tibble: 3 × 10
## buyer_county state_name countyfips state county county_name name variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 SKAGWAY AK 02230 2 230 Skagway Skagw… B01003_…
## 2 HOONAH ANGOON AK 02105 2 105 Hoonah Angoon Hoona… B01003_…
## 3 PETERSBURG AK 02195 2 195 Petersburg Peter… B01003_…
## # ℹ 2 more variables: year <dbl>, population <dbl>
Create a subset of the county_pop
that only contains the columns: county_name
, year
, and population
and assign this object to pop_sub
- what are the dimensions of this dataset?
pop_sub <- select(county_pop, county_name, year, population)
dim(pop_sub)
## [1] 28265 3
Start with county_pop
again instead of the dataset you just made. Subset the data to only include the population
column and the columns that end with “_name”. Hint: use select()
and ends_with()
. Assign this subset of the data to be pop2
. Again take a look at the data and check the dimensions.
pop2 <- select(county_pop, population, ends_with("_name"))
pop2
## # A tibble: 28,265 × 3
## population state_name county_name
## <dbl> <chr> <chr>
## 1 51328 AL Autauga
## 2 168121 AL Baldwin
## 3 27861 AL Barbour
## 4 22099 AL Bibb
## 5 55485 AL Blount
## 6 10776 AL Bullock
## 7 20815 AL Butler
## 8 115388 AL Calhoun
## 9 34945 AL Chambers
## 10 25466 AL Cherokee
## # ℹ 28,255 more rows
dim(pop2)
## [1] 28265 3
Pull the variable county_name
from pop2
. How does this differ form selecting it? Use head() to take a look at both options.
head(pull(pop2, county_name))
## [1] "Autauga" "Baldwin" "Barbour" "Bibb" "Blount" "Bullock"
head(select(pop2, county_name))
## # A tibble: 6 × 1
## county_name
## <chr>
## 1 Autauga
## 2 Baldwin
## 3 Barbour
## 4 Bibb
## 5 Blount
## 6 Bullock
Subset the rows of county_pop
that have more than 100000 for population - how many rows are there? Use filter()
.
county_pop_sub <- filter(county_pop, population > 100000)
nrow(county_pop_sub)
## [1] 5159
Subset the rows of county_pop
that have a population of more than 100000 and have a year value less than 2010 - how many are there?
filter(county_pop, population > 100000 & year < 2010) # all of these options work
## # A tibble: 2,260 × 10
## buyer_county state_name countyfips state county county_name name variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 BALDWIN AL 01003 1 3 Baldwin Baldwin… B01003_…
## 2 CALHOUN AL 01015 1 15 Calhoun Calhoun… B01003_…
## 3 ETOWAH AL 01055 1 55 Etowah Etowah … B01003_…
## 4 JEFFERSON AL 01073 1 73 Jefferson Jeffers… B01003_…
## 5 LEE AL 01081 1 81 Lee Lee Cou… B01003_…
## 6 MADISON AL 01089 1 89 Madison Madison… B01003_…
## 7 MOBILE AL 01097 1 97 Mobile Mobile … B01003_…
## 8 MONTGOMERY AL 01101 1 101 Montgomery Montgom… B01003_…
## 9 MORGAN AL 01103 1 103 Morgan Morgan … B01003_…
## 10 SHELBY AL 01117 1 117 Shelby Shelby … B01003_…
## # ℹ 2,250 more rows
## # ℹ 2 more variables: year <dbl>, population <dbl>
nrow(filter(county_pop, population > 100000 & year < 2010 ))
## [1] 2260
nrow(filter(county_pop, population > 100000, year < 2010))
## [1] 2260
Subset the rows of county_pop
that have a population of more than 100000 and a year value of less than or equal to 2010 - how many are there?
filter(county_pop, population > 100000 & year <= 2010) # all of these options work
## # A tibble: 2,834 × 10
## buyer_county state_name countyfips state county county_name name variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 BALDWIN AL 01003 1 3 Baldwin Baldwin… B01003_…
## 2 CALHOUN AL 01015 1 15 Calhoun Calhoun… B01003_…
## 3 ETOWAH AL 01055 1 55 Etowah Etowah … B01003_…
## 4 JEFFERSON AL 01073 1 73 Jefferson Jeffers… B01003_…
## 5 LEE AL 01081 1 81 Lee Lee Cou… B01003_…
## 6 MADISON AL 01089 1 89 Madison Madison… B01003_…
## 7 MOBILE AL 01097 1 97 Mobile Mobile … B01003_…
## 8 MONTGOMERY AL 01101 1 101 Montgomery Montgom… B01003_…
## 9 MORGAN AL 01103 1 103 Morgan Morgan … B01003_…
## 10 SHELBY AL 01117 1 117 Shelby Shelby … B01003_…
## # ℹ 2,824 more rows
## # ℹ 2 more variables: year <dbl>, population <dbl>
nrow(filter(county_pop, population > 100000 & year <= 2010))
## [1] 2834
nrow(filter(county_pop, population > 100000, year <= 2010))
## [1] 2834
Why do the answers for 2.5 and 2.6 differ?
# They differ because there are rows with values that are exactly 2010.
Subset the rows of county_pop
for rows that have state_name
of NM
, or less than 10000 population
. How many rows have both?
filter(county_pop, state_name == "NM" | population < 10000)
## # A tibble: 6,526 × 10
## buyer_county state_name countyfips state county county_name name variable
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GREENE AL 01063 1 63 Greene Gree… B01003_…
## 2 ALEUTIANS EAST AK 02013 2 13 Aleutians … Aleu… B01003_…
## 3 ALEUTIANS WEST AK 02016 2 16 Aleutians … Aleu… B01003_…
## 4 BRISTOL BAY AK 02060 2 60 Bristol Bay Bris… B01003_…
## 5 DENALI AK 02068 2 68 Denali Dena… B01003_…
## 6 DILLINGHAM AK 02070 2 70 Dillingham Dill… B01003_…
## 7 HAINES AK 02100 2 100 Haines Hain… B01003_…
## 8 LAKE AND PENIN… AK 02164 2 164 Lake and P… Lake… B01003_…
## 9 NOME AK 02180 2 180 Nome Nome… B01003_…
## 10 NORTH SLOPE AK 02185 2 185 North Slope Nort… B01003_…
## # ℹ 6,516 more rows
## # ℹ 2 more variables: year <dbl>, population <dbl>
nrow(filter(county_pop, state_name == "NM" & population < 10000))
## [1] 72
Select the variables that contain the letter “a” from cars_mt_copy
.
select(county_pop, contains("a"))
## # A tibble: 28,265 × 7
## state_name state county_name name variable year population
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 AL 1 Autauga Autauga County, Alaba… B01003_… 2006 51328
## 2 AL 1 Baldwin Baldwin County, Alaba… B01003_… 2006 168121
## 3 AL 1 Barbour Barbour County, Alaba… B01003_… 2006 27861
## 4 AL 1 Bibb Bibb County, Alabama B01003_… 2006 22099
## 5 AL 1 Blount Blount County, Alabama B01003_… 2006 55485
## 6 AL 1 Bullock Bullock County, Alaba… B01003_… 2006 10776
## 7 AL 1 Butler Butler County, Alabama B01003_… 2006 20815
## 8 AL 1 Calhoun Calhoun County, Alaba… B01003_… 2006 115388
## 9 AL 1 Chambers Chambers County, Alab… B01003_… 2006 34945
## 10 AL 1 Cherokee Cherokee County, Alab… B01003_… 2006 25466
## # ℹ 28,255 more rows
Create a subset called county_2012
from county_pop
that only contains the rows for the year 2012 and only the columns: county_name
, state_name
, population
. The column year
should not be included in county_2012
. What are the dimensions of this dataset? Don’t use pipes (%>%
) and instead do this in two steps creating the county_2012
object with filter
and updating it with select
.
county_2012 <- filter(county_pop, year == 2012)
county_2012 <- select(county_2012, county_name, state_name, population)
county_2012 # can get dimensions just from viewing the tibble
## # A tibble: 3,140 × 3
## county_name state_name population
## <chr> <chr> <dbl>
## 1 Autauga AL 54590
## 2 Baldwin AL 183226
## 3 Barbour AL 27469
## 4 Bibb AL 22769
## 5 Blount AL 57466
## 6 Bullock AL 10779
## 7 Butler AL 20730
## 8 Calhoun AL 117834
## 9 Chambers AL 34228
## 10 Cherokee AL 25917
## # ℹ 3,130 more rows
dim(county_2012) # alternatively can use dim() function
## [1] 3140 3
Try the same thing again with pipes %>%
.
county_2012 <- county_pop %>%
filter(year == 2012) %>%
select(county_name, state_name, population)
county_2012
## # A tibble: 3,140 × 3
## county_name state_name population
## <chr> <chr> <dbl>
## 1 Autauga AL 54590
## 2 Baldwin AL 183226
## 3 Barbour AL 27469
## 4 Bibb AL 22769
## 5 Blount AL 57466
## 6 Bullock AL 10779
## 7 Butler AL 20730
## 8 Calhoun AL 117834
## 9 Chambers AL 34228
## 10 Cherokee AL 25917
## # ℹ 3,130 more rows
dim(county_2012)
## [1] 3140 3
What happens if you do the steps in a different order? Why does this not work?
#county_2012 <- county_pop %>%
# select(county_name, state_name, population) %>%
# filter(year == 2012)
# you get an error because there is now year variable to filter from
Use arrange()
to order the rows of county_2012
by population
in increasing order. What is county with the smallest population?
county_2012 <- arrange(county_2012, population)
county_2012
## # A tibble: 3,140 × 3
## county_name state_name population
## <chr> <chr> <dbl>
## 1 Loving TX 85
## 2 King TX 227
## 3 McPherson NE 348
## 4 Kenedy TX 451
## 5 Arthur NE 490
## 6 Loup NE 534
## 7 Petroleum MT 566
## 8 Blaine NE 584
## 9 Grant NE 608
## 10 Harding NM 612
## # ℹ 3,130 more rows
arrange(county_2012, population) %>%
head(n = 1) %>%
pull(county_name)
## [1] "Loving"
Create a new variable in county_2012
called pop1000
, which is equal to population
divided by 1000, using mutate()
(don’t forget to reassign county_2012
). Use pipes %>%
.
# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
county_2012 <- county_2012 %>% mutate(pop1000 = population/1000)
Move the state_name
column to be before county_name
in the county_2012
dataset. Use relocate()
.
county_2012 <- county_2012 %>% relocate(state_name, .before = county_name)
county_2012
## # A tibble: 3,140 × 4
## state_name county_name population pop1000
## <chr> <chr> <dbl> <dbl>
## 1 TX Loving 85 0.085
## 2 TX King 227 0.227
## 3 NE McPherson 348 0.348
## 4 TX Kenedy 451 0.451
## 5 NE Arthur 490 0.49
## 6 NE Loup 534 0.534
## 7 MT Petroleum 566 0.566
## 8 NE Blaine 584 0.584
## 9 NE Grant 608 0.608
## 10 NM Harding 612 0.612
## # ℹ 3,130 more rows
How can you find the value of population
for Chambers AL without just looking at the data manually and instead use functions we learned today?
county_2012 %>%
filter(county_name == "Chambers", state_name == "AL") %>%
select(population)
## # A tibble: 1 × 1
## population
## <dbl>
## 1 34228