Load all the packages we will use in this lab.
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(jhur)
Create some data to work with by running the following code chunk.
set.seed(1234)
int_vect <- rep(seq(from = 1, to = 10), times = 3)
rand_vect <- sample(x = 1:30, size = 30, replace = TRUE)
TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)
Determine the class of each of these new objects.
class(int_vect) # [1] "integer"
## [1] "integer"
class(rand_vect) # [1] "integer"
## [1] "integer"
class(TF_vect) # [1] "logical"
## [1] "logical"
class(TF_vect2) # [1] "character"
## [1] "character"
Are TF_vect and TF_vect2 different classes? Why or why not?
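# Yes! TF_vect is logical and TF_vect2 is character.
# Logical vectors do not have quotes around `TRUE` and `FALSE` values;
# the quoted "TRUE" and "FALSE" in TF_vect2 are stored as text.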
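The class difference matters as soon as you compute with the values. The sketch below uses two short made-up vectors (lgl_example and chr_example are illustrative names, not part of the lab data) to show that logical values can be summed, while their character look-alikes must be converted first.

```r
# Hypothetical three-element vectors mirroring TF_vect and TF_vect2
lgl_example <- c(TRUE, TRUE, FALSE)
chr_example <- c("TRUE", "TRUE", "FALSE")

class(lgl_example)  # "logical"
class(chr_example)  # "character"

# In arithmetic, TRUE counts as 1 and FALSE as 0
sum(lgl_example)    # 2

# sum(chr_example) would stop with an error, because text cannot be added.
# Convert the strings back to logical values first.
converted <- as.logical(chr_example)
sum(converted)                     # 2
identical(converted, lgl_example)  # TRUE
```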
Create a tibble called vect_data that combines these vectors, using the following code.
vect_data <- tibble(int_vect, rand_vect, TF_vect, TF_vect2)
Coerce rand_vect to character class using as.character(). Save this vector as rand_char_vect. How is the output for rand_vect and rand_char_vect different?
rand_char_vect <- as.character(rand_vect)
rand_char_vect # Numbers now have quotation marks
## [1] "28" "16" "26" "22" "5" "12" "15" "9" "5" "6" "16" "4" "2" "7" "22"
## [16] "26" "6" "15" "14" "20" "14" "30" "24" "30" "4" "4" "21" "8" "20" "24"
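The quotation marks signal more than a change in how the values print. As a minimal sketch with made-up values (nums, chars, and back are illustrative names), the code below shows that character "numbers" cannot be used in arithmetic until they are coerced back, and that text which does not look like a number becomes NA.

```r
nums  <- c(28L, 16L, 5L)        # hypothetical integer values
chars <- as.character(nums)
chars                           # "28" "16" "5"

# chars + 1 would stop with an error (non-numeric argument to binary operator)

# Coerce back to integer to do arithmetic again
back <- as.integer(chars)
back + 1L                       # 29 17 6
identical(back, nums)           # TRUE

# Text that is not a number is coerced to NA (R also issues a warning)
suppressWarnings(as.integer("ten"))  # NA
```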
Read in the Charm City Circulator data using the read_circulator() function from the jhur package, with the code supplied in the chunk. Alternatively, read the same file with read_csv() using the URL.
circ <- read_circulator()
## Rows: 1146 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
circ <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Charm_City_Circulator_Ridership.csv")
## Rows: 1146 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the str() function to take a look at the data and learn about the column types.
str(circ)
## spc_tbl_ [1,146 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ day : chr [1:1146] "Monday" "Tuesday" "Wednesday" "Thursday" ...
## $ date : chr [1:1146] "01/11/2010" "01/12/2010" "01/13/2010" "01/14/2010" ...
## $ orangeBoardings : num [1:1146] 877 777 1203 1194 1645 ...
## $ orangeAlightings: num [1:1146] 1027 815 1220 1233 1643 ...
## $ orangeAverage : num [1:1146] 952 796 1212 1214 1644 ...
## $ purpleBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ purpleAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ purpleAverage : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ greenBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ greenAlightings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ greenAverage : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ bannerBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ bannerAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ bannerAverage : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
## $ daily : num [1:1146] 952 796 1212 1214 1644 ...
## - attr(*, "spec")=
## .. cols(
## .. day = col_character(),
## .. date = col_character(),
## .. orangeBoardings = col_double(),
## .. orangeAlightings = col_double(),
## .. orangeAverage = col_double(),
## .. purpleBoardings = col_double(),
## .. purpleAlightings = col_double(),
## .. purpleAverage = col_double(),
## .. greenBoardings = col_double(),
## .. greenAlightings = col_double(),
## .. greenAverage = col_double(),
## .. bannerBoardings = col_double(),
## .. bannerAlightings = col_double(),
## .. bannerAverage = col_double(),
## .. daily = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
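The output above shows that date was read in as character. This lab converts it after reading, but as a side note, readr can also parse dates at read time when the column types are declared. The sketch below is an assumption-laden illustration, not the lab's method: it reads a small inline sample (csv_text and circ_sample are made-up names, with two rows modeled on the data) instead of the real file.

```r
library(readr)

# Two rows of inline sample data; I() tells read_csv() this is literal text
csv_text <- I("day,date,daily\nMonday,01/11/2010,952\nTuesday,01/12/2010,796\n")

circ_sample <- read_csv(
  csv_text,
  col_types = cols(
    day   = col_character(),
    date  = col_date(format = "%m/%d/%Y"),  # month/day/four-digit year
    daily = col_double()
  )
)

class(circ_sample$date)  # "Date"
```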
Use the mutate() function to create a new column named date_formatted that is of Date class. The new variable is created from the date column. Hint: use the mdy() function. Reassign to circ.
# General format (a template, not runnable as written)
# NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
circ <- mutate(circ, date_formatted = mdy(date))
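To see what mdy() does on its own, here is a small sketch using two strings in the same month/day/year layout as the date column (date_strings and parsed are illustrative names). Once the values are Date objects, date arithmetic and component extraction become available.

```r
library(lubridate)

date_strings <- c("01/11/2010", "01/12/2010")  # same layout as the date column
parsed <- mdy(date_strings)

class(date_strings)  # "character"
class(parsed)        # "Date"
parsed               # "2010-01-11" "2010-01-12"

# Date objects support arithmetic and component extraction
parsed[2] - parsed[1]  # Time difference of 1 days
year(parsed)           # 2010 2010
month(parsed)          # 1 1
```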
Move the date_formatted variable to be before date using the relocate() function. Take a look at the data using glimpse(). Note the difference between the date and date_formatted columns.
# General format (a template, not runnable as written)
# NEWDATA <- OLD_DATA %>% relocate(COLUMN1, .before = COLUMN2)
circ <- circ %>% relocate(date_formatted, .before = date)
# alternative (no head() here, since reassigning its result would keep only 6 rows)
# circ <- circ %>% select(day, date_formatted, everything())
glimpse(circ)
## Rows: 1,146
## Columns: 16
## $ day <chr> "Monday", "Tuesday", "Wednesday", "Thursday", "Friday…
## $ date_formatted <date> 2010-01-11, 2010-01-12, 2010-01-13, 2010-01-14, 2010…
## $ date <chr> "01/11/2010", "01/12/2010", "01/13/2010", "01/14/2010…
## $ orangeBoardings <dbl> 877, 777, 1203, 1194, 1645, 1457, 839, 999, 1023, 137…
## $ orangeAlightings <dbl> 1027, 815, 1220, 1233, 1643, 1524, 938, 1000, 1047, 1…
## $ orangeAverage <dbl> 952.0, 796.0, 1211.5, 1213.5, 1644.0, 1490.5, 888.5, …
## $ purpleBoardings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ purpleAlightings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ purpleAverage <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenBoardings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenAlightings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenAverage <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerBoardings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerAlightings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerAverage <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ daily <dbl> 952.0, 796.0, 1211.5, 1213.5, 1644.0, 1490.5, 888.5, …
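relocate() changes only the column order, never the values. The sketch below uses a tiny made-up tibble (toy is a hypothetical name) to show the .before and .after arguments, plus the default behavior of moving the column to the front.

```r
library(dplyr)

# Hypothetical three-column tibble
toy <- tibble(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))

names(relocate(toy, c, .before = a))  # "c" "a" "b"
names(relocate(toy, a, .after = c))   # "b" "c" "a"

# With neither .before nor .after, the column moves to the front
names(relocate(toy, c))               # "c" "a" "b"
```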
Use the range() function on the date_formatted variable to display the range of dates in the data set. How does this compare to that of date? Why? (Hint: use the pull() function first to pull the values.)
pull(circ, date_formatted) %>% range()
## [1] "2010-01-11" "2013-03-01"
pull(circ, date) %>% range()
## [1] "01/01/2011" "12/31/2012"
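# The range of `date` is alphabetical, not chronological: `date` is a character
# column, so its values are compared as text, starting with the month digits.
# Only `date_formatted` (Date class) gives the true earliest and latest dates.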
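The same effect can be reproduced on a handful of values. This sketch takes four date strings that appear in the outputs above (date_strings is an illustrative name) and compares range() on the text with range() on the parsed dates.

```r
library(lubridate)

date_strings <- c("01/11/2010", "12/31/2012", "03/01/2013", "01/01/2011")

# Character comparison: strings are ordered as text, month digits first
range(date_strings)       # "01/01/2011" "12/31/2012"

# Date comparison: values are ordered chronologically
range(mdy(date_strings))  # "2010-01-11" "2013-03-01"
```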