Part 1

1.1

Load all the packages we will use in this lab.

library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(jhur)

Create some data to work with by running the following code chunk.

set.seed(1234)

int_vect <- rep(seq(from = 1, to = 10), times = 3)
rand_vect <- sample(x = 1:30, size = 30, replace = TRUE)
TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

1.2

Determine the class of each of these new objects.

class(int_vect) # [1] "integer"
## [1] "integer"
class(rand_vect) # [1] "integer"
## [1] "integer"
class(TF_vect) # [1] "logical"
## [1] "logical"
class(TF_vect2) # [1] "character"
## [1] "character"
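The integer results above may be surprising, because numbers typed directly into R are normally stored as doubles. The following is a minimal sketch, using made-up vectors that are not part of the lab data, of how the way a vector is created affects its class.

```r
# Illustrative objects only; these names are not used elsewhere in the lab.
dbl_vect   <- c(1, 2, 3)               # plain number literals are stored as doubles
int_vect2  <- c(1L, 2L, 3L)            # the L suffix requests integers
colon_vect <- 1:3                      # the colon operator yields integers
seq_vect   <- seq(from = 1, to = 10)   # with only from/to, seq() behaves like 1:10

class(dbl_vect)    # "numeric"
class(int_vect2)   # "integer"
class(colon_vect)  # "integer"
class(seq_vect)    # "integer"
typeof(dbl_vect)   # "double" (the storage type behind class "numeric")
```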

1.3

Are TF_vect and TF_vect2 different classes? Why or why not?

# Yes!
# TF_vect is logical: its values are the unquoted constants `TRUE` and `FALSE`.
# TF_vect2 is character: the quotes make "TRUE" and "FALSE" ordinary text strings.
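The difference in class matters in practice. As a small sketch that recreates the two vectors from 1.1, logical values can be counted with sum(), while the character version has to be converted with as.logical() first.

```r
# Recreate the two vectors from 1.1 so this chunk runs on its own.
TF_vect  <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of TRUE values.
n_true <- sum(TF_vect)
n_true  # 20

# sum(TF_vect2) would stop with an error, because character values cannot be added.

# Converting the character vector recovers the logical vector.
converted <- as.logical(TF_vect2)
identical(converted, TF_vect)  # TRUE
```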

1.4

Create a tibble called vect_data that combines these vectors, using the following code.

vect_data <- tibble(int_vect, rand_vect, TF_vect, TF_vect2)

1.5

Coerce rand_vect to character class using as.character(). Save this vector as rand_char_vect. How is the output for rand_vect and rand_char_vect different?

rand_char_vect <- as.character(rand_vect)
rand_char_vect # Numbers now have quotation marks
##  [1] "28" "16" "26" "22" "5"  "12" "15" "9"  "5"  "6"  "16" "4"  "2"  "7"  "22"
## [16] "26" "6"  "15" "14" "20" "14" "30" "24" "30" "4"  "4"  "21" "8"  "20" "24"
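The quotation marks are more than a display detail. The sketch below uses a short made-up vector (not rand_vect itself) to show that character "numbers" sort as text and cannot be used in arithmetic until they are converted back.

```r
# Illustrative values only.
x     <- c(28L, 16L, 5L)
x_chr <- as.character(x)

sorted_num <- sort(x)      # numeric order: 5 16 28
sorted_chr <- sort(x_chr)  # text order, compared character by character: "16" "28" "5"

x + 1      # 29 17 6
# x_chr + 1 would stop with an error: non-numeric argument to binary operator

# Converting back restores the original integers.
x_back <- as.integer(x_chr)
identical(x_back, x)  # TRUE
```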

1.6

Read in the Charm City Circulator data using the read_circulator() function from the jhur package, as in the code supplied in the chunk. Alternatively, read it directly from the URL with read_csv().

circ <- read_circulator()
## Rows: 1146 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
circ <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Charm_City_Circulator_Ridership.csv")
## Rows: 1146 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use the str() function to take a look at the data and learn about the column types.

str(circ)
## spc_tbl_ [1,146 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ day             : chr [1:1146] "Monday" "Tuesday" "Wednesday" "Thursday" ...
##  $ date            : chr [1:1146] "01/11/2010" "01/12/2010" "01/13/2010" "01/14/2010" ...
##  $ orangeBoardings : num [1:1146] 877 777 1203 1194 1645 ...
##  $ orangeAlightings: num [1:1146] 1027 815 1220 1233 1643 ...
##  $ orangeAverage   : num [1:1146] 952 796 1212 1214 1644 ...
##  $ purpleBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ purpleAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ purpleAverage   : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ greenBoardings  : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ greenAlightings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ greenAverage    : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ bannerBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ bannerAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ bannerAverage   : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ daily           : num [1:1146] 952 796 1212 1214 1644 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   day = col_character(),
##   ..   date = col_character(),
##   ..   orangeBoardings = col_double(),
##   ..   orangeAlightings = col_double(),
##   ..   orangeAverage = col_double(),
##   ..   purpleBoardings = col_double(),
##   ..   purpleAlightings = col_double(),
##   ..   purpleAverage = col_double(),
##   ..   greenBoardings = col_double(),
##   ..   greenAlightings = col_double(),
##   ..   greenAverage = col_double(),
##   ..   bannerBoardings = col_double(),
##   ..   bannerAlightings = col_double(),
##   ..   bannerAverage = col_double(),
##   ..   daily = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

1.7

Use the mutate() function to create a new column named date_formatted that is of class Date. Create the new variable from the date column. Hint: use the mdy() function. Reassign the result to circ.

# General format (a template only; the placeholder names do not exist)
# NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
circ <- mutate(circ, date_formatted = mdy(date))
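To see what this kind of conversion does to the values themselves, here is a minimal sketch on three made-up strings in the same month/day/year layout as the date column. It uses base R's as.Date() with an explicit format string as a stand-in for mdy(); that substitution is an assumption made so the chunk runs without extra packages, not the method the lab asks for.

```r
# Illustrative strings in the same layout as circ$date.
date_chr <- c("01/11/2010", "01/12/2010", "03/01/2013")

# %m = month, %d = day, %Y = four-digit year.
date_parsed <- as.Date(date_chr, format = "%m/%d/%Y")

class(date_chr)      # "character"
class(date_parsed)   # "Date"
format(date_parsed)  # "2010-01-11" "2010-01-12" "2013-03-01"

# Unlike strings, Date values support arithmetic.
gap_days <- as.numeric(date_parsed[2] - date_parsed[1])
gap_days  # 1
```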

Practice on Your Own!

P.1

Move the date_formatted variable so that it comes before date, using the relocate() function. Take a look at the data using glimpse(). Note the difference between the date and date_formatted columns.

# General format (a template only; the placeholder names do not exist)
# NEWDATA <- OLD_DATA %>% relocate(COLUMN1, .before = COLUMN2)
circ <- circ %>% relocate(date_formatted, .before = date)

# alternative (no head() here: it would reduce circ to its first 6 rows)
# circ <- circ %>% select(day, date_formatted, everything())

glimpse(circ)
## Rows: 1,146
## Columns: 16
## $ day              <chr> "Monday", "Tuesday", "Wednesday", "Thursday", "Friday…
## $ date_formatted   <date> 2010-01-11, 2010-01-12, 2010-01-13, 2010-01-14, 2010…
## $ date             <chr> "01/11/2010", "01/12/2010", "01/13/2010", "01/14/2010…
## $ orangeBoardings  <dbl> 877, 777, 1203, 1194, 1645, 1457, 839, 999, 1023, 137…
## $ orangeAlightings <dbl> 1027, 815, 1220, 1233, 1643, 1524, 938, 1000, 1047, 1…
## $ orangeAverage    <dbl> 952.0, 796.0, 1211.5, 1213.5, 1644.0, 1490.5, 888.5, …
## $ purpleBoardings  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ purpleAlightings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ purpleAverage    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenBoardings   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenAlightings  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenAverage     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerBoardings  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerAlightings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerAverage    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ daily            <dbl> 952.0, 796.0, 1211.5, 1213.5, 1644.0, 1490.5, 888.5, …

P.2

Use the range() function on the date_formatted variable to display the range of dates in the data set. How does this compare to the range of date? Why? (Hint: use the pull() function first to pull the values.)

pull(circ, date_formatted) %>% range()
## [1] "2010-01-11" "2013-03-01"
pull(circ, date) %>% range()
## [1] "01/01/2011" "12/31/2012"
# The range of `date` is alphabetical, not chronological. `date` is a character
# column, so its values are compared as text: "01/01/2011" sorts first and "12/31/2012" last.
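The same contrast can be reproduced on a handful of values. This sketch uses four made-up dates (chosen to match the endpoints printed above) and base R's as.Date() as an assumed stand-in for mdy().

```r
# Illustrative strings; not the full circ data.
date_chr <- c("01/11/2010", "12/31/2012", "01/01/2011", "03/01/2013")

# Character values are compared as text, starting from the month digits.
chr_range <- range(date_chr)
chr_range  # "01/01/2011" "12/31/2012"

# Date values are compared chronologically.
date_range <- range(as.Date(date_chr, format = "%m/%d/%Y"))
format(date_range)  # "2010-01-11" "2013-03-01"
```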