Part 1

Load all the libraries we will use in this lab.

library(readr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.6
## ✓ tibble  3.1.2     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(jhur)

Create some data to work with.

First, create an integer vector called int_vect that runs from 1 to 10 and repeats this sequence 3 times using rep() (hint: use seq()).

int_vect <- rep(seq(1, 10), 3)
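As a quick sanity check, the sketch below (base R only) confirms that this call yields an integer vector of length 30. The `each` comparison and the name `by_element` are not part of the lab; they are included only to show how the default `times` behaviour differs.

```r
# Build the vector as in the lab: the sequence 1..10 repeated 3 times.
int_vect <- rep(seq(1, 10), 3)

class(int_vect)   # "integer": seq(1, 10) with no `by` behaves like 1:10
length(int_vect)  # 30

# For comparison only (hypothetical name): `each = 3` repeats every
# element three times before moving on to the next one.
by_element <- rep(seq(1, 10), each = 3)
head(by_element)  # 1 1 1 2 2 2
```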

Next, create a random vector of integers from 1 to 30 called rand_vect (hint: use sample()).

Because we are using a random vector, let’s use the set.seed() function to make sure we all get the same result. This code is already in the code chunk for you; simply create the vector below the set.seed() line.

set.seed(1234)
rand_vect <- sample(1:30, size = 30, replace = TRUE)
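To see why the seed matters, here is a minimal sketch in base R. The names `draw_a`, `draw_b`, and `draw_c` are illustrative and do not appear in the lab. Resetting the seed before each call reproduces the same draw, while a call without a reset continues the random stream.

```r
# Same seed, same call: the two draws are identical.
set.seed(1234)
draw_a <- sample(1:30, size = 30, replace = TRUE)

set.seed(1234)
draw_b <- sample(1:30, size = 30, replace = TRUE)

identical(draw_a, draw_b)  # TRUE

# No reset here, so this draw continues the random stream and will
# generally differ from draw_a.
draw_c <- sample(1:30, size = 30, replace = TRUE)
```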

Create a vector that repeats c(TRUE, TRUE, FALSE) 10 times called TF_vect. Also create a vector that repeats c("TRUE", "TRUE", "FALSE") 10 times called TF_vect2.

TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

Create a tibble combining these vectors together called vect_data using the following code.

vect_data <- tibble(int_vect, rand_vect, TF_vect, TF_vect2)

Take a look at 5 random rows using the slice_sample() function. Try this a few times to see how the results change.

slice_sample(vect_data, n = 5)

## # A tibble: 5 x 4
##   int_vect rand_vect TF_vect TF_vect2
##      <int>     <int> <lgl>   <chr>   
## 1        3        26 FALSE   FALSE   
## 2        4        22 TRUE    TRUE    
## 3        6         4 TRUE    TRUE    
## 4        5         5 TRUE    TRUE    
## 5        2        16 TRUE    TRUE

slice_sample(vect_data, n = 5)

## # A tibble: 5 x 4
##   int_vect rand_vect TF_vect TF_vect2
##      <int>     <int> <lgl>   <chr>   
## 1       10        24 FALSE   FALSE   
## 2        5        22 FALSE   FALSE   
## 3        8         9 TRUE    TRUE    
## 4        7        21 FALSE   FALSE   
## 5       10        20 TRUE    TRUE

slice_sample(vect_data, n = 5)

## # A tibble: 5 x 4
##   int_vect rand_vect TF_vect TF_vect2
##      <int>     <int> <lgl>   <chr>   
## 1        6        26 TRUE    TRUE    
## 2        2         4 FALSE   FALSE   
## 3        3        26 FALSE   FALSE   
## 4        3        24 TRUE    TRUE    
## 5        9         5 FALSE   FALSE

Check to see if the TF_vect is logical. Check to see if TF_vect2 is logical. Why are the results what they are?

is.logical(TF_vect)

## [1] TRUE

is.logical(TF_vect2)

## [1] FALSE

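TF_vect is logical because TRUE and FALSE were written without quotes. TF_vect2 was built from the quoted strings "TRUE" and "FALSE", so it is a character vector and is.logical() returns FALSE.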
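If a character vector like this needs to be treated as logical, one option is as.logical(). The sketch below uses only base R, and `TF_fixed` is an illustrative name that is not part of the lab.

```r
# Quoted values produce a character vector, not a logical one.
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)
class(TF_vect2)       # "character"

# as.logical() converts the strings "TRUE"/"FALSE" to logical values.
TF_fixed <- as.logical(TF_vect2)
is.logical(TF_fixed)  # TRUE

# Logical values can be summed: TRUE counts as 1, FALSE as 0.
sum(TF_fixed)         # 20
```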

Use the mutate() function to create a new variable in the data set named type_fact, of class factor, made from the int_vect variable. Take a look at the data.

vect_data <- mutate(vect_data, type_fact = factor(int_vect))
vect_data

## # A tibble: 30 x 5
##    int_vect rand_vect TF_vect TF_vect2 type_fact
##       <int>     <int> <lgl>   <chr>    <fct>    
##  1        1        28 TRUE    TRUE     1        
##  2        2        16 TRUE    TRUE     2        
##  3        3        26 FALSE   FALSE    3        
##  4        4        22 TRUE    TRUE     4        
##  5        5         5 TRUE    TRUE     5        
##  6        6        12 FALSE   FALSE    6        
##  7        7        15 TRUE    TRUE     7        
##  8        8         9 TRUE    TRUE     8        
##  9        9         5 FALSE   FALSE    9        
## 10       10         6 TRUE    TRUE     10       
## # … with 20 more rows
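A factor stores its values as integer codes plus a set of labels (levels), which matters when converting back to numbers. The sketch below uses a made-up vector, `dose`, rather than the lab data, because with int_vect the codes and the labels happen to coincide.

```r
# Hypothetical numeric vector, not from the lab data.
dose <- c(10, 20, 30, 10)
dose_fact <- factor(dose)

levels(dose_fact)                    # "10" "20" "30"

# as.numeric() on a factor returns the internal codes, not the labels.
as.numeric(dose_fact)                # 1 2 3 1

# Going through character first recovers the original values.
as.numeric(as.character(dose_fact))  # 10 20 30 10
```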

Part 2

Read the Charm City Circulator data using the read_circulator() function from the jhur package. Assign it to the circ variable. Use the str() function to take a look at the data and learn the column types.

circ <- read_circulator()

## Rows: 1146 Columns: 15

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(circ)

## spec_tbl_df [1,146 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ day             : chr [1:1146] "Monday" "Tuesday" "Wednesday" "Thursday" ...
##  $ date            : chr [1:1146] "01/11/2010" "01/12/2010" "01/13/2010" "01/14/2010" ...
##  $ orangeBoardings : num [1:1146] 877 777 1203 1194 1645 ...
##  $ orangeAlightings: num [1:1146] 1027 815 1220 1233 1643 ...
##  $ orangeAverage   : num [1:1146] 952 796 1212 1214 1644 ...
##  $ purpleBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ purpleAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ purpleAverage   : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ greenBoardings  : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ greenAlightings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ greenAverage    : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ bannerBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ bannerAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ bannerAverage   : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##  $ daily           : num [1:1146] 952 796 1212 1214 1644 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   day = col_character(),
##   ..   date = col_character(),
##   ..   orangeBoardings = col_double(),
##   ..   orangeAlightings = col_double(),
##   ..   orangeAverage = col_double(),
##   ..   purpleBoardings = col_double(),
##   ..   purpleAlightings = col_double(),
##   ..   purpleAverage = col_double(),
##   ..   greenBoardings = col_double(),
##   ..   greenAlightings = col_double(),
##   ..   greenAverage = col_double(),
##   ..   bannerBoardings = col_double(),
##   ..   bannerAlightings = col_double(),
##   ..   bannerAverage = col_double(),
##   ..   daily = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Use the mutate() function to create a new column named date_formatted that is of class Date. The new variable is created from the date column. Hint: use the mdy() function.

circ <- mutate(circ, date_formatted = mdy(date))
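To illustrate what mdy() does on its own, the sketch below parses two month/day/year strings taken from the data shown in this lab. The names `date_chr` and `date_parsed` are illustrative; the example assumes the lubridate package is installed.

```r
library(lubridate)

# Two month/day/year strings in the same format as the date column.
date_chr <- c("01/11/2010", "03/01/2013")
date_parsed <- mdy(date_chr)

class(date_chr)     # "character"
class(date_parsed)  # "Date"
date_parsed         # "2010-01-11" "2013-03-01"

# Date objects support arithmetic; this returns a difftime in days.
date_parsed[2] - date_parsed[1]
```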

Move the date_formatted variable to be before date using the relocate function. Take a look at the data using glimpse(). Note the difference between date and date_formatted columns.

circ <- circ %>% relocate(date_formatted, .before = date)
glimpse(circ)

## Rows: 1,146
## Columns: 16
## $ day              <chr> "Monday", "Tuesday", "Wednesday", "Thursday", "Friday…
## $ date_formatted   <date> 2010-01-11, 2010-01-12, 2010-01-13, 2010-01-14, 2010…
## $ date             <chr> "01/11/2010", "01/12/2010", "01/13/2010", "01/14/2010…
## $ orangeBoardings  <dbl> 877, 777, 1203, 1194, 1645, 1457, 839, 999, 1023, 137…
## $ orangeAlightings <dbl> 1027, 815, 1220, 1233, 1643, 1524, 938, 1000, 1047, 1…
## $ orangeAverage    <dbl> 952.0, 796.0, 1211.5, 1213.5, 1644.0, 1490.5, 888.5, …
## $ purpleBoardings  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ purpleAlightings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ purpleAverage    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenBoardings   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenAlightings  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ greenAverage     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerBoardings  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerAlightings <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bannerAverage    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ daily            <dbl> 952.0, 796.0, 1211.5, 1213.5, 1644.0, 1490.5, 888.5, …

Create a list data object called classes_data that combines the vect_data from the first part of the lab and circ. Use glimpse() to look at the data.

classes_data <- list(vect_data, circ)
glimpse(classes_data)

## List of 2
##  $ : tibble [30 × 5] (S3: tbl_df/tbl/data.frame)
##   ..$ int_vect : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ rand_vect: int [1:30] 28 16 26 22 5 12 15 9 5 6 ...
##   ..$ TF_vect  : logi [1:30] TRUE TRUE FALSE TRUE TRUE FALSE ...
##   ..$ TF_vect2 : chr [1:30] "TRUE" "TRUE" "FALSE" "TRUE" ...
##   ..$ type_fact: Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ : tibble [1,146 × 16] (S3: tbl_df/tbl/data.frame)
##   ..$ day             : chr [1:1146] "Monday" "Tuesday" "Wednesday" "Thursday" ...
##   ..$ date_formatted  : Date[1:1146], format: "2010-01-11" "2010-01-12" ...
##   ..$ date            : chr [1:1146] "01/11/2010" "01/12/2010" "01/13/2010" "01/14/2010" ...
##   ..$ orangeBoardings : num [1:1146] 877 777 1203 1194 1645 ...
##   ..$ orangeAlightings: num [1:1146] 1027 815 1220 1233 1643 ...
##   ..$ orangeAverage   : num [1:1146] 952 796 1212 1214 1644 ...
##   ..$ purpleBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ purpleAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ purpleAverage   : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ greenBoardings  : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ greenAlightings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ greenAverage    : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ bannerBoardings : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ bannerAlightings: num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ bannerAverage   : num [1:1146] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ daily           : num [1:1146] 952 796 1212 1214 1644 ...
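The list above has unnamed elements, so they can only be reached by position. As a possible variation, the sketch below names the elements when building the list. It uses two small stand-in data frames (`small_a`, `small_b`) and a hypothetical list name, `classes_named`, rather than the real lab objects.

```r
# Small stand-ins for vect_data and circ (hypothetical contents).
small_a <- data.frame(x = 1:3)
small_b <- data.frame(day = c("Monday", "Tuesday"))

# Naming the elements makes them accessible with $ or [["name"]].
classes_named <- list(vect_data = small_a, circ = small_b)
names(classes_named)            # "vect_data" "circ"

# [[ ]] and $ extract the element itself; [ ] returns a sub-list.
class(classes_named[["circ"]])  # "data.frame"
class(classes_named["circ"])    # "list"
nrow(classes_named$vect_data)   # 3
```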

Bonus: Use range() function on date_formatted variable to display the range of dates in the data set. How does this compare to that of date? Why?

range(circ$date_formatted)

## [1] "2010-01-11" "2013-03-01"

range(circ$date)

## [1] "01/01/2011" "12/31/2012"

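The date column is of class character, so range(circ$date) compares the strings alphabetically, character by character, rather than as dates. The result is the first and last string in sort order ("01/01/2011" and "12/31/2012"), not the earliest and latest dates.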
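The difference can be reproduced on a small scale with base R alone. The sketch below uses three date strings that appear in the output of this lab; `dates_chr` and `dates_parsed` are illustrative names, and as.Date() with an explicit format stands in for mdy().

```r
# Three month/day/year strings stored as character.
dates_chr <- c("01/11/2010", "12/31/2012", "03/01/2013")

# Character comparison goes left to right, so the month decides first.
range(dates_chr)     # "01/11/2010" "12/31/2012"

# After parsing, comparison is chronological.
dates_parsed <- as.Date(dates_chr, format = "%m/%d/%Y")
range(dates_parsed)  # "2010-01-11" "2013-03-01"
```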

Bonus: Use the table() function on the day variable to display the count of orangeBoardings observations for each day. Which day had the most boardings? Is this true for the other routes (purpleBoardings, greenBoardings, bannerBoardings)?

circ  %>%
  group_by(day) %>% 
  summarize(n = sum(orangeBoardings, na.rm = TRUE))

## # A tibble: 7 x 2
##   day            n
##   <chr>      <dbl>
## 1 Friday    584070
## 2 Monday    479914
## 3 Saturday  437682
## 4 Sunday    283648
## 5 Thursday  495142
## 6 Tuesday   488682
## 7 Wednesday 501440

circ  %>%
  group_by(day) %>% 
  summarize(n = sum(purpleBoardings, na.rm = TRUE))

## # A tibble: 7 x 2
##   day            n
##   <chr>      <dbl>
## 1 Friday    698199
## 2 Monday    561344
## 3 Saturday  501958
## 4 Sunday    333439
## 5 Thursday  590142
## 6 Tuesday   576164
## 7 Wednesday 601997

circ  %>%
  group_by(day) %>% 
  summarize(n = sum(greenBoardings, na.rm = TRUE))

## # A tibble: 7 x 2
##   day            n
##   <chr>      <dbl>
## 1 Friday    162829
## 2 Monday    136575
## 3 Saturday  109121
## 4 Sunday     81044
## 5 Thursday  145750
## 6 Tuesday   148742
## 7 Wednesday 151503

circ  %>%
  group_by(day) %>% 
  summarize(n = sum(bannerBoardings, na.rm = TRUE))

## # A tibble: 7 x 2
##   day           n
##   <chr>     <dbl>
## 1 Friday    39530
## 2 Monday    28953
## 3 Saturday  35957
## 4 Sunday    25950
## 5 Thursday  32561
## 6 Tuesday   29231
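## 7 Wednesday 31809

Friday had the most boardings on the orange route, and the same holds for the purple, green, and banner routes.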
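Note that table() counts rows per day, whereas the summaries above add up boardings per day. The sketch below contrasts the two on a small made-up data frame (`toy`, with invented values) and shows one possible way to total several routes in a single call using across(). It assumes dplyr 1.0.0 or later and is not the approach used in the key.

```r
library(dplyr)

# Made-up data for illustration only; not the Circulator data.
toy <- data.frame(
  day = c("Friday", "Friday", "Monday", "Monday", "Monday"),
  orangeBoardings = c(100, 200, 50, NA, 70),
  purpleBoardings = c(NA, 300, 80, 90, 10)
)

# table() counts observations (rows) for each day.
table(toy$day)  # Friday: 2, Monday: 3

# across() applies the same summary to every column whose name ends
# in "Boardings", replacing one summarize() call per route.
toy_totals <- toy %>%
  group_by(day) %>%
  summarize(across(ends_with("Boardings"), ~ sum(.x, na.rm = TRUE)))

toy_totals  # Friday: 300 and 300; Monday: 120 and 180
```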

Data Class Lab Key
