Part 1

Data used

Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms

You can download it as a CSV file into your current working directory. Note that it's also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv

library(tidyverse)

bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv")

1.1

How many streets with designated bike lanes are currently in Baltimore? You can assume each observation/row is a different street with one or more bike lanes. (Hint: how do you get the number of rows of a data set? You can use dim() or nrow() or another function).

nrow(bike)

## [1] 1631

dim(bike)

## [1] 1631    9

bike %>% nrow()

## [1] 1631

bike %>% count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  1631

1.2

How many feet of bike “lanes” are currently in Baltimore, based on the length column? (use sum())

sum(bike$length)

## [1] 439447.6

sum(bike %>% pull(length))

## [1] 439447.6

bike %>%
  pull(length) %>%
  sum()

## [1] 439447.6

1.3

Summarize the data to get the max of length using the summarize function.

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))

bike %>% summarize(
  max = max(length)
)

## # A tibble: 1 × 1
##     max
##   <dbl>
## 1 3749.

1.4

Modify your code from 1.3 to add the min of length using the summarize function.

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
              SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
    )

bike %>% summarize(
  max = max(length),
  min = min(length)
)

## # A tibble: 1 × 2
##     max   min
##   <dbl> <dbl>
## 1 3749.     0

Practice on Your Own!

P.1

Summarize the bike data to get the mean of length and dateInstalled. Make sure to remove NAs.

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE),
              SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE)
    )

bike %>% summarize(
  mean_length = mean(length, na.rm = TRUE),
  mean_date = mean(dateInstalled, na.rm = TRUE)
)

## # A tibble: 1 × 2
##   mean_length mean_date
##         <dbl>     <dbl>
## 1        269.     1854.

bike %>%
  select(length, dateInstalled) %>%
  colMeans(na.rm = TRUE)

##        length dateInstalled 
##      269.4344     1853.9454

# The mean date is in the 1800s -- that doesn't seem right!

P.2

You should have gotten a mean date sometime in the 1800s - that doesn’t make much sense! Hypothesize why the average date is a date from before bike lanes were being built in Baltimore.

# There are probably some zeros or other incorrect low values in the data.
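
One way to see why a few zeros would produce this result is to try it on a small example. The sketch below uses a made-up vector of years (hypothetical values, not taken from the bike data) to show how zeros drag a mean well below any real year.

```r
# Toy install years (hypothetical values, not from the bike data);
# two entries are recorded as 0 instead of a real year.
years <- c(0, 0, 2007, 2010, 2012)

# The zeros pull the mean far below every real year in the vector.
mean(years)

# Dropping the zeros gives a mean inside the plausible range.
mean(years[years != 0])

# Counting the zeros shows how many values are affected.
sum(years == 0)
```

On the real data, a similar check would be bike %>% count(dateInstalled), which lists each distinct value of dateInstalled and how often it occurs.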

P.3

Filter out the rows of bike where dateInstalled is zero. Use filter(). Assign this “cleaned” dataset object the name bike_2.

# General format 
DATA_TIBBLE %>% filter(LOGICAL_COMPARISON)

bike_2 <- bike %>% filter(dateInstalled != 0)
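
A point worth knowing about this kind of filter: filter() keeps only rows where the condition is TRUE, so rows where the column is NA are dropped along with the zeros. The sketch below shows this on a small made-up tibble (the name toy and its values are assumptions for illustration, not the bike data).

```r
library(dplyr)

# Toy data (hypothetical values) with one zero and one missing year.
toy <- tibble(
  length = c(100, 200, 300, 400),
  dateInstalled = c(0, 2007, NA, 2010)
)

# The comparison NA != 0 evaluates to NA, so filter() drops that row
# together with the row where dateInstalled is 0.
toy_2 <- toy %>% filter(dateInstalled != 0)
nrow(toy_2)

# To remove zeros but keep rows with a missing year, test for NA explicitly.
toy_3 <- toy %>% filter(dateInstalled != 0 | is.na(dateInstalled))
nrow(toy_3)
```

Comparing nrow() before and after filtering is a quick way to confirm how many rows were removed.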

Part 2

2.1

How many bike lanes are there in each type of lane? Use count() on the column named type. Use bike instead of bike_2.

bike %>% count(type)

## # A tibble: 8 × 2
##   type                n
##   <chr>           <int>
## 1 BIKE BOULEVARD     49
## 2 BIKE LANE         621
## 3 CONTRAFLOW         13
## 4 SHARED BUS BIKE    39
## 5 SHARROW           589
## 6 SIDEPATH            7
## 7 SIGNED ROUTE      304
## 8 <NA>                9

2.2

Modify your code from question 2.1 to break down each lane type by number of lanes. Use count() on the columns named type and numLanes.

bike %>% count(type, numLanes)

## # A tibble: 16 × 3
##    type            numLanes     n
##    <chr>              <dbl> <int>
##  1 BIKE BOULEVARD         1     1
##  2 BIKE BOULEVARD         2    48
##  3 BIKE LANE              0    20
##  4 BIKE LANE              1   411
##  5 BIKE LANE              2   190
##  6 CONTRAFLOW             1     7
##  7 CONTRAFLOW             2     6
##  8 SHARED BUS BIKE        1    39
##  9 SHARROW                1   217
## 10 SHARROW                2   372
## 11 SIDEPATH               1     6
## 12 SIDEPATH               2     1
## 13 SIGNED ROUTE           1   211
## 14 SIGNED ROUTE           2    93
## 15 <NA>                   0     1
## 16 <NA>                   1     8

2.3

How many bike lanes are there in each type of lane? Use group_by(), summarize(), and n() on the column named type.

# General format 
DATA_TIBBLE %>% 
    group_by(GROUPING_COLUMN_NAME) %>% 
    summarize(SUMMARY_COLUMN_NAME = n())

bike %>%
  group_by(type) %>%
  summarize(count = n())

## # A tibble: 8 × 2
##   type            count
##   <chr>           <int>
## 1 BIKE BOULEVARD     49
## 2 BIKE LANE         621
## 3 CONTRAFLOW         13
## 4 SHARED BUS BIKE    39
## 5 SHARROW           589
## 6 SIDEPATH            7
## 7 SIGNED ROUTE      304
## 8 <NA>                9
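
The output matches 2.1 because count() is essentially a shortcut for group_by() followed by summarize(n()). The sketch below checks this on a small made-up tibble (toy and its values are hypothetical) and also shows the sort argument of count().

```r
library(dplyr)

# Toy data (hypothetical values), including one missing type.
toy <- tibble(
  type = c("BIKE LANE", "SHARROW", "BIKE LANE", NA, "SHARROW", "BIKE LANE")
)

# count() with a custom name for the count column.
by_count <- toy %>% count(type, name = "count")

# The longer group_by() + summarize() form.
by_group <- toy %>%
  group_by(type) %>%
  summarize(count = n())

# Both approaches give the same groups and the same counts.
identical(by_count$type, by_group$type)
identical(by_count$count, by_group$count)

# sort = TRUE orders the result from most to least frequent.
toy %>% count(type, sort = TRUE)
```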

2.4

Modify your code from 2.3 to also group by numLanes.

bike %>%
  group_by(type, numLanes) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.

## # A tibble: 16 × 3
## # Groups:   type [8]
##    type            numLanes count
##    <chr>              <dbl> <int>
##  1 BIKE BOULEVARD         1     1
##  2 BIKE BOULEVARD         2    48
##  3 BIKE LANE              0    20
##  4 BIKE LANE              1   411
##  5 BIKE LANE              2   190
##  6 CONTRAFLOW             1     7
##  7 CONTRAFLOW             2     6
##  8 SHARED BUS BIKE        1    39
##  9 SHARROW                1   217
## 10 SHARROW                2   372
## 11 SIDEPATH               1     6
## 12 SIDEPATH               2     1
## 13 SIGNED ROUTE           1   211
## 14 SIGNED ROUTE           2    93
## 15 <NA>                   0     1
## 16 <NA>                   1     8
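
The message above appears because summarize() removes only the last grouping level, so the result is still grouped by type. A minimal sketch of one way to get an ungrouped result, using the .groups argument on a small made-up tibble (toy and its values are hypothetical):

```r
library(dplyr)

# Toy data (hypothetical values).
toy <- tibble(
  type = c("BIKE LANE", "BIKE LANE", "SHARROW", "SHARROW", "SHARROW"),
  numLanes = c(1, 2, 1, 1, 2)
)

# .groups = "drop" returns an ungrouped tibble and silences the message.
lane_counts <- toy %>%
  group_by(type, numLanes) %>%
  summarize(count = n(), .groups = "drop")

# FALSE: no grouping is left on the result.
is_grouped_df(lane_counts)
```

Calling ungroup() after summarize() also removes the grouping, although the message is still printed in that case.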

Practice on Your Own!

P.4

Modify your code from 2.3 to also summarize the average bike lane length for each type, so you can see which type has the longest average length. In your summarized output, make sure the new column holding the average bike lane length is named “mean”. In other words, the head of your output should look like:

# A tibble: 
  type                     count  mean
  <chr>                    <int> <dbl>
1 BIKE BOULEVARD              49  197.
...

bike %>%
  group_by(type) %>%
  summarize(
    count = n(),
    mean = mean(length)
  )

## # A tibble: 8 × 3
##   type            count  mean
##   <chr>           <int> <dbl>
## 1 BIKE BOULEVARD     49  197.
## 2 BIKE LANE         621  300.
## 3 CONTRAFLOW         13  136.
## 4 SHARED BUS BIKE    39  277.
## 5 SHARROW           589  244.
## 6 SIDEPATH            7  666.
## 7 SIGNED ROUTE      304  264.
## 8 <NA>                9  260.

P.5

Take your code from question P.4 above and do the following:

Add another pipe (%>%).
Add arrange() to sort the output by the summarized column “mean”.

bike %>%
  group_by(type) %>%
  summarize(
    count = n(),
    mean = mean(length)
  ) %>%
  arrange(mean)

## # A tibble: 8 × 3
##   type            count  mean
##   <chr>           <int> <dbl>
## 1 CONTRAFLOW         13  136.
## 2 BIKE BOULEVARD     49  197.
## 3 SHARROW           589  244.
## 4 <NA>                9  260.
## 5 SIGNED ROUTE      304  264.
## 6 SHARED BUS BIKE    39  277.
## 7 BIKE LANE         621  300.
## 8 SIDEPATH            7  666.
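
arrange() sorts in ascending order by default, so the longest average ends up in the last row. To put the longest average first, the column can be wrapped in desc(). A minimal sketch on a small made-up tibble (toy and its values are hypothetical, not the bike data):

```r
library(dplyr)

# Toy data (hypothetical values).
toy <- tibble(
  type = c("SIDEPATH", "SIDEPATH", "CONTRAFLOW", "CONTRAFLOW", "SHARROW"),
  length = c(600, 700, 100, 200, 250)
)

# Same pipeline as above, but sorted from largest to smallest mean.
longest_first <- toy %>%
  group_by(type) %>%
  summarize(
    count = n(),
    mean = mean(length)
  ) %>%
  arrange(desc(mean))

longest_first
```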

Data Summarization Lab - Key
