Data used

Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms

You can download the data as a CSV file into your current working directory. Note that it is also available at: http://jhudatascience.org/intro_to_R_class/data/Bike_Lanes.csv

library(readr)
library(dplyr)
library(tidyverse)
library(jhur)

bike = read_csv(
  "http://jhudatascience.org/intro_to_R_class/data/Bike_Lanes.csv")

Or load the same data with the jhur package:

library(jhur)
bike <- read_bike()

## Rows: 1631 Columns: 9

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Part 1

How many bike “lanes” are currently in Baltimore? You can assume each observation/row is a different bike “lane”. (hint: how do you get the number of rows of a data set? You can use dim() or nrow() or another function).

nrow(bike)

## [1] 1631

dim(bike)

## [1] 1631    9

bike %>% 
  nrow()

## [1] 1631
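The three calls above agree because dim() returns both dimensions and nrow() is its first element. A minimal sketch on a small made-up data frame (the `toy` object and its values are illustrative assumptions, not part of the bike data):

```r
# Made-up data frame for illustration only; not the real bike data
toy <- data.frame(type = c("BIKE LANE", "SHARROW", "SIDEPATH"),
                  length = c(120.5, 80, 300))

n_rows <- nrow(toy)   # number of observations (rows)
n_cols <- ncol(toy)   # number of variables (columns)
dims   <- dim(toy)    # both at once: rows first, then columns

n_rows
n_cols
dims
```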

How many feet of bike “lanes” are currently in Baltimore, based on the length column? (use sum())

sum(bike$length)

## [1] 439447.6

sum(bike %>% pull(length))

## [1] 439447.6
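sum() returned a number here, which suggests the length column has no missing values. If a column does contain NA, sum() returns NA unless na.rm = TRUE is set. A small sketch with made-up values (the `lengths_ft` vector is an assumption for illustration):

```r
# Made-up lengths in feet, with one missing value
lengths_ft <- c(100, 250.5, NA, 40)

total_with_na <- sum(lengths_ft)                 # NA, because one value is missing
total_no_na   <- sum(lengths_ft, na.rm = TRUE)   # missing value is dropped first

total_with_na
total_no_na
```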

Summarize the data to get the max and min of length using the summarize function.

bike %>% summarize(max = max(length),
                   min = min(length))

## # A tibble: 1 x 2
##     max   min
##   <dbl> <dbl>
## 1 3749.     0

Part 2

How many types of bike lanes are there? (Hints: unique, table, or bike %>% count() on the column named type).

# So many ways!!

bike %>% count(type)

## # A tibble: 8 x 2
##   type                n
##   <chr>           <int>
## 1 BIKE BOULEVARD     49
## 2 BIKE LANE         621
## 3 CONTRAFLOW         13
## 4 SHARED BUS BIKE    39
## 5 SHARROW           589
## 6 SIDEPATH            7
## 7 SIGNED ROUTE      304
## 8 <NA>                9

bike %>% pull(type) %>% table()

## .
##  BIKE BOULEVARD       BIKE LANE      CONTRAFLOW SHARED BUS BIKE         SHARROW 
##              49             621              13              39             589 
##        SIDEPATH    SIGNED ROUTE 
##               7             304

unique(bike %>% pull(type))

## [1] "BIKE BOULEVARD"  "SIDEPATH"        "SIGNED ROUTE"    "BIKE LANE"      
## [5] "SHARROW"         NA                "CONTRAFLOW"      "SHARED BUS BIKE"

table(bike$type, useNA = "ifany")

## 
##  BIKE BOULEVARD       BIKE LANE      CONTRAFLOW SHARED BUS BIKE         SHARROW 
##              49             621              13              39             589 
##        SIDEPATH    SIGNED ROUTE            <NA> 
##               7             304               9

unique(bike$type)

## [1] "BIKE BOULEVARD"  "SIDEPATH"        "SIGNED ROUTE"    "BIKE LANE"      
## [5] "SHARROW"         NA                "CONTRAFLOW"      "SHARED BUS BIKE"

length(table(bike$type))

## [1] 7

length(unique(bike$type))

## [1] 8

is.na(unique(bike$type))

## [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

bike %>%
  group_by(type) %>% 
  summarize(count = n())

## # A tibble: 8 x 2
##   type            count
##   <chr>           <int>
## 1 BIKE BOULEVARD     49
## 2 BIKE LANE         621
## 3 CONTRAFLOW         13
## 4 SHARED BUS BIKE    39
## 5 SHARROW           589
## 6 SIDEPATH            7
## 7 SIGNED ROUTE      304
## 8 <NA>                9
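The counts of 7 and 8 above differ because table() drops NA by default while unique() keeps NA as its own value. A minimal sketch of the same behavior on a made-up vector (`types` is an illustrative assumption, not the real column):

```r
# Made-up type values with one missing entry
types <- c("BIKE LANE", "SHARROW", NA, "BIKE LANE", "SIDEPATH")

n_with_na    <- length(unique(types))                  # NA counted as a value
n_table      <- length(table(types))                   # table() ignores NA by default
n_without_na <- length(unique(types[!is.na(types)]))   # drop NA before counting

n_with_na
n_table
n_without_na
```

Whether NA should count as a "type" depends on the question being asked; here the real lane types are the non-missing ones.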

Which bike lane type has (a) the greatest number of lanes and (b) the longest average bike lane length? (Hint: group_by and summarize). In your summarized output, make sure you call the new summarized average bike lane length variable (column name) “mean”. In other words, the head of your output should look like:

# A tibble: 
  type            number_of_rows  mean
  <chr>                    <int> <dbl>
1 BIKE BOULEVARD              49  197.
...

bike %>% 
  group_by(type) %>% 
  summarize(number_of_rows = n(),
            mean = mean(length))

## # A tibble: 8 x 3
##   type            number_of_rows  mean
##   <chr>                    <int> <dbl>
## 1 BIKE BOULEVARD              49  197.
## 2 BIKE LANE                  621  300.
## 3 CONTRAFLOW                  13  136.
## 4 SHARED BUS BIKE             39  277.
## 5 SHARROW                    589  244.
## 6 SIDEPATH                     7  666.
## 7 SIGNED ROUTE               304  264.
## 8 <NA>                         9  260.

Take your code from the above question and do the following:

Add another pipe (%>%)
Add the arrange() to sort the output by the summarized column “mean”.

bike %>% 
  group_by(type) %>% 
  summarize(number_of_rows = n(),
            mean = mean(length)) %>% 
  arrange(mean)

## # A tibble: 8 x 3
##   type            number_of_rows  mean
##   <chr>                    <int> <dbl>
## 1 CONTRAFLOW                  13  136.
## 2 BIKE BOULEVARD              49  197.
## 3 SHARROW                    589  244.
## 4 <NA>                         9  260.
## 5 SIGNED ROUTE               304  264.
## 6 SHARED BUS BIKE             39  277.
## 7 BIKE LANE                  621  300.
## 8 SIDEPATH                     7  666.
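arrange() sorts in ascending order, so the longest average ends up in the last row. Wrapping the column in desc() puts it first instead. A sketch on a made-up summary table (the `toy_summary` names and values are assumptions for illustration):

```r
library(dplyr)

# Made-up summary table; values are not taken from the bike data
toy_summary <- tibble(type = c("A", "B", "C"),
                      mean = c(136, 666, 244))

longest_first <- toy_summary %>%
  arrange(desc(mean))   # largest mean in the first row

longest_first
```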

Make a new column using mutate. This new column should be different for each year (dateInstalled), and indicate the total number of lanes (numLanes) built in that year. Call this new column year_total and make sure to reassign the dataset. (hint: use group_by first)

bike <- bike %>% 
  group_by(dateInstalled) %>% 
  mutate(year_total = sum(numLanes, na.rm = TRUE))
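Unlike summarize(), a grouped mutate() keeps every row and repeats the group total on each one. A minimal sketch on made-up data (the `toy` tibble is an assumption; column names mirror the bike data):

```r
library(dplyr)

# Made-up rows: two installed in 2007, three in 2010 (one with missing numLanes)
toy <- tibble(dateInstalled = c(2007, 2007, 2010, 2010, 2010),
              numLanes      = c(1, 2, 1, NA, 3))

with_totals <- toy %>%
  group_by(dateInstalled) %>%
  mutate(year_total = sum(numLanes, na.rm = TRUE)) %>%  # one total per year, repeated
  ungroup()

with_totals
```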

How does your data from above change if you also group by type? Ungroup your data when you are done.

bike %>% 
  group_by(dateInstalled, type) %>% 
  mutate(year_total = sum(numLanes, na.rm = TRUE))

## # A tibble: 1,631 x 10
## # Groups:   dateInstalled, type [28]
##    subType name    block    type    numLanes project  route length dateInstalled
##    <chr>   <chr>   <chr>    <chr>      <dbl> <chr>    <chr>  <dbl>         <dbl>
##  1 <NA>    <NA>    <NA>     BIKE B…        1 GUILFOR… <NA>   436.              0
##  2 <NA>    <NA>    <NA>     SIDEPA…        1 <NA>     NORT… 1025.           2010
##  3 <NA>    <NA>    <NA>     SIGNED…        1 SOUTHEA… <NA>  3749.           2010
##  4 <NA>    HUNTIN… <NA>     SIDEPA…        1 <NA>     <NA>     0               0
##  5 STCLN   EDMOND… 5300 BL… BIKE L…        1 OPERATI… <NA>   181.           2011
##  6 STRALY  WINSTO… 1200 BL… SIGNED…        2 COLLEGE… COLL…  148.           2007
##  7 STRALY  WINSTO… 1200 BL… SIGNED…        2 COLLEGE… COLL…  366.           2007
##  8 STRALY  WINSTO… 1200 BL… SIGNED…        2 COLLEGE… COLL…  262.           2007
##  9 STRPRD  <NA>    <NA>     BIKE L…        1 MAINTEN… <NA>   696.           2009
## 10 STRPRD  <NA>    <NA>     SHARROW        1 COLLEGE… <NA>    43.1          2007
## # … with 1,621 more rows, and 1 more variable: year_total <dbl>

bike <- ungroup(bike)
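Grouping by both columns means each total covers one year and type combination rather than a whole year. The sketch below shows this on made-up data and uses group_vars() to confirm that ungroup() removed the grouping (the `toy` tibble is an illustrative assumption):

```r
library(dplyr)

# Made-up rows covering two years and two lane types
toy <- tibble(dateInstalled = c(2007, 2007, 2007, 2010),
              type          = c("SHARROW", "SHARROW", "BIKE LANE", "BIKE LANE"),
              numLanes      = c(1, 2, 2, 1))

by_year_type <- toy %>%
  group_by(dateInstalled, type) %>%
  mutate(year_total = sum(numLanes, na.rm = TRUE))  # total per year AND type

vars_before <- group_vars(by_year_type)   # grouping columns still attached

by_year_type <- ungroup(by_year_type)
vars_after <- group_vars(by_year_type)    # empty once ungrouped

by_year_type
vars_before
vars_after
```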

Part 3

Create a histogram for the length variable in the bike dataset. Try playing with the breaks= argument.

hist( pull(bike, length), breaks = 100)
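When breaks is a single number, hist() treats it as a suggestion and picks "pretty" cut points, so the actual number of bins may differ from the value given. A sketch comparing two settings without drawing, using simulated values (the `x` vector is made up, not the real lengths):

```r
set.seed(1)
x <- rexp(200, rate = 1 / 250)   # simulated, right-skewed "lengths"

# plot = FALSE returns the bin information instead of drawing
h_coarse <- hist(x, breaks = 10,  plot = FALSE)
h_fine   <- hist(x, breaks = 100, plot = FALSE)

length(h_coarse$counts)   # number of bins actually used
length(h_fine$counts)
```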

Create a scatterplot with dateInstalled on the x axis and length on the y axis.

plot( pull(bike, dateInstalled), pull(bike, length) )

Bonus

A. Summarize the bike data to get the mean of length and dateInstalled. Do this three ways: 1) with summarize, 2) with summarize and across, and 3) with colMeans().

bike %>% summarize( mean_length = mean(length, na.rm = TRUE),
                    mean_date = mean(dateInstalled, na.rm = TRUE))

## # A tibble: 1 x 2
##   mean_length mean_date
##         <dbl>     <dbl>
## 1        269.     1854.

bike %>% 
  summarize(across( c(length, dateInstalled), ~ mean(.x, na.rm = TRUE)))

## # A tibble: 1 x 2
##   length dateInstalled
##    <dbl>         <dbl>
## 1   269.         1854.

bike %>% select(length, dateInstalled) %>% colMeans()

##        length dateInstalled 
##      269.4344     1853.9454

You should have gotten a mean date sometime in the 1800s - that doesn’t make much sense! Hypothesize why the average date is a date from before bike lanes were being built in Baltimore.

There are probably some zeros or other incorrect low values in the data.
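One way to check this hypothesis is to count the zeros and compare the mean with and without them. A minimal sketch on made-up years, where 0 stands in for an unknown install year (the `years` vector is an assumption for illustration):

```r
# Made-up install years; 0 is a placeholder for "unknown"
years <- c(0, 2007, 2010, 0, 2012)

n_zero     <- sum(years == 0)          # how many placeholder zeros
mean_all   <- mean(years)              # dragged down by the zeros
mean_known <- mean(years[years != 0])  # mean of the real years only

n_zero
mean_all
mean_known
```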

B. Change any zeros in bike$dateInstalled to NA using mutate. For the curious, ifelse() in R takes the same arguments as the “IF” function in Excel!

bike <- bike %>% 
  mutate(dateInstalled = ifelse(dateInstalled == 0, NA, dateInstalled))

# Base R alternative; which() skips comparisons that are already NA
bike$dateInstalled[which(bike$dateInstalled == 0)] <- NA

# How to find NAs?
# is.na(bike$dateInstalled)
# !is.na(bike$dateInstalled)
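dplyr also provides na_if(), which replaces one specific value with NA and may read more directly than ifelse() for this task. A sketch on made-up data (the `toy` tibble is an assumption for illustration):

```r
library(dplyr)

# Made-up install years with placeholder zeros
toy <- tibble(dateInstalled = c(0, 2007, 2010, 0))

toy_na <- toy %>%
  mutate(dateInstalled = na_if(dateInstalled, 0))   # 0 becomes NA, others unchanged

n_missing <- sum(is.na(toy_na$dateInstalled))

toy_na
n_missing
```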

What is another way to remove zeros from the data?

Add a filtering step (for example with filter()) that drops the rows containing zeros.
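A sketch of such a filtering step on made-up data follows (the `toy` tibble is an assumption). Note that filter() also drops rows where the condition evaluates to NA, so rows with a missing year disappear unless they are kept explicitly:

```r
library(dplyr)

# Made-up rows: one zero year, one missing year, two real years
toy <- tibble(dateInstalled = c(0, 2007, NA, 2010),
              length        = c(50, 120, 75, 300))

# Drops the zero row AND the NA row (NA != 0 evaluates to NA)
no_zeros <- toy %>%
  filter(dateInstalled != 0)

# Drops only the zero row; rows with a missing year are kept
no_zeros_keep_na <- toy %>%
  filter(dateInstalled != 0 | is.na(dateInstalled))

no_zeros
no_zeros_keep_na
```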

C. What was the average bike lane length grouped by dateInstalled? Remove NAs with na.rm = TRUE.

bike %>%
  group_by(dateInstalled) %>% 
  summarise(mean_of_the_bike = mean(length, na.rm = TRUE))

## # A tibble: 9 x 2
##   dateInstalled mean_of_the_bike
##           <dbl>            <dbl>
## 1          2006            1469.
## 2          2007             310.
## 3          2008             249.
## 4          2009             407.
## 5          2010             246.
## 6          2011             233.
## 7          2012             271.
## 8          2013             290.
## 9            NA             215.

# Overall mean length across rows with a known dateInstalled
mean(bike$length[ !is.na(bike$dateInstalled)])

## [1] 273.9943

bike %>% 
  mutate(length = ifelse(length == 0, NA, length)) %>% 
  group_by(dateInstalled) %>% 
  summarise(n = n(), # Add this column if you want!
            mean_of_the_bike = mean(length, na.rm = TRUE),
            n_missing = sum(is.na(length))) # Add this column if you want!

## # A tibble: 9 x 4
##   dateInstalled     n mean_of_the_bike n_missing
##           <dbl> <int>            <dbl>     <int>
## 1          2006     2            1469.         0
## 2          2007   368             310.         0
## 3          2008   206             249.         0
## 4          2009    86             407.         0
## 5          2010   625             246.         0
## 6          2011   101             233.         0
## 7          2012   107             271.         0
## 8          2013    10             290.         0
## 9            NA   126             217.         1

D. Does the earlier scatterplot of dateInstalled against length improve if you remove the zeros?

plot( pull(bike, dateInstalled), pull(bike, length) )

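Yes! With the zeros set to NA, the x axis covers only the actual installation years instead of stretching down to 0.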

E. What kind of plot would be better for showing the length by each year group? Make this plot.

A boxplot would be more appropriate since year behaves more as a category than a continuous number.

boxplot( pull(bike, length) ~ pull(bike, dateInstalled) )
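The same plot can be written with the formula interface and a data argument, which labels the axes by column name and avoids repeated pull() calls. A sketch on made-up data (the `toy` data frame and axis labels are assumptions for illustration):

```r
# Made-up lengths for two install years
toy <- data.frame(dateInstalled = c(2007, 2007, 2007, 2010, 2010, 2010),
                  length        = c(100, 150, 400, 50, 80, 90))

# One box per year; boxplot() also returns the per-group statistics invisibly
b <- boxplot(length ~ dateInstalled, data = toy,
             xlab = "Year installed", ylab = "Length (feet)")

b$names   # group labels, one per box
b$n       # number of observations in each box
```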

Data Summarization Lab Key
