Data used
Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
You can download it as a CSV to your current working directory. Note that it is also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv
library(tidyverse)
bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv")
How many streets with designated bike lanes are currently in Baltimore? You can assume each observation/row is a different street with one or more bike lanes. (Hint: how do you get the number of rows of a data set? You can use dim() or nrow() or another function.)
nrow(bike)
## [1] 1631
dim(bike)
## [1] 1631 9
bike %>% nrow()
## [1] 1631
bike %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 1631
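As a quick sanity check on the approaches above, here is a minimal sketch on a small made-up tibble. The name toy_bike and its values are hypothetical and are not taken from the Baltimore data. It shows that nrow(), dim(), and count() report the same number of rows, just in different shapes.

```r
library(dplyr)

# Hypothetical toy data: three made-up rows standing in for the real dataset
toy_bike <- tibble(
  name = c("A ST", "B ST", "C ST"),
  length = c(100, 250, 50)
)

n_rows <- nrow(toy_bike)                      # a single number
dims <- dim(toy_bike)                         # rows first, then columns
n_count <- toy_bike %>% count() %>% pull(n)   # count() returns a one-row tibble

n_rows
dims
n_count
```

nrow() is usually the most direct choice when only the row count is needed; count() is handy when the result should stay in a pipeline as a tibble.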
How many feet of bike “lanes” are currently in Baltimore, based on the length column? (Use sum().)
sum(bike$length)
## [1] 439447.6
sum(bike %>% pull(length))
## [1] 439447.6
bike %>%
  pull(length) %>%
  sum()
## [1] 439447.6
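The three forms above agree because `$` and pull() both extract a column as a plain vector. A minimal sketch, assuming a made-up tibble (toy_bike and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data; the values are made up for illustration
toy_bike <- tibble(length = c(100, 250, 50))

via_dollar <- toy_bike$length            # base R extraction
via_pull <- toy_bike %>% pull(length)    # dplyr extraction

# Both approaches give the same numeric vector
same_vector <- identical(via_dollar, via_pull)
same_vector

# sum() then works on either one
total <- sum(via_pull)
total
```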
Summarize the data to get the max of length using the summarize function.
# General format
DATA_TIBBLE %>%
  summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))
bike %>% summarize(
  max = max(length)
)
## # A tibble: 1 × 1
## max
## <dbl>
## 1 3749.
Modify your code from 1.3 to add the min of length using the summarize function.
# General format
DATA_TIBBLE %>%
  summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
            SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
  )
bike %>% summarize(
  max = max(length),
  min = min(length)
)
## # A tibble: 1 × 2
## max min
## <dbl> <dbl>
## 1 3749. 0
Summarize the bike data to get the mean of length and dateInstalled. Make sure to remove NAs.
# General format
DATA_TIBBLE %>%
  summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE),
            SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN, na.rm = TRUE)
  )
bike %>% summarize(
  mean_length = mean(length, na.rm = TRUE),
  mean_date = mean(dateInstalled, na.rm = TRUE)
)
## # A tibble: 1 × 2
## mean_length mean_date
## <dbl> <dbl>
## 1 269. 1854.
bike %>%
  select(length, dateInstalled) %>%
  colMeans()
## length dateInstalled
## 269.4344 1853.9454
# The mean date is in the 1800s -- that doesn't seem right!
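To see why na.rm = TRUE matters in the summary above, here is a minimal sketch on a made-up column containing one missing value (toy_bike and its values are hypothetical). Without na.rm = TRUE, a single NA makes the whole mean NA.

```r
library(dplyr)

# Hypothetical toy data with one missing length
toy_bike <- tibble(length = c(100, NA, 300))

toy_summary <- toy_bike %>%
  summarize(
    mean_with_na = mean(length),               # NA propagates into the result
    mean_no_na = mean(length, na.rm = TRUE)    # NA is dropped before averaging
  )
toy_summary
```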
You should have gotten a mean date sometime in the 1800s - that doesn’t make much sense! Hypothesize why the average date is a date from before bike lanes were being built in Baltimore.
# There are probably some zeros or other incorrect low values in the data.
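One way to check this hypothesis is to tabulate the installation years and look at the smallest values. The sketch below uses a made-up column (toy_bike and its values are hypothetical, not from the Baltimore data) to show how a few zeros drag the mean down to an implausible year.

```r
library(dplyr)

# Hypothetical toy data: two real-looking years and two zeros
toy_bike <- tibble(dateInstalled = c(2010, 0, 2012, 0))

# count() sorts by the counted column, so suspicious low values appear first
year_counts <- toy_bike %>% count(dateInstalled)
year_counts

# Number of zero entries, and the mean including them
n_zero <- sum(toy_bike$dateInstalled == 0)
mean_all <- mean(toy_bike$dateInstalled)
n_zero
mean_all
```

Running the same count() on the real bike data would show whether zeros (or other low values) are present in dateInstalled.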
Filter any zeros out of the dateInstalled column of bike. Use filter(). Assign this “cleaned” dataset object the name bike_2.
# General format
DATA_TIBBLE %>% filter(LOGICAL_COMPARISON)
bike_2 <- bike %>% filter(dateInstalled != 0)
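One detail worth knowing: filter() keeps only rows where the condition is TRUE, so rows where dateInstalled is NA are dropped along with the zeros. The sketch below illustrates this on made-up data (toy_bike, toy_2, and toy_keep_na are hypothetical names), and shows one possible way to keep the NA rows if that is wanted.

```r
library(dplyr)

# Hypothetical toy data: one zero and one missing value
toy_bike <- tibble(dateInstalled = c(2010, 0, NA, 2012))

# Same pattern as above: the zero row and the NA row are both removed
toy_2 <- toy_bike %>% filter(dateInstalled != 0)
nrow(toy_2)

# Optional variant: explicitly keep rows with a missing date
toy_keep_na <- toy_bike %>%
  filter(dateInstalled != 0 | is.na(dateInstalled))
nrow(toy_keep_na)

# The mean of the cleaned years is now a plausible date
mean_clean <- mean(toy_2$dateInstalled)
mean_clean
```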
How many bike lanes are there in each type of lane? Use count() on the column named type. Use bike instead of bike_2.
bike %>% count(type)
## # A tibble: 8 × 2
## type n
## <chr> <int>
## 1 BIKE BOULEVARD 49
## 2 BIKE LANE 621
## 3 CONTRAFLOW 13
## 4 SHARED BUS BIKE 39
## 5 SHARROW 589
## 6 SIDEPATH 7
## 7 SIGNED ROUTE 304
## 8 <NA> 9
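By default count() orders the result by the counted column. If the most common type should come first, count() also accepts sort = TRUE. A minimal sketch on made-up data (toy_bike and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with three lane types
toy_bike <- tibble(
  type = c("SHARROW", "BIKE LANE", "SHARROW", "SIDEPATH", "SHARROW", "BIKE LANE")
)

# sort = TRUE orders rows from the largest count to the smallest
type_counts <- toy_bike %>% count(type, sort = TRUE)
type_counts
```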
Modify your code from question 2.1 to break down each lane type by number of lanes. Use count() on the columns named type and numLanes.
bike %>% count(type, numLanes)
## # A tibble: 16 × 3
## type numLanes n
## <chr> <dbl> <int>
## 1 BIKE BOULEVARD 1 1
## 2 BIKE BOULEVARD 2 48
## 3 BIKE LANE 0 20
## 4 BIKE LANE 1 411
## 5 BIKE LANE 2 190
## 6 CONTRAFLOW 1 7
## 7 CONTRAFLOW 2 6
## 8 SHARED BUS BIKE 1 39
## 9 SHARROW 1 217
## 10 SHARROW 2 372
## 11 SIDEPATH 1 6
## 12 SIDEPATH 2 1
## 13 SIGNED ROUTE 1 211
## 14 SIGNED ROUTE 2 93
## 15 <NA> 0 1
## 16 <NA> 1 8
How many bike lanes are there in each type of lane? Use group_by(), summarize(), and n() on the column named type.
# General format
DATA_TIBBLE %>%
  group_by(GROUPING_COLUMN_NAME) %>%
  summarize(SUMMARY_COLUMN_NAME = n())
bike %>%
  group_by(type) %>%
  summarize(count = n())
## # A tibble: 8 × 2
## type count
## <chr> <int>
## 1 BIKE BOULEVARD 49
## 2 BIKE LANE 621
## 3 CONTRAFLOW 13
## 4 SHARED BUS BIKE 39
## 5 SHARROW 589
## 6 SIDEPATH 7
## 7 SIGNED ROUTE 304
## 8 <NA> 9
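The output matches the earlier count(type) result because count() is essentially a shortcut for group_by() followed by summarize(n()). A minimal sketch on made-up data (toy_bike and its values are hypothetical) that compares the two column by column:

```r
library(dplyr)

# Hypothetical toy data with three lane types
toy_bike <- tibble(
  type = c("SHARROW", "BIKE LANE", "SHARROW", "SIDEPATH", "SHARROW", "BIKE LANE")
)

# Shortcut form
by_count <- toy_bike %>% count(type)

# Long form; the summary column is named n to match count()'s default
by_group <- toy_bike %>%
  group_by(type) %>%
  summarize(n = n())

# Both give the same types and the same counts
same_types <- identical(by_count$type, by_group$type)
same_counts <- identical(by_count$n, by_group$n)
same_types
same_counts
```

The long form becomes useful once more than one summary per group is needed, for example a count and a mean together.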
Modify your code from 2.3 to also group by numLanes.
bike %>%
  group_by(type, numLanes) %>%
  summarize(count = n())
## `summarise()` has grouped output by 'type'. You can override using the
## `.groups` argument.
## # A tibble: 16 × 3
## # Groups: type [8]
## type numLanes count
## <chr> <dbl> <int>
## 1 BIKE BOULEVARD 1 1
## 2 BIKE BOULEVARD 2 48
## 3 BIKE LANE 0 20
## 4 BIKE LANE 1 411
## 5 BIKE LANE 2 190
## 6 CONTRAFLOW 1 7
## 7 CONTRAFLOW 2 6
## 8 SHARED BUS BIKE 1 39
## 9 SHARROW 1 217
## 10 SHARROW 2 372
## 11 SIDEPATH 1 6
## 12 SIDEPATH 2 1
## 13 SIGNED ROUTE 1 211
## 14 SIGNED ROUTE 2 93
## 15 <NA> 0 1
## 16 <NA> 1 8
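The message in the output above appears because, after grouping by two columns, summarize() drops only the last grouping level and leaves the result grouped by type. If an ungrouped result is preferred, one option is the .groups argument that the message mentions. A minimal sketch on made-up data (toy_bike and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy_bike <- tibble(
  type = c("SHARROW", "SHARROW", "BIKE LANE", "BIKE LANE", "BIKE LANE"),
  numLanes = c(1, 2, 1, 1, 2)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
toy_counts <- toy_bike %>%
  group_by(type, numLanes) %>%
  summarize(count = n(), .groups = "drop")
toy_counts

# Confirm that no grouping remains on the result
still_grouped <- is_grouped_df(toy_counts)
still_grouped
```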
Modify your code from 2.3 to also summarize the average bike lane length for each type. In your summarized output, make sure you call the new summarized average bike lane length variable (column name) “mean”. In other words, the head of your output should look like:
# A tibble:
type count mean
<chr> <int> <dbl>
1 BIKE BOULEVARD 49 197.
...
bike %>%
  group_by(type) %>%
  summarize(
    count = n(),
    mean = mean(length)
  )
## # A tibble: 8 × 3
## type count mean
## <chr> <int> <dbl>
## 1 BIKE BOULEVARD 49 197.
## 2 BIKE LANE 621 300.
## 3 CONTRAFLOW 13 136.
## 4 SHARED BUS BIKE 39 277.
## 5 SHARROW 589 244.
## 6 SIDEPATH 7 666.
## 7 SIGNED ROUTE 304 264.
## 8 <NA> 9 260.
Take your code from the above question P.4 and do the following: add another pipe (%>%), then add arrange() to sort the output by the summarized column “mean”.
bike %>%
  group_by(type) %>%
  summarize(
    count = n(),
    mean = mean(length)
  ) %>%
  arrange(mean)
## # A tibble: 8 × 3
## type count mean
## <chr> <int> <dbl>
## 1 CONTRAFLOW 13 136.
## 2 BIKE BOULEVARD 49 197.
## 3 SHARROW 589 244.
## 4 <NA> 9 260.
## 5 SIGNED ROUTE 304 264.
## 6 SHARED BUS BIKE 39 277.
## 7 BIKE LANE 621 300.
## 8 SIDEPATH 7 666.
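arrange() sorts in ascending order, so the type with the longest average length ends up in the last row. To put it first instead, one option is to wrap the column in desc(). A minimal sketch on made-up data (toy_bike and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: lengths are made up for illustration
toy_bike <- tibble(
  type = c("SIDEPATH", "SIDEPATH", "SHARROW", "SHARROW", "CONTRAFLOW"),
  length = c(600, 700, 200, 300, 100)
)

toy_sorted <- toy_bike %>%
  group_by(type) %>%
  summarize(
    count = n(),
    mean = mean(length)
  ) %>%
  arrange(desc(mean))   # largest mean first
toy_sorted
```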