Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
You can download it as a CSV into your current working directory. Note that it’s also available at: http://jhudatascience.org/intro_to_R_class/data/Bike_Lanes.csv
library(readr)
library(dplyr)
library(tidyverse)
library(jhur)
bike = read_csv(
"http://jhudatascience.org/intro_to_R_class/data/Bike_Lanes.csv")
or use
library(jhur)
bike <- read_bike()
## Rows: 1631 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
How many rows (bike lanes) are in the dataset? (Use dim() or nrow() or another function.)
nrow(bike)
## [1] 1631
dim(bike)
## [1] 1631 9
bike %>%
nrow()
## [1] 1631
What is the total of the length column? (Use sum().)
sum(bike$length)
## [1] 439447.6
sum(bike %>% pull(length))
## [1] 439447.6
Get the max and min of length using the summarize function.
bike %>% summarize(max = max(length),
min = min(length))
## # A tibble: 1 x 2
## max min
## <dbl> <dbl>
## 1 3749. 0
How many types of bike lanes are there? (Use unique, table, or bike %>% count() on the column named type.)
# So many ways!!
bike %>% count(type)
## # A tibble: 8 x 2
## type n
## <chr> <int>
## 1 BIKE BOULEVARD 49
## 2 BIKE LANE 621
## 3 CONTRAFLOW 13
## 4 SHARED BUS BIKE 39
## 5 SHARROW 589
## 6 SIDEPATH 7
## 7 SIGNED ROUTE 304
## 8 <NA> 9
bike %>% pull(type) %>% table()
## .
## BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE SHARROW
## 49 621 13 39 589
## SIDEPATH SIGNED ROUTE
## 7 304
unique(bike %>% pull(type))
## [1] "BIKE BOULEVARD" "SIDEPATH" "SIGNED ROUTE" "BIKE LANE"
## [5] "SHARROW" NA "CONTRAFLOW" "SHARED BUS BIKE"
table(bike$type, useNA = "ifany")
##
## BIKE BOULEVARD BIKE LANE CONTRAFLOW SHARED BUS BIKE SHARROW
## 49 621 13 39 589
## SIDEPATH SIGNED ROUTE <NA>
## 7 304 9
unique(bike$type)
## [1] "BIKE BOULEVARD" "SIDEPATH" "SIGNED ROUTE" "BIKE LANE"
## [5] "SHARROW" NA "CONTRAFLOW" "SHARED BUS BIKE"
length(table(bike$type))
## [1] 7
length(unique(bike$type))
## [1] 8
is.na(unique(bike$type))
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
bike %>%
group_by(type) %>%
summarize(count = n())
## # A tibble: 8 x 2
## type count
## <chr> <int>
## 1 BIKE BOULEVARD 49
## 2 BIKE LANE 621
## 3 CONTRAFLOW 13
## 4 SHARED BUS BIKE 39
## 5 SHARROW 589
## 6 SIDEPATH 7
## 7 SIGNED ROUTE 304
## 8 <NA> 9
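The outputs above disagree on the number of types (7 from table, 8 from unique) because table() drops NA by default while unique() keeps it as a value. A minimal sketch of that difference, using a made-up vector in place of bike$type (the values below are hypothetical, not taken from the data):

```r
# Toy vector standing in for bike$type (hypothetical values)
lane_type <- c("SHARROW", "BIKE LANE", NA, "SHARROW", NA)

# table() ignores NA by default, so only non-missing categories are counted
n_from_table <- length(table(lane_type))

# unique() treats NA as one more distinct value
n_from_unique <- length(unique(lane_type))

# Asking table() to keep NA makes the two counts agree
n_from_table_na <- length(table(lane_type, useNA = "ifany"))

print(c(table = n_from_table,
        unique = n_from_unique,
        table_with_na = n_from_table_na))
```

So when counting categories, decide first whether a missing type should count as a category of its own.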
Which type has (a) the largest number of bike lanes and (b) the longest average bike lane length? (Hint: group_by and summarize.) In your summarized output, make sure you call the new summarized average bike lane length variable (column name) “mean”. In other words, the head of your output should look like:
# A tibble:
type number_of_rows mean
<chr> <int> <dbl>
1 BIKE BOULEVARD 49 197.
...
bike %>%
group_by(type) %>%
summarize(number_of_rows = n(),
mean = mean(length))
## # A tibble: 8 x 3
## type number_of_rows mean
## <chr> <int> <dbl>
## 1 BIKE BOULEVARD 49 197.
## 2 BIKE LANE 621 300.
## 3 CONTRAFLOW 13 136.
## 4 SHARED BUS BIKE 39 277.
## 5 SHARROW 589 244.
## 6 SIDEPATH 7 666.
## 7 SIGNED ROUTE 304 264.
## 8 <NA> 9 260.
Take your answer from the previous question and add on (with %>%) arrange() to sort the output by the summarized column “mean”.
bike %>%
group_by(type) %>%
summarize(number_of_rows = n(),
mean = mean(length)) %>%
arrange(mean)
## # A tibble: 8 x 3
## type number_of_rows mean
## <chr> <int> <dbl>
## 1 CONTRAFLOW 13 136.
## 2 BIKE BOULEVARD 49 197.
## 3 SHARROW 589 244.
## 4 <NA> 9 260.
## 5 SIGNED ROUTE 304 264.
## 6 SHARED BUS BIKE 39 277.
## 7 BIKE LANE 621 300.
## 8 SIDEPATH 7 666.
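arrange() sorts in ascending order, so the longest average ends up in the last row. If you would rather see the largest value first, wrapping the column in desc() is one option. A small sketch on a hypothetical summary table (the type labels "A", "B", "C" are placeholders):

```r
library(dplyr)

# Hypothetical summary table, small enough to check by eye
type_summary <- tibble(
  type = c("A", "B", "C"),
  mean = c(136, 666, 244)
)

# desc() reverses the sort so the largest mean comes first
longest_first <- type_summary %>%
  arrange(desc(mean))

print(longest_first)
```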
Create a new column with mutate. This new column should be different for each year (dateInstalled) and indicate the total number of lanes (numLanes) built in that year. Call this new column year_total and make sure to reassign the dataset. (Hint: use group_by first.)
bike <- bike %>%
group_by(dateInstalled) %>%
mutate(year_total = sum(numLanes, na.rm = TRUE))
What happens if you also group by type? Ungroup your data when you are done.
bike %>%
group_by(dateInstalled, type) %>%
mutate(year_total = sum(numLanes, na.rm = TRUE))
## # A tibble: 1,631 x 10
## # Groups: dateInstalled, type [28]
## subType name block type numLanes project route length dateInstalled
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 <NA> <NA> <NA> BIKE B… 1 GUILFOR… <NA> 436. 0
## 2 <NA> <NA> <NA> SIDEPA… 1 <NA> NORT… 1025. 2010
## 3 <NA> <NA> <NA> SIGNED… 1 SOUTHEA… <NA> 3749. 2010
## 4 <NA> HUNTIN… <NA> SIDEPA… 1 <NA> <NA> 0 0
## 5 STCLN EDMOND… 5300 BL… BIKE L… 1 OPERATI… <NA> 181. 2011
## 6 STRALY WINSTO… 1200 BL… SIGNED… 2 COLLEGE… COLL… 148. 2007
## 7 STRALY WINSTO… 1200 BL… SIGNED… 2 COLLEGE… COLL… 366. 2007
## 8 STRALY WINSTO… 1200 BL… SIGNED… 2 COLLEGE… COLL… 262. 2007
## 9 STRPRD <NA> <NA> BIKE L… 1 MAINTEN… <NA> 696. 2009
## 10 STRPRD <NA> <NA> SHARROW 1 COLLEGE… <NA> 43.1 2007
## # … with 1,621 more rows, and 1 more variable: year_total <dbl>
bike <- ungroup(bike)
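It can help to see why these two questions use mutate rather than summarize after group_by. A grouped mutate keeps every row and repeats the group total on each one, while a grouped summarize collapses to one row per group. A minimal sketch on a toy tibble (the column name year and all values are made up for illustration):

```r
library(dplyr)

# Toy data: two lanes built in 2007, one in 2010 (hypothetical values)
toy <- tibble(
  year = c(2007, 2007, 2010),
  numLanes = c(2, 1, 1)
)

# Grouped mutate: same number of rows, total repeated within each year
with_total <- toy %>%
  group_by(year) %>%
  mutate(year_total = sum(numLanes, na.rm = TRUE)) %>%
  ungroup()  # drop the grouping so later steps are not affected

# Grouped summarize: one row per year
per_year <- toy %>%
  group_by(year) %>%
  summarize(year_total = sum(numLanes, na.rm = TRUE))

print(with_total)
print(per_year)
```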
Make a histogram of the length variable in the bike dataset. Try playing with the breaks= argument.
hist(pull(bike, length), breaks = 100)
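To get a feel for what breaks= does without looking at a picture, you can ask hist() for the bin counts only with plot = FALSE. The sketch below uses a made-up vector and explicit break points. Note that a single number, as in breaks = 100, is only treated as a suggestion for the number of bins, whereas a vector of break points is used exactly.

```r
# Hypothetical values standing in for a numeric column
x <- 1:10

# Two wide bins: (0, 5] and (5, 10]
coarse <- hist(x, breaks = c(0, 5, 10), plot = FALSE)

# Five narrower bins of width 2
fine <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

print(coarse$counts)
print(fine$counts)
```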
Make a plot with dateInstalled on the x axis and length on the y axis.
plot(pull(bike, dateInstalled), pull(bike, length))
Bonus
A. Summarize the bike data to get the mean of length and dateInstalled. Do this three ways: 1) with summarize, 2) with summarize and across, and 3) with colMeans().
bike %>% summarize( mean_length = mean(length, na.rm = TRUE),
mean_date = mean(dateInstalled, na.rm = TRUE))
## # A tibble: 1 x 2
## mean_length mean_date
## <dbl> <dbl>
## 1 269. 1854.
bike %>%
summarize(across( c(length, dateInstalled), ~ mean(.x, na.rm = TRUE)))
## # A tibble: 1 x 2
## length dateInstalled
## <dbl> <dbl>
## 1 269. 1854.
bike %>% select(length, dateInstalled) %>% colMeans()
## length dateInstalled
## 269.4344 1853.9454
You should have gotten a mean date sometime in the 1800s - that doesn’t make much sense! Hypothesize why the average date is a date from before bike lanes were being built in Baltimore.
There are probably some zeros or other incorrect low values in the data.
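A quick bit of arithmetic shows how strongly a zero pulls down an average of years. The vector below is hypothetical, with 0 playing the role of an unknown install date:

```r
# Hypothetical install years; 0 stands in for "unknown"
years <- c(2007, 2010, 0, 2011)

# One zero among four values drags the mean down by roughly 500 years
mean_with_zero <- mean(years)

# Leaving the zero out gives a plausible year
mean_without_zero <- mean(years[years > 0])

print(mean_with_zero)
print(mean_without_zero)
```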
B. Change any zeros in bike$dateInstalled to NA using mutate. For the curious, ifelse() in R takes the same arguments as the “IF” function in Excel!
bike <- bike %>%
mutate(dateInstalled = ifelse(dateInstalled == 0, NA, dateInstalled))
# Base R alternative: compare against the number 0 rather than the string "0"
bike$dateInstalled[bike$dateInstalled == 0] <- NA
# How to find NAs?
# is.na(bike$dateInstalled)
# !is.na(bike$dateInstalled)
What is another way to remove zeros from the data?
Add a filtering step, for example with filter(), that drops the rows where dateInstalled is zero.
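A minimal sketch of such a filtering step on a toy tibble (values are made up). One thing to watch: filter() also drops rows where the condition evaluates to NA, so rows with a missing date disappear too unless you explicitly keep them.

```r
library(dplyr)

# Toy data with one zero and one missing install date (hypothetical values)
toy <- tibble(
  dateInstalled = c(2007, 0, 2010, NA),
  length = c(100, 50, 300, 20)
)

# Drops the zero row, and also the NA row, because NA != 0 is NA
no_zeros <- toy %>%
  filter(dateInstalled != 0)

# Keeps rows with a missing date while still dropping the zero
no_zeros_keep_na <- toy %>%
  filter(dateInstalled != 0 | is.na(dateInstalled))

print(no_zeros)
print(no_zeros_keep_na)
```

Unlike recoding zeros to NA, filtering removes the whole row, so the lengths of those lanes no longer contribute to any later summary.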
C. What was the average bike lane length grouped by dateInstalled? Remove NAs with na.rm = TRUE.
bike %>%
group_by(dateInstalled) %>%
summarise(mean_of_the_bike = mean(length, na.rm = TRUE))
## # A tibble: 9 x 2
## dateInstalled mean_of_the_bike
## <dbl> <dbl>
## 1 2006 1469.
## 2 2007 310.
## 3 2008 249.
## 4 2009 407.
## 5 2010 246.
## 6 2011 233.
## 7 2012 271.
## 8 2013 290.
## 9 NA 215.
# Can combine 6 & 7!
mean(bike$length[ !is.na(bike$dateInstalled)])
## [1] 273.9943
bike %>%
mutate(length = ifelse(length == 0, NA, length)) %>%
group_by(dateInstalled) %>%
summarise(n = n(), # Add this column if you want!
mean_of_the_bike = mean(length, na.rm = TRUE),
n_missing = sum(is.na(length))) # Add this column if you want!
## # A tibble: 9 x 4
## dateInstalled n mean_of_the_bike n_missing
## <dbl> <int> <dbl> <int>
## 1 2006 2 1469. 0
## 2 2007 368 310. 0
## 3 2008 206 249. 0
## 4 2009 86 407. 0
## 5 2010 625 246. 0
## 6 2011 101 233. 0
## 7 2012 107 271. 0
## 8 2013 10 290. 0
## 9 NA 126 217. 1
D. Does the plot from question 10 improve if you remove the zeros?
plot( pull(bike, dateInstalled), pull(bike, length) )
Yes!
E. What kind of plot would be better for showing the length by each year group? Make this plot.
A boxplot would be more appropriate since year behaves more as a category than a continuous number.
boxplot( pull(bike, length) ~ pull(bike, dateInstalled) )