Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
You can download it as a CSV into your current working directory. Note that it’s also available at: http://jhudatascience.org/intro_to_R_class/data/Bike_Lanes.csv
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ dplyr 1.0.7
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(jhur)
library(tidyverse)
library(broom)
library(naniar)
bike <- jhur::read_bike()
## Rows: 1631 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the is.na() and any() functions to check if the bike dateInstalled variable has any NA values.
any(is.na(bike$dateInstalled))
## [1] FALSE
#or
bike %>% pull(dateInstalled) %>% is.na() %>% any()
## [1] FALSE
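Because dateInstalled has no missing values, the check above only ever shows FALSE. As a minimal sketch on a hypothetical vector (x is made up for illustration and is not part of the bike data), the same pattern reports TRUE when something is missing, and sum() counts how many values are missing:

```r
# Hypothetical vector with two missing values (not from the bike data)
x <- c(2007, NA, 2010, NA, 2011)

# is.na() returns one TRUE/FALSE per element
missing_flags <- is.na(x)

# any() asks whether at least one element is TRUE
has_missing <- any(missing_flags)

# sum() counts the TRUE values, i.e. the number of NAs
n_missing <- sum(missing_flags)

has_missing
n_missing
```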
Use drop_na to keep only the rows of bike that have a value for the route variable, and assign this to the object have_route.
have_route <- bike %>% drop_na(route)
have_route
## # A tibble: 362 × 9
## subType name block type numLanes project route length dateInstalled
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 <NA> <NA> <NA> SIDEP… 1 <NA> NORT… 1025. 2010
## 2 STRALY WINSTO… 1200 BL… SIGNE… 2 COLLEGET… COLL… 148. 2007
## 3 STRALY WINSTO… 1200 BL… SIGNE… 2 COLLEGET… COLL… 366. 2007
## 4 STRALY WINSTO… 1200 BL… SIGNE… 2 COLLEGET… COLL… 262. 2007
## 5 STRPRD <NA> <NA> SIGNE… 1 COLLEGET… COLL… 49.3 2007
## 6 STRPRD <NA> <NA> SIGNE… 1 COLLEGET… COLL… 70.0 2007
## 7 STRPRD <NA> <NA> SIGNE… 1 COLLEGET… COLL… 765. 2007
## 8 STRPRD <NA> <NA> SIGNE… 2 COLLEGET… COLL… 170. 2007
## 9 STRPRD <NA> <NA> SIGNE… 2 COLLEGET… COLL… 1724. 2007
## 10 STRPRD ALBEMA… 100 BLK… SIGNE… 1 SOUTHEAS… LITT… 250. 2011
## # … with 352 more rows
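For comparison, here is a small sketch on a hypothetical data frame (toy and its values are invented for illustration) showing that drop_na() on one column keeps the same rows as filter() with !is.na(), while drop_na() with no columns named removes every row that has an NA anywhere:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the bike data
toy <- data.frame(
  route  = c("NORTH", NA, "COLLEGETOWN"),
  length = c(10, 20, NA),
  stringsAsFactors = FALSE
)

# Drop rows where route is missing
by_drop_na <- toy %>% drop_na(route)

# Same rows, written as an explicit filter
by_filter <- toy %>% filter(!is.na(route))

# No columns named: drop rows with an NA in any column
complete_rows <- toy %>% drop_na()
```

Naming the column matters here: the third row survives drop_na(route) even though its length is missing.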
Use naniar to make a visual of the amount of data missing for each variable of bike (use naniar::gg_miss_var()). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
naniar::gg_miss_var(bike)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
What percentage of the subType variable of bike is complete? Hint: use another naniar function.
naniar::pct_complete(bike$subType) # this or..
## [1] 99.75475
pull(bike, subType) %>% pct_complete() # this
## [1] 99.75475
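The percentage complete is the share of values that are not NA, multiplied by 100. A base R sketch of that arithmetic on a hypothetical vector (sub_type is made up, and this is an illustration of the idea rather than naniar's own code):

```r
# Hypothetical vector: 3 of 4 values are present
sub_type <- c("STRALY", NA, "STRPRD", "STRPRD")

# mean() of a logical vector is the proportion of TRUE values
pct_present <- 100 * mean(!is.na(sub_type))
pct_absent  <- 100 * mean(is.na(sub_type))

pct_present
pct_absent
```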
New Data set
Now imagine we work in a clinic and we are trying to understand more about blood types of patients.
Let’s say we have the data like so:
BloodType <- tibble(gender =
c("M", "Male", "Female", "F", "M",
"Male", "Other", "M", "F", "Other",
"F", "Male", NA, "Male", "Female"),
type = c("A.-", "AB.+", "O.-", "O.+", "AB.-",
"B.+", "B.-", "o.-", "O.+", "A.-",
"A.+", "O.-", "B.-", "o.+", "AB.-"),
infection = c("Yes", "No", "Yes", "No", "No",
"No", "Yes", "No", "Yes", "No",
"No", "Yes", "Yes", "Yes", "NotSure"))
BloodType
## # A tibble: 15 × 3
## gender type infection
## <chr> <chr> <chr>
## 1 M A.- Yes
## 2 Male AB.+ No
## 3 Female O.- Yes
## 4 F O.+ No
## 5 M AB.- No
## 6 Male B.+ No
## 7 Other B.- Yes
## 8 M o.- No
## 9 F O.+ Yes
## 10 Other A.- No
## 11 F A.+ No
## 12 Male O.- Yes
## 13 <NA> B.- Yes
## 14 Male o.+ Yes
## 15 Female AB.- NotSure
There are some issues with this data that we need to figure out!
count(BloodType, gender)
## # A tibble: 6 × 2
## gender n
## <chr> <int>
## 1 F 3
## 2 Female 2
## 3 M 3
## 4 Male 4
## 5 Other 2
## 6 <NA> 1
sum(is.na(pull(BloodType, gender)))
## [1] 1
BloodType %>% pull(gender) %>% is.na() %>% sum()
## [1] 1
Recode the gender variable of the BloodType data so that it is consistent. Use case_when().
BloodType <- BloodType %>%
mutate(gender = case_when(gender %in% c("M", "m", "Male") ~ "Male",
gender %in% c("F", "female", "Female") ~ "Female",
gender %in% c("Other") ~ "Other"))
count(BloodType, gender)
## # A tibble: 4 × 2
## gender n
## <chr> <int>
## 1 Female 5
## 2 Male 7
## 3 Other 2
## 4 <NA> 1
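One thing to keep in mind with case_when(): any value that matches none of the conditions becomes NA. That is why "Other" needs its own line above. The sketch below uses a hypothetical data frame (toy and the value "Nonbinary" are invented for illustration) to contrast that behavior with a catch-all TRUE ~ gender line that passes unmatched values through unchanged:

```r
library(dplyr)

# Hypothetical toy data, not the lab's BloodType tibble
toy <- data.frame(
  gender = c("M", "female", "Nonbinary", NA),
  stringsAsFactors = FALSE
)

# Without a fallback, unmatched values ("Nonbinary") turn into NA
toy_strict <- toy %>%
  mutate(gender_clean = case_when(
    gender %in% c("M", "m", "Male") ~ "Male",
    gender %in% c("F", "female", "Female") ~ "Female"
  ))

# With a TRUE ~ gender fallback, unmatched values are kept as they are
toy_fallback <- toy %>%
  mutate(gender_clean = case_when(
    gender %in% c("M", "m", "Male") ~ "Male",
    gender %in% c("F", "female", "Female") ~ "Female",
    TRUE ~ gender
  ))
```

Values that were already NA stay NA in both versions.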
Check to see how many values gender has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.
BloodType %>% count(gender)
## # A tibble: 4 × 2
## gender n
## <chr> <int>
## 1 Female 5
## 2 Male 7
## 3 Other 2
## 4 <NA> 1
Recode the type variable of the BloodType data to be consistent. Use recode.
BloodType <- BloodType %>%
mutate(type = recode(type, "o.-" = "O.-",
"o.+" = "O.+"))
BloodType
## # A tibble: 15 × 3
## gender type infection
## <chr> <chr> <chr>
## 1 Male A.- Yes
## 2 Male AB.+ No
## 3 Female O.- Yes
## 4 Female O.+ No
## 5 Male AB.- No
## 6 Male B.+ No
## 7 Other B.- Yes
## 8 Male O.- No
## 9 Female O.+ Yes
## 10 Other A.- No
## 11 Female A.+ No
## 12 Male O.- Yes
## 13 <NA> B.- Yes
## 14 Male O.+ Yes
## 15 Female AB.- NotSure
Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”
BloodType %>% pull(type) %>% table(useNA = "ifany")
## .
## A.- A.+ AB.- AB.+ B.- B.+ O.- O.+
## 2 1 2 1 2 1 3 3
#or
BloodType %>% count(type)
## # A tibble: 8 × 2
## type n
## <chr> <int>
## 1 A.- 2
## 2 A.+ 1
## 3 AB.- 2
## 4 AB.+ 1
## 5 B.- 2
## 6 B.+ 1
## 7 O.- 3
## 8 O.+ 3
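In this data the only inconsistency in type is letter case, so a possible alternative to listing each value in recode is to upper-case the whole column. This is a sketch on a hypothetical vector (type_raw is made up) using base R's toupper(), and it assumes case is the only problem, which would not hold for typos or other variants:

```r
# Hypothetical vector with mixed-case blood types
type_raw <- c("A.-", "o.-", "O.+", "o.+", "AB.-")

# toupper() changes letters only; ".", "+" and "-" are left alone
type_clean <- toupper(type_raw)

table(type_clean)
```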
Make a new tibble of BloodType called BloodType_split that splits the type variable into two called blood_type and Rhfactor. Note: periods are special characters in regular expressions that are generally interpreted as wildcards, so we need \\ to tell R that we want it to be interpreted as a literal period.
# ______________ <- ________ %>%
# ________(____, ____ = c(__________, __________), sep = "\\.")
BloodType_split <- BloodType %>% separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
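To see why the escape matters, the following base R sketch (the vector x is a hypothetical example) compares an unescaped period, which matches every character, with the escaped form "\\.", which matches only a literal period:

```r
# Hypothetical values in the same "letters.sign" shape as type
x <- c("AB.+", "O.-")

# Unescaped "." is a wildcard: every character gets replaced
wildcard <- gsub(".", "_", x)

# Escaped "\\." matches only the literal period
escaped <- gsub("\\.", "_", x)

# fixed = TRUE is another way to treat the pattern as plain text
fixed <- gsub(".", "_", x, fixed = TRUE)

# Splitting on the literal period gives the two pieces separate() keeps
pieces <- strsplit(x, "\\.")
```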
Bonus: How many observations are there for each Rhfactor?
table(BloodType_split$Rhfactor) # base R
##
## - +
## 9 6
count(BloodType_split, Rhfactor) # tidyverse
## # A tibble: 2 × 2
## Rhfactor n
## <chr> <int>
## 1 - 9
## 2 + 6
Bonus: Filtering for patients with type O, how many had the infection?
BloodType_split %>%
filter(blood_type == "O") %>%
count(infection)
## # A tibble: 2 × 2
## infection n
## <chr> <int>
## 1 No 2
## 2 Yes 4