Data used
Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
You can download it as a CSV into your current working directory. Note that it’s also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv
library(tidyverse)
# install.packages("naniar")
library(naniar)
Read in the bike data; you can use the URL or download the file. Save the data as an object called bike.
bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv")
## Rows: 1631 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the is.na() and any() functions to check if the bike dateInstalled variable has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.
# General format
TIBBLE %>%
  pull(COLUMN) %>%
  is.na() %>%
  any()
bike %>%
  pull(dateInstalled) %>%
  is.na() %>%
  any()
## [1] FALSE
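As a cross-check on the pipe above, base R's anyNA() performs the same test in one call. The sketch below uses a small hypothetical tibble (toy, with a made-up year column) rather than the bike data, so it runs without a download.

```r
library(tidyverse)

# Hypothetical toy data, not the bike dataset
toy <- tibble(year = c(2007, NA, 2010))

# Pipe version, same shape as the exercise
piped <- toy %>%
  pull(year) %>%
  is.na() %>%
  any()

# Base R shortcut: anyNA() combines is.na() and any()
shortcut <- anyNA(pull(toy, year))

piped
shortcut
```

Both approaches should agree; the pipe version is more verbose but makes each step visible.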
Clean the rows of bike so that only rows that do NOT have missing values for the route variable remain, using drop_na. Assign this to the object have_route.
have_route <- bike %>% drop_na(route)
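To see why naming the column matters, here is a minimal sketch on a hypothetical two-column tibble (toy is made up for illustration). Calling drop_na() with a column only looks at that column, while calling it with no columns drops any row that has an NA anywhere.

```r
library(tidyverse)

# Hypothetical toy data with NAs in different columns
toy <- tibble(
  name  = c("A St", NA, "C St"),
  route = c("R1", "R2", NA)
)

# Only rows with a missing route are removed (row 3)
only_route <- toy %>% drop_na(route)

# Rows with a missing value in any column are removed (rows 2 and 3)
all_cols <- toy %>% drop_na()

nrow(only_route)
nrow(all_cols)
```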
Use naniar to make a visual of the amount of data missing for each variable of bike (use gg_miss_var() with show_pct = TRUE as an argument). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
gg_miss_var(bike, show_pct = TRUE)
What percentage of the subType variable of bike is complete? Hint: use another naniar function.
pull(bike, subType) %>% pct_complete() # this
## [1] 99.75475
miss_var_summary(bike) # or this
## # A tibble: 9 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 route 1269 77.8
## 2 block 215 13.2
## 3 project 74 4.54
## 4 name 12 0.736
## 5 type 9 0.552
## 6 subType 4 0.245
## 7 numLanes 0 0
## 8 length 0 0
## 9 dateInstalled 0 0
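The two answers above are linked by simple arithmetic: the percent complete is 100 minus the percent missing. A minimal base R sketch, using a hypothetical toy vector plus the subType counts reported in the summary above (4 missing out of 1631 rows):

```r
# Hypothetical toy vector with one missing value out of four
x <- c("a", NA, "b", "c")

# Percent complete by hand: share of non-missing values times 100
pct_by_hand <- 100 * mean(!is.na(x))
pct_by_hand

# Same arithmetic with the subType counts from the summary table:
# 4 missing values out of 1631 rows
pct_subtype <- 100 * (1631 - 4) / 1631
pct_subtype
```

The second value should match the pct_complete() result for subType up to rounding.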
Use the na_if function to replace values of 0 in the dateInstalled variable with NA. Check your work using the count function.
bike <- bike %>%
  mutate(dateInstalled = na_if(dateInstalled, 0))
count(bike, dateInstalled)
## # A tibble: 9 × 2
## dateInstalled n
## <dbl> <int>
## 1 2006 2
## 2 2007 368
## 3 2008 206
## 4 2009 86
## 5 2010 625
## 6 2011 101
## 7 2012 107
## 8 2013 10
## 9 NA 126
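The same replacement can be tried on a tiny hypothetical tibble (toy, with a made-up year column) to confirm that na_if() only touches values equal to the one you name and leaves everything else alone.

```r
library(tidyverse)

# Hypothetical toy data where 0 is a placeholder for "unknown"
toy <- tibble(year = c(2007, 0, 2010, 0))

# Turn the placeholder zeros into real missing values
toy_clean <- toy %>%
  mutate(year = na_if(year, 0))

# The zeros should now be counted under NA
count(toy_clean, year)
```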
New Data set
Now imagine we work in a clinic and we are trying to understand more about blood types of patients.
Let’s say we have the data like so:
BloodType <- tibble(
  exposure = c(
    "Y", "No", "Yes", "y", "no",
    "n", "No", "N", "yes", "Yes",
    "No", "N", NA, "N", "Other"
  ),
  type = c(
    "A.-", "AB.+", "O.-", "O.+", "AB.-",
    "B.+", "B.-", "o.-", "O.+", "A.-",
    "A.+", "O.-", "B.-", "o.+", "AB.-"
  ),
  infection = c(
    "Yes", "No", "Yes", "No", "No",
    "No", "Yes", "No", "Yes", "No",
    "No", "Yes", "Yes", "Yes", "NotSure"
  )
)
BloodType
## # A tibble: 15 × 3
## exposure type infection
## <chr> <chr> <chr>
## 1 Y A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 y O.+ No
## 5 no AB.- No
## 6 n B.+ No
## 7 No B.- Yes
## 8 N o.- No
## 9 yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 N O.- Yes
## 13 <NA> B.- Yes
## 14 N o.+ Yes
## 15 Other AB.- NotSure
There are some issues with this data that we need to figure out!
Determine how many NA values there are for exposure (assume you know that N and n stand for no).
count(BloodType, exposure) # the simple way
## # A tibble: 10 × 2
## exposure n
## <chr> <int>
## 1 N 3
## 2 No 3
## 3 Other 1
## 4 Y 1
## 5 Yes 2
## 6 n 1
## 7 no 1
## 8 y 1
## 9 yes 1
## 10 <NA> 1
sum(is.na(pull(BloodType, exposure))) # another way
## [1] 1
BloodType %>% # another way
  pull(exposure) %>%
  is.na() %>%
  sum()
## [1] 1
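If you want the NA count for every column at once rather than one column at a time, one possible approach is summarize() with across(). This goes beyond what the exercise asks for; the tibble toy and its values are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data with NAs in both columns
toy <- tibble(
  exposure  = c("Yes", NA, "No"),
  infection = c(NA, NA, "Yes")
)

# Count the missing values in every column in one step
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```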
Recode the exposure variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  mutate(NEW_COLUMN = case_when(
    OLD_COLUMN %in% c( ... ) ~ ... ,
    OLD_COLUMN %in% c( ... ) ~ ... ,
    TRUE ~ OLD_COLUMN
  ))
BloodType <- BloodType %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    # The fallback keeps every unmatched value as it is: here "Other" and NA.
    # Without it, "Other" would become NA, so it is good practice to include
    # it unless we want to create NAs.
    TRUE ~ exposure
  ))
count(BloodType, exposure)
## # A tibble: 4 × 2
## exposure n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
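To make the role of the TRUE ~ fallback concrete, the sketch below recodes a small hypothetical tibble (toy) twice, once without the fallback and once with it. Any value that matches no condition becomes NA when the fallback is missing.

```r
library(tidyverse)

# Hypothetical toy data covering each kind of value
toy <- tibble(exposure = c("Y", "no", "Other", NA))

# No fallback: values that match no condition become NA
no_fallback <- toy %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes"
  ))

# With fallback: unmatched values keep their original value
with_fallback <- toy %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    TRUE ~ exposure
  ))

no_fallback
with_fallback
```

In the first result "Other" is lost; in the second it survives. When every value in a column is covered by some condition, the fallback changes nothing, which is why it sometimes matters and sometimes does not.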
Check to see how many values exposure has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.
BloodType %>% count(exposure)
## # A tibble: 4 × 2
## exposure n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?
BloodType <- BloodType %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-",
    type == "o.+" ~ "O.+",
    TRUE ~ type
  ))
BloodType
## # A tibble: 15 × 3
## exposure type infection
## <chr> <chr> <chr>
## 1 Yes A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 Yes O.+ No
## 5 No AB.- No
## 6 No B.+ No
## 7 No B.- Yes
## 8 No O.- No
## 9 Yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 No O.- Yes
## 13 <NA> B.- Yes
## 14 No O.+ Yes
## 15 Other AB.- NotSure
Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”
BloodType %>% count(type)
## # A tibble: 8 × 2
## type n
## <chr> <int>
## 1 A.+ 1
## 2 A.- 2
## 3 AB.+ 1
## 4 AB.- 2
## 5 B.+ 1
## 6 B.- 2
## 7 O.+ 3
## 8 O.- 3
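The exercise asks for case_when(), but the same cleanup could plausibly be done with a pattern replacement instead. This is an alternative technique, not the intended answer: a sketch using str_replace() from stringr (loaded with the tidyverse) on a hypothetical tibble toy, where "^o" matches a lower case o only at the start of the string.

```r
library(tidyverse)

# Hypothetical toy data with the same kind of inconsistency
toy <- tibble(type = c("o.-", "O.+", "AB.-", "o.+"))

# Replace a leading lower case "o" with a capital "O"
toy_fixed <- toy %>%
  mutate(type = str_replace(type, "^o", "O"))

count(toy_fixed, type)
```

Values without a leading lower case o are returned unchanged, so no fallback step is needed with this approach.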
Make a new tibble of BloodType called BloodType_split that splits the type variable into two variables called blood_type and Rhfactor. Note: periods are special characters that are generally interpreted as wildcards, so we need “\\.” instead of simply “.” as the separating character to tell R that we want a literal period. Make sure you use quotes around “\\.” and the column names as shown below (not backticks).
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  separate(OLD_COLUMN,
    into = c("NEW_COLUMN1", "NEW_COLUMN2"),
    sep = "SEPARATING_CHARACTER"
  )
BloodType_split <- BloodType %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
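The difference between an unescaped and an escaped period can be checked directly with grepl() before trusting it inside separate(). The sketch below is illustrative only; the tibble toy is hypothetical.

```r
library(tidyverse)

# An unescaped "." is a wildcard: it matches any single character,
# so it "finds a period" even in a string that has none
wildcard_hit <- grepl(".", "AB+")

# An escaped period only matches a literal "."
literal_miss <- grepl("\\.", "AB+")
literal_hit <- grepl("\\.", "AB.+")

wildcard_hit
literal_miss
literal_hit

# Hypothetical toy data split on the literal period
toy <- tibble(type = c("AB.+", "O.-"))
toy_split <- toy %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")

toy_split
```

The double backslash is needed because R first reads "\\." as the two characters \. and only then passes them to the regular expression engine.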
How many observations are there for each Rhfactor in the data object you just made?
count(BloodType_split, Rhfactor)
## # A tibble: 2 × 2
## Rhfactor n
## <chr> <int>
## 1 + 6
## 2 - 9
Filtering for patients with type O, how many had the infection?
BloodType_split %>%
  filter(blood_type == "O") %>%
  count(infection)
## # A tibble: 2 × 2
## infection n
## <chr> <int>
## 1 No 2
## 2 Yes 4
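A variation on the filter-then-count pattern is to count both variables together, which shows every blood type at once. The sketch below assumes a hypothetical tibble toy that reuses the column names blood_type and infection; its values are made up.

```r
library(tidyverse)

# Hypothetical toy data with the same column names as BloodType_split
toy <- tibble(
  blood_type = c("O", "O", "A", "O", "B"),
  infection  = c("Yes", "No", "Yes", "Yes", "No")
)

# Cross-tabulate both variables at once instead of filtering first
toy %>% count(blood_type, infection)

# Or pull out a single number for one combination
n_o_infected <- toy %>%
  filter(blood_type == "O", infection == "Yes") %>%
  nrow()

n_o_infected
```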