Data used
Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
You can download it as a CSV into your current working directory. Note that it’s also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv
library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(jhur)
library(broom)
# install.packages("naniar")
library(naniar)
Read in the bike data. You can either read it directly from the URL or download the file first.
bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv")
## Rows: 1631 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the is.na() and any() functions to check if the bike dateInstalled variable has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.
# General format
TIBBLE %>%
pull(COLUMN) %>%
is.na() %>%
any()
bike %>%
pull(dateInstalled) %>%
is.na() %>%
any()
## [1] FALSE
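As a cross-check, base R also provides anyNA(), which should give the same answer as chaining is.na() and any(). The sketch below uses a small made-up tibble; the object toy and its columns x and y are hypothetical and not part of the bike data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: y has one missing value, x has none
toy <- tibble(x = c(1, 2, 3), y = c("a", NA, "c"))

# Pipe version, following the general format used in this lab
x_has_na <- toy %>% pull(x) %>% is.na() %>% any()
y_has_na <- toy %>% pull(y) %>% is.na() %>% any()

# Base R shortcut that does both steps in one call
y_has_na_short <- anyNA(pull(toy, y))

x_has_na
y_has_na
y_has_na_short
```

Both approaches need a vector rather than a tibble column selection, which is why pull() comes first.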
Clean the rows of bike so that only rows that do NOT have missing values for the route variable remain, using drop_na(). Assign the result to the object have_route.
have_route <- bike %>% drop_na(route)
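To see what drop_na() does to a single column, here is a minimal sketch on made-up data (the tibble toy is hypothetical and only stands in for bike). It also shows a filter() call that should keep the same rows.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for bike
toy <- tibble(
  name  = c("a", "b", "c", "d"),
  route = c("R1", NA, "R2", NA)
)

# drop_na() removes rows where the named column is NA
kept_drop <- toy %>% drop_na(route)

# An equivalent filter() call
kept_filter <- toy %>% filter(!is.na(route))

nrow(toy)       # number of rows before
nrow(kept_drop) # number of rows after
identical(pull(kept_drop, name), pull(kept_filter, name))
```

Naming the column matters: drop_na() with no arguments drops rows that have an NA in any column.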
Use naniar to make a visual of the amount of data missing for each variable of bike (use gg_miss_var()). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
gg_miss_var(bike)
What percentage of the subType variable of bike is complete? Hint: use another naniar function.
pull(bike, subType) %>% pct_complete() # this
## [1] 99.75475
miss_var_summary(bike) # or this
## # A tibble: 9 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 route 1269 77.8
## 2 block 215 13.2
## 3 project 74 4.54
## 4 name 12 0.736
## 5 type 9 0.552
## 6 subType 4 0.245
## 7 numLanes 0 0
## 8 length 0 0
## 9 dateInstalled 0 0
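For intuition, the percentage complete is the share of non-missing values, so it can be checked by hand with base R. The sketch below uses a hypothetical vector x, then repeats the arithmetic with the row count (1631) and the subType n_miss value (4) from the output above.

```r
# Hypothetical toy vector with 1 missing value out of 4
x <- c("a", NA, "c", "d")

# Share of non-missing and missing values, as percentages
pct_complete_by_hand <- mean(!is.na(x)) * 100
pct_missing_by_hand  <- mean(is.na(x)) * 100

pct_complete_by_hand
pct_missing_by_hand

# Same arithmetic for subType: 4 missing values out of 1631 rows
subtype_pct_complete <- (1631 - 4) / 1631 * 100
subtype_pct_complete
```

The two percentages always sum to 100, which is why either pct_complete() or the pct_miss column of miss_var_summary() answers the question.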
New Dataset
Now imagine we work in a clinic and we are trying to understand more about blood types of patients.
Let’s say we have the data like so:
BloodType <- tibble(
weight_loss =
c(
"Y", "No", "Yes", "y", "no",
"n", "No", "N", "yes", "Yes",
"No", "N", NA, "N", "Other"
),
type = c(
"A.-", "AB.+", "O.-", "O.+", "AB.-",
"B.+", "B.-", "o.-", "O.+", "A.-",
"A.+", "O.-", "B.-", "o.+", "AB.-"
),
infection = c(
"Yes", "No", "Yes", "No", "No",
"No", "Yes", "No", "Yes", "No",
"No", "Yes", "Yes", "Yes", "NotSure"
)
)
BloodType
## # A tibble: 15 × 3
## weight_loss type infection
## <chr> <chr> <chr>
## 1 Y A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 y O.+ No
## 5 no AB.- No
## 6 n B.+ No
## 7 No B.- Yes
## 8 N o.- No
## 9 yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 N O.- Yes
## 13 <NA> B.- Yes
## 14 N o.+ Yes
## 15 Other AB.- NotSure
There are some issues with this data that we need to figure out!
Determine how many NA values there are for weight_loss (assume you know that N and n both stand for no).
count(BloodType, weight_loss) # the simple way
## # A tibble: 10 × 2
## weight_loss n
## <chr> <int>
## 1 N 3
## 2 No 3
## 3 Other 1
## 4 Y 1
## 5 Yes 2
## 6 n 1
## 7 no 1
## 8 y 1
## 9 yes 1
## 10 <NA> 1
sum(is.na(pull(BloodType, weight_loss))) # another way
## [1] 1
BloodType %>% # another way
pull(weight_loss) %>%
is.na() %>%
sum()
## [1] 1
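If you want the NA count for every column at once rather than one column at a time, summarise() with across() is one option. This is a sketch on a hypothetical tibble toy, not the lab's required approach.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one NA in two of the three columns
toy <- tibble(
  weight_loss = c("Y", NA, "No"),
  type        = c("A.-", "O.+", "B.-"),
  infection   = c("Yes", "No", NA)
)

# Count missing values in every column in a single step
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

The result is a one-row tibble with one count per original column.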
Recode the weight_loss variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
mutate(NEW_COLUMN = case_when(
OLD_COLUMN %in% c( ... ) ~ ... ,
OLD_COLUMN %in% c( ... ) ~ ... ,
TRUE ~ OLD_COLUMN
))
BloodType <- BloodType %>%
mutate(weight_loss = case_when(
weight_loss %in% c("N", "n", "No", "no") ~ "No",
weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
TRUE ~ weight_loss # keeps every unmatched value ("Other" and NA) as is; without this line "Other" would become NA, so including it is good practice unless we want to create NAs
))
count(BloodType, weight_loss)
## # A tibble: 4 × 2
## weight_loss n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
Check to see how many values weight_loss has for each category (hint: use count()). It’s good practice to regularly check your data throughout the data wrangling process.
BloodType %>% count(weight_loss)
## # A tibble: 4 × 2
## weight_loss n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
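An alternative to listing every spelling is to lower-case the values before matching, so that only one spelling per category has to be written out. The sketch below swaps in base R's tolower() for that purpose; it is not the method the lab asks for, and the tibble toy is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with mixed spellings, an "Other", and an NA
toy <- tibble(weight_loss = c("Y", "no", "YES", "n", "Other", NA))

recoded <- toy %>%
  mutate(weight_loss = case_when(
    tolower(weight_loss) %in% c("n", "no")  ~ "No",
    tolower(weight_loss) %in% c("y", "yes") ~ "Yes",
    TRUE ~ weight_loss # leave "Other" and NA unchanged
  ))

count(recoded, weight_loss)
```

This also catches spellings such as "YES" that an explicit list might miss, at the cost of being less explicit about which raw values were present.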
Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?
BloodType <- BloodType %>%
mutate(type = case_when(
type == "o.-" ~ "O.-",
type == "o.+" ~ "O.+",
TRUE ~ type))
BloodType
## # A tibble: 15 × 3
## weight_loss type infection
## <chr> <chr> <chr>
## 1 Yes A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 Yes O.+ No
## 5 No AB.- No
## 6 No B.+ No
## 7 No B.- Yes
## 8 No O.- No
## 9 Yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 No O.- Yes
## 13 <NA> B.- Yes
## 14 No O.+ Yes
## 15 Other AB.- NotSure
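The extra step is the TRUE ~ fallback. It matters whenever the conditions do not cover every value in the column, because case_when() returns NA for anything left unmatched; if the conditions already cover all values, the fallback changes nothing. A minimal sketch on a hypothetical tibble toy:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: only the first value needs fixing
toy <- tibble(type = c("o.-", "O.+", "A.-"))

# With the fallback, unmatched values keep their original value
with_fallback <- toy %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-",
    TRUE ~ type
  ))

# Without the fallback, unmatched values become NA
without_fallback <- toy %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-"
  ))

pull(with_fallback, type)
pull(without_fallback, type)
```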
Check to see that type only has these possible values:
“A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”
BloodType %>% count(type)
## # A tibble: 8 × 2
## type n
## <chr> <int>
## 1 A.+ 1
## 2 A.- 2
## 3 AB.+ 1
## 4 AB.- 2
## 5 B.+ 1
## 6 B.- 2
## 7 O.+ 3
## 8 O.- 3
Make a new tibble from BloodType called BloodType_split that splits the type variable into two variables called blood_type and Rhfactor. Note: the separating character is interpreted as a regular expression, in which a period is a special character that matches any character. We therefore need “\\.” instead of simply “.” to tell R that we want it to be interpreted as a literal period. Make sure you use quotes around “\\.” and the column names as shown below (not backticks).
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
separate(OLD_COLUMN,
into = c("NEW_COLUMN1", "NEW_COLUMN2"),
sep = "SEPARATING_CHARACTER")
BloodType_split <- BloodType %>%
separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
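To see why the escape is needed, the sketch below uses base R's strsplit(), which also treats its split argument as a regular expression by default. It only illustrates the regex behavior and is not a replacement for separate(); the value x is a made-up example.

```r
# Hypothetical example value in the same format as the type column
x <- "AB.+"

# Escaped period: split on the literal "."
escaped <- strsplit(x, "\\.")[[1]]

# fixed = TRUE turns off regex interpretation, so "." is literal
fixed_split <- strsplit(x, ".", fixed = TRUE)[[1]]

# Unescaped period: "." matches every character, so each character
# is treated as a separator and only empty strings are left
unescaped <- strsplit(x, ".")[[1]]

escaped
fixed_split
unescaped
```

The first two calls recover the two pieces "AB" and "+", while the unescaped version loses the content entirely.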
How many observations are there for each Rhfactor in the data object you just made?
count(BloodType_split, Rhfactor)
## # A tibble: 2 × 2
## Rhfactor n
## <chr> <int>
## 1 + 6
## 2 - 9
Filtering for patients with type O, how many had the infection?
BloodType_split %>%
filter(blood_type == "O") %>%
count(infection)
## # A tibble: 2 × 2
## infection n
## <chr> <int>
## 1 No 2
## 2 Yes 4
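If only the single number is needed, both conditions can go into filter(), or count() can be given two variables to show every combination at once. A sketch on a hypothetical tibble toy:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data in the same shape as BloodType_split
toy <- tibble(
  blood_type = c("O", "O", "A", "O"),
  infection  = c("Yes", "No", "Yes", "Yes")
)

# Number of type O patients with an infection
n_o_infected <- toy %>%
  filter(blood_type == "O", infection == "Yes") %>%
  nrow()

# Counts for every blood type and infection combination
by_type <- toy %>% count(blood_type, infection)

n_o_infected
by_type
```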