Data used

Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms

You can download it as a CSV to your current working directory. Note that it is also available at: http://jhudatascience.org/intro_to_R_class/data/Bike_Lanes.csv

library(readr)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.7
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(jhur)
library(tidyverse)
library(broom)
library(naniar)

bike <- jhur::read_bike()

## Rows: 1631 Columns: 9

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Part 1

Use the is.na() and any() functions to check if the bike dateInstalled variable has any NA values.

any(is.na(bike$dateInstalled))

## [1] FALSE

#or
bike %>% pull(dateInstalled) %>% is.na() %>% any()

## [1] FALSE
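
Since the bike column has no missing values, the answer above is always FALSE. As a minimal sketch of how the two functions fit together, the toy vector below (made-up values, not taken from bike) contains one NA: is.na() returns one logical per element and any() collapses them into a single answer.

```r
# Toy vector standing in for a column such as bike$dateInstalled (values are made up)
year_installed <- c(2007, 2010, NA, 2011)

# is.na() flags each element; any() asks whether at least one flag is TRUE
na_flags <- is.na(year_installed)
has_missing <- any(na_flags)

# sum() counts the missing entries, which() gives their positions
n_missing <- sum(na_flags)
missing_positions <- which(na_flags)

has_missing
n_missing
missing_positions
```

The same pattern works on any vector, so it can be reused for the other columns of bike.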

Filter bike to keep only the rows where route is NOT missing, using drop_na(), and assign the result to the object have_route.

have_route <- bike %>% drop_na(route)
have_route

## # A tibble: 362 × 9
##    subType name    block    type   numLanes project   route length dateInstalled
##    <chr>   <chr>   <chr>    <chr>     <dbl> <chr>     <chr>  <dbl>         <dbl>
##  1 <NA>    <NA>    <NA>     SIDEP…        1 <NA>      NORT… 1025.           2010
##  2 STRALY  WINSTO… 1200 BL… SIGNE…        2 COLLEGET… COLL…  148.           2007
##  3 STRALY  WINSTO… 1200 BL… SIGNE…        2 COLLEGET… COLL…  366.           2007
##  4 STRALY  WINSTO… 1200 BL… SIGNE…        2 COLLEGET… COLL…  262.           2007
##  5 STRPRD  <NA>    <NA>     SIGNE…        1 COLLEGET… COLL…   49.3          2007
##  6 STRPRD  <NA>    <NA>     SIGNE…        1 COLLEGET… COLL…   70.0          2007
##  7 STRPRD  <NA>    <NA>     SIGNE…        1 COLLEGET… COLL…  765.           2007
##  8 STRPRD  <NA>    <NA>     SIGNE…        2 COLLEGET… COLL…  170.           2007
##  9 STRPRD  <NA>    <NA>     SIGNE…        2 COLLEGET… COLL… 1724.           2007
## 10 STRPRD  ALBEMA… 100 BLK… SIGNE…        1 SOUTHEAS… LITT…  250.           2011
## # … with 352 more rows
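
To see what drop_na() does on a scale that is easy to check by eye, here is a small sketch on a made-up tibble (the column names mirror bike, but the rows are hypothetical). It also shows the equivalent filter() call and what happens when no column is named.

```r
library(dplyr)
library(tidyr)

# Small made-up tibble; column names mirror bike but the rows are hypothetical
lanes <- tibble(
  name  = c("A ST", NA, "C ST", "D ST"),
  route = c("NORTH", "COLLEGE", NA, NA)
)

# drop_na(route) removes only the rows where route is NA
with_route <- lanes %>% drop_na(route)

# The same result written with filter() and is.na()
with_route_filter <- lanes %>% filter(!is.na(route))

# drop_na() with no columns named removes rows with an NA in any column
complete_rows <- lanes %>% drop_na()

nrow(with_route)
nrow(complete_rows)
```

Naming the column matters: here the second row survives drop_na(route) even though its name is missing.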

Use naniar to make a visual of the amount of data missing for each variable of bike (use naniar::gg_miss_var()). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/

naniar::gg_miss_var(bike)

## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

What is the percent complete for the subType variable of bike? Hint: use another naniar function.

naniar::pct_complete(bike$subType) #this or..

## [1] 99.75475

pull(bike, subType) %>% pct_complete() # this

## [1] 99.75475
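
For intuition, the percent complete of a single vector can also be worked out by hand in base R. The sketch below uses a made-up vector, and the result is expected to agree with what pct_complete() reports for the same input.

```r
# Made-up column with one missing value out of four
sub_type <- c("STRALY", NA, "STRPRD", "STRPRD")

# Share of non-missing entries, expressed as a percentage
pct_complete_manual <- mean(!is.na(sub_type)) * 100
pct_complete_manual
```

This works because mean() of a logical vector is the proportion of TRUE values.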

Part 2

New data set

Now imagine we work in a clinic and we are trying to understand more about the blood types of patients.

Let’s say we have the data like so:

BloodType <- tibble(
  gender = c("M", "Male", "Female", "F", "M",
             "Male", "Other", "M", "F", "Other",
             "F", "Male", NA, "Male", "Female"),
  type = c("A.-", "AB.+", "O.-", "O.+", "AB.-",
           "B.+", "B.-", "o.-", "O.+", "A.-",
           "A.+", "O.-", "B.-", "o.+", "AB.-"),
  infection = c("Yes", "No", "Yes", "No", "No",
                "No", "Yes", "No", "Yes", "No",
                "No", "Yes", "Yes", "Yes", "NotSure")
)

BloodType

## # A tibble: 15 × 3
##    gender type  infection
##    <chr>  <chr> <chr>    
##  1 M      A.-   Yes      
##  2 Male   AB.+  No       
##  3 Female O.-   Yes      
##  4 F      O.+   No       
##  5 M      AB.-  No       
##  6 Male   B.+   No       
##  7 Other  B.-   Yes      
##  8 M      o.-   No       
##  9 F      O.+   Yes      
## 10 Other  A.-   No       
## 11 F      A.+   No       
## 12 Male   O.-   Yes      
## 13 <NA>   B.-   Yes      
## 14 Male   o.+   Yes      
## 15 Female AB.-  NotSure

There are some issues with this data that we need to figure out!

Determine how many NA values there are for gender.

count(BloodType, gender)

## # A tibble: 6 × 2
##   gender     n
##   <chr>  <int>
## 1 F          3
## 2 Female     2
## 3 M          3
## 4 Male       4
## 5 Other      2
## 6 <NA>       1

sum(is.na(pull(BloodType, gender)))

## [1] 1

BloodType %>% pull(gender) %>% is.na() %>% sum()

## [1] 1
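
When several variables may have missing values, it can be handy to count the NA values for every column in one call. This is a base R sketch on a made-up data frame, not the BloodType data itself.

```r
# Made-up data frame with missing values in two columns
patients <- data.frame(
  gender    = c("M", NA, "F", "Other"),
  type      = c("A.-", "O.+", NA, NA),
  infection = c("Yes", "No", "No", "Yes")
)

# is.na() on a data frame returns a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(patients))
na_per_column
```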

Recode the gender variable of the BloodType data so that it is consistent. Use case_when().

BloodType <- BloodType %>%
  mutate(gender = case_when(gender %in% c("M", "m", "Male") ~ "Male", 
                            gender %in% c("F", "female", "Female") ~ "Female",
                            gender %in% c("Other") ~ "Other"))

count(BloodType, gender)

## # A tibble: 4 × 2
##   gender     n
##   <chr>  <int>
## 1 Female     5
## 2 Male       7
## 3 Other      2
## 4 <NA>       1
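
One thing to keep in mind with case_when(): any value that matches none of the conditions becomes NA. The sketch below, on a made-up column, contrasts that behavior with a final TRUE ~ gender rule that leaves unmatched values as they were. The column names strict and lenient are chosen only for illustration.

```r
library(dplyr)

# Made-up gender column including a value ("unknown") that no rule mentions
toy <- tibble(gender = c("M", "female", "unknown", NA))

recoded <- toy %>%
  mutate(
    # Without a fallback, "unknown" matches no condition and becomes NA
    strict = case_when(
      gender %in% c("M", "m", "Male") ~ "Male",
      gender %in% c("F", "female", "Female") ~ "Female"
    ),
    # TRUE ~ gender keeps any unmatched value as it was
    lenient = case_when(
      gender %in% c("M", "m", "Male") ~ "Male",
      gender %in% c("F", "female", "Female") ~ "Female",
      TRUE ~ gender
    )
  )

recoded
```

Which version is preferable depends on the goal: the strict form surfaces unexpected values as NA, while the lenient form preserves them for later inspection.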

Check to see how many values gender has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.

BloodType %>% count(gender)

## # A tibble: 4 × 2
##   gender     n
##   <chr>  <int>
## 1 Female     5
## 2 Male       7
## 3 Other      2
## 4 <NA>       1

Recode the type variable of the BloodType data to be consistent. Use recode.

BloodType <- BloodType %>%
  mutate(type = recode(type, "o.-" = "O.-", 
                             "o.+" = "O.+"))
BloodType

## # A tibble: 15 × 3
##    gender type  infection
##    <chr>  <chr> <chr>    
##  1 Male   A.-   Yes      
##  2 Male   AB.+  No       
##  3 Female O.-   Yes      
##  4 Female O.+   No       
##  5 Male   AB.-  No       
##  6 Male   B.+   No       
##  7 Other  B.-   Yes      
##  8 Male   O.-   No       
##  9 Female O.+   Yes      
## 10 Other  A.-   No       
## 11 Female A.+   No       
## 12 Male   O.-   Yes      
## 13 <NA>   B.-   Yes      
## 14 Male   O.+   Yes      
## 15 Female AB.-  NotSure
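
recode() needs one pair per value to fix. When the only inconsistency is capitalization, as it is here, a possible alternative is to upper-case the whole column. This is a sketch on made-up values, shown as another option rather than as the lab's intended answer.

```r
library(dplyr)

# Made-up blood types with inconsistent capitalization
toy_types <- tibble(type = c("o.-", "O.+", "ab.-", "A.+"))

# toupper() upper-cases every letter and leaves ".", "+" and "-" untouched
toy_types <- toy_types %>%
  mutate(type = toupper(type))

toy_types$type
```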

Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”

BloodType %>% pull(type) %>% table(useNA = "ifany")

## .
##  A.-  A.+ AB.- AB.+  B.-  B.+  O.-  O.+ 
##    2    1    2    1    2    1    3    3

 #or
BloodType %>% count(type)

## # A tibble: 8 × 2
##   type      n
##   <chr> <int>
## 1 A.-       2
## 2 A.+       1
## 3 AB.-      2
## 4 AB.+      1
## 5 B.-       2
## 6 B.+       1
## 7 O.-       3
## 8 O.+       3
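
Reading a table works for a handful of categories. The same check can also be done programmatically, which scales better. A minimal base R sketch, with the allowed labels written using the period separator that appears in the data and a made-up vector of observed values:

```r
# Allowed blood type labels, written with the period separator used in the data
allowed_types <- c("A.-", "A.+", "AB.-", "AB.+", "B.-", "B.+", "O.-", "O.+")

# Made-up observed values, one of which ("o.+") is not allowed
observed_types <- c("A.-", "O.+", "o.+", "B.-")

# TRUE only if every observed value is in the allowed set
all_valid <- all(observed_types %in% allowed_types)

# setdiff() lists the offending values, if any
unexpected <- setdiff(observed_types, allowed_types)

all_valid
unexpected
```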

Make a new tibble from BloodType called BloodType_split that splits the type variable into two variables called blood_type and Rhfactor. Note: sep is interpreted as a regular expression, where a period is a special character that matches any character, so we need \\ in front of it to tell R to treat it as a literal period.

##______________ <- ________ %>% 
#________(____, ____ = c(__________, __________), sep = "\\.")

BloodType_split  <- BloodType %>% separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
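
To make the escaping rule concrete, the sketch below first compares an unescaped and an escaped period on a string that contains no period at all, then runs the same separate() call on a made-up tibble (toy_blood is a hypothetical name).

```r
library(dplyr)
library(tidyr)

# "AB+" contains no period, yet an unescaped "." still matches
grepl(".", "AB+")                # regex: "." means any single character
grepl("\\.", "AB+")              # regex: "\\." means a literal period
grepl(".", "AB+", fixed = TRUE)  # fixed = TRUE turns regex off: literal period

# Made-up tibble in the same layout as the type column of BloodType
toy_blood <- tibble(type = c("A.-", "AB.+", "O.-"))

toy_split <- toy_blood %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")

toy_split
```

The first grepl() call returns TRUE and the other two return FALSE, which is why the escaped form is the one to pass to sep.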

Bonus: How many observations are there for each Rhfactor?

table(BloodType_split$Rhfactor) # base R

## 
## - + 
## 9 6

count(BloodType_split,Rhfactor)

## # A tibble: 2 × 2
##   Rhfactor     n
##   <chr>    <int>
## 1 -            9
## 2 +            6

Bonus: Filtering for patients with type O, how many had the infection?

BloodType_split %>%
  filter(blood_type == "O") %>%
  count(infection)

## # A tibble: 2 × 2
##   infection     n
##   <chr>     <int>
## 1 No            2
## 2 Yes           4
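
If shares are more useful than raw counts, the same pipeline can be extended with one mutate() step. This is a sketch on made-up data in the same shape as BloodType_split. The names toy_patients and prop are assumptions for illustration.

```r
library(dplyr)

# Made-up data in the same shape as BloodType_split
toy_patients <- tibble(
  blood_type = c("O", "O", "O", "A", "B"),
  infection  = c("Yes", "No", "Yes", "Yes", "No")
)

type_o_summary <- toy_patients %>%
  filter(blood_type == "O") %>%
  count(infection) %>%
  # n / sum(n) turns the counts into shares within the filtered rows
  mutate(prop = n / sum(n))

type_o_summary
```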

Data Cleaning Lab Key
