Part 1

Data used

Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms

You can download it as a CSV to your current working directory. Note that it's also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv

library(tidyverse)
# install.packages("naniar")
library(naniar)

Read in the bike data and save it as an object called bike. You can read it directly from the URL or download the file first.

bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv")

## Rows: 1631 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.1

Use the is.na() and any() functions to check if the bike dateInstalled variable has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.

# General format
TIBBLE %>%
  pull(COLUMN) %>%
  is.na() %>%
  any()

bike %>%
  pull(dateInstalled) %>%
  is.na() %>%
  any()

## [1] FALSE
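
As a side note, base R's anyNA() performs the same check in one step. Here is a minimal sketch on a made-up tibble; the names toy and year are illustrative assumptions, not part of the bike data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the bike tibble
toy <- tibble(year = c(2007, NA, 2010))

# Pipe version, following the general format above
piped <- toy %>%
  pull(year) %>%
  is.na() %>%
  any()

# Base R shortcut: anyNA() combines is.na() and any()
short <- anyNA(pull(toy, year))

piped
short
```

Both versions should agree; the piped form is the one the lab asks for.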

1.2

Filter the rows of bike so that only rows that do NOT have missing values for the route variable remain, using drop_na(). Assign this to the object have_route.

have_route <- bike %>% drop_na(route)
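
To see what naming a column in drop_na() changes, here is a small sketch on a made-up tibble (toy and its values are assumptions for illustration only). Naming route drops rows where route is missing and leaves other NAs alone; naming no columns drops every row that has any NA.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data with NAs in two columns
toy <- tibble(
  id = 1:4,
  route = c("NORTH", NA, "SOUTH", NA),
  block = c(NA, "100 BLK", "200 BLK", NA)
)

# Drop rows only where route is missing; NAs in block are kept
toy_route <- toy %>% drop_na(route)

# With no columns named, any row containing an NA is dropped
toy_complete <- toy %>% drop_na()

nrow(toy_route)
nrow(toy_complete)
```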

1.3

Use naniar to make a visual of the amount of data missing for each variable of bike (use gg_miss_var() with show_pct = TRUE as an argument). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/

gg_miss_var(bike, show_pct = TRUE)

Practice on Your Own!

P.1

What percentage of the subType variable of bike is complete? Hint: use another naniar function.

pull(bike, subType) %>% pct_complete() # this

## [1] 99.75475

miss_var_summary(bike) # or this

## # A tibble: 9 × 3
##   variable      n_miss pct_miss
##   <chr>          <int>    <num>
## 1 route           1269   77.8  
## 2 block            215   13.2  
## 3 project           74    4.54 
## 4 name              12    0.736
## 5 type               9    0.552
## 6 subType            4    0.245
## 7 numLanes           0    0    
## 8 length             0    0    
## 9 dateInstalled      0    0
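
The percentage can also be cross-checked by hand: the mean of a logical vector is the proportion of TRUE values. The sketch below uses a made-up vector (x is an assumption) and then repeats the arithmetic with the counts reported above, 4 missing out of 1631 rows.

```r
# Hypothetical toy vector with one missing value out of four
x <- c("a", NA, "b", "c")

# Proportion of non-missing values, expressed as a percentage
pct_by_hand <- mean(!is.na(x)) * 100
pct_by_hand

# Same arithmetic using the counts from miss_var_summary() above
pct_subtype <- (1631 - 4) / 1631 * 100
pct_subtype
```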

P.2

Use the na_if() function to replace values of 0 in the dateInstalled variable with NA. Check your work using the count() function.

bike <- bike %>% 
  mutate(dateInstalled = na_if(dateInstalled, 0))
count(bike, dateInstalled)

## # A tibble: 9 × 2
##   dateInstalled     n
##           <dbl> <int>
## 1          2006     2
## 2          2007   368
## 3          2008   206
## 4          2009    86
## 5          2010   625
## 6          2011   101
## 7          2012   107
## 8          2013    10
## 9            NA   126
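
The same pattern on a made-up tibble makes the effect easy to verify by eye. This is only a sketch; toy and year are illustrative names, not from the bike data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data where 0 is a placeholder for "unknown"
toy <- tibble(year = c(2007, 0, 2010, 0))

# Turn the placeholder 0 into a real NA
toy_clean <- toy %>%
  mutate(year = na_if(year, 0))

# Count the NAs that were created
n_missing <- toy_clean %>%
  pull(year) %>%
  is.na() %>%
  sum()

n_missing
```

Once the placeholder is a real NA, checks such as is.na() and the naniar summaries can detect it.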

Part 2

New Data set

Now imagine we work in a clinic and we are trying to understand more about blood types of patients.

Let’s say we have the data like so:

BloodType <- tibble(
  exposure =
    c(
      "Y", "No", "Yes", "y", "no",
      "n", "No", "N", "yes", "Yes",
      "No", "N", NA, "N", "Other"
    ),
  type = c(
    "A.-", "AB.+", "O.-", "O.+", "AB.-",
    "B.+", "B.-", "o.-", "O.+", "A.-",
    "A.+", "O.-", "B.-", "o.+", "AB.-"
  ),
  infection = c(
    "Yes", "No", "Yes", "No", "No",
    "No", "Yes", "No", "Yes", "No",
    "No", "Yes", "Yes", "Yes", "NotSure"
  )
)

BloodType

## # A tibble: 15 × 3
##    exposure type  infection
##    <chr>    <chr> <chr>    
##  1 Y        A.-   Yes      
##  2 No       AB.+  No       
##  3 Yes      O.-   Yes      
##  4 y        O.+   No       
##  5 no       AB.-  No       
##  6 n        B.+   No       
##  7 No       B.-   Yes      
##  8 N        o.-   No       
##  9 yes      O.+   Yes      
## 10 Yes      A.-   No       
## 11 No       A.+   No       
## 12 N        O.-   Yes      
## 13 <NA>     B.-   Yes      
## 14 N        o.+   Yes      
## 15 Other    AB.-  NotSure

There are some issues with this data that we need to figure out!

2.1

Determine how many NA values there are for exposure (assume you know that N and n are for no).

count(BloodType, exposure) # the simple way

## # A tibble: 10 × 2
##    exposure     n
##    <chr>    <int>
##  1 N            3
##  2 No           3
##  3 Other        1
##  4 Y            1
##  5 Yes          2
##  6 n            1
##  7 no           1
##  8 y            1
##  9 yes          1
## 10 <NA>         1

sum(is.na(pull(BloodType, exposure))) # another way

## [1] 1

BloodType %>% # another way
  pull(exposure) %>%
  is.na() %>%
  sum()

## [1] 1

2.2

Recode the exposure variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  mutate(NEW_COLUMN = case_when(
    OLD_COLUMN %in% c( ... ) ~ ... ,
    OLD_COLUMN %in% c( ... ) ~ ... ,
    TRUE ~ OLD_COLUMN
  ))

BloodType <- BloodType %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    TRUE ~ exposure # keeps "Other" and NA unchanged; without this line "Other" would also become NA
  ))

count(BloodType, exposure)

## # A tibble: 4 × 2
##   exposure     n
##   <chr>    <int>
## 1 No           8
## 2 Other        1
## 3 Yes          5
## 4 <NA>         1
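
To see why the final TRUE ~ exposure line matters here, the sketch below runs the recode with and without it on a made-up tibble (toy is an assumption for illustration). Values that match no condition become NA unless a fallback is supplied.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one value that matches neither group
toy <- tibble(exposure = c("y", "No", "Other", NA))

# No fallback: anything unmatched, including "Other", becomes NA
no_fallback <- toy %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes"
  ))

# With fallback: unmatched values keep their original value
with_fallback <- toy %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    TRUE ~ exposure
  ))

pull(no_fallback, exposure)
pull(with_fallback, exposure)
```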

2.3

Check to see how many values exposure has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.

BloodType %>% count(exposure)

## # A tibble: 4 × 2
##   exposure     n
##   <chr>    <int>
## 1 No           8
## 2 Other        1
## 3 Yes          5
## 4 <NA>         1

2.4

Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?

BloodType <- BloodType %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-",
    type == "o.+" ~ "O.+",
    TRUE ~ type))
BloodType

## # A tibble: 15 × 3
##    exposure type  infection
##    <chr>    <chr> <chr>    
##  1 Yes      A.-   Yes      
##  2 No       AB.+  No       
##  3 Yes      O.-   Yes      
##  4 Yes      O.+   No       
##  5 No       AB.-  No       
##  6 No       B.+   No       
##  7 No       B.-   Yes      
##  8 No       O.-   No       
##  9 Yes      O.+   Yes      
## 10 Yes      A.-   No       
## 11 No       A.+   No       
## 12 No       O.-   Yes      
## 13 <NA>     B.-   Yes      
## 14 No       O.+   Yes      
## 15 Other    AB.-  NotSure
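
As an aside, because every valid value here is fully upper case, converting the whole column to upper case gives the same result. This is a shortcut sketch on made-up data, not the case_when() approach the exercise asks for, and it would not be appropriate if lower case letters were meaningful.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data with the same lower case "o" problem
toy <- tibble(type = c("o.-", "O.+", "AB.-"))

# Upper-case every value; symbols such as "." and "-" are unaffected
toy_fixed <- toy %>%
  mutate(type = str_to_upper(type))

pull(toy_fixed, type)
```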

2.5

Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”

BloodType %>% count(type)

## # A tibble: 8 × 2
##   type      n
##   <chr> <int>
## 1 A.+       1
## 2 A.-       2
## 3 AB.+      1
## 4 AB.-      2
## 5 B.+       1
## 6 B.-       2
## 7 O.+       3
## 8 O.-       3
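
Besides reading the count() output, the check can be done programmatically with %in% and setdiff(). A minimal sketch, assuming a made-up tibble (toy) and a vector of allowed values (allowed) that you define yourself:

```r
library(dplyr)
library(tibble)

# The set of values we expect to see
allowed <- c("A.-", "A.+", "AB.-", "AB.+", "B.-", "B.+", "O.-", "O.+")

# Hypothetical toy data containing one unexpected value
toy <- tibble(type = c("A.-", "o.+", "AB.+"))

# TRUE only if every value is in the allowed set
all_ok <- all(pull(toy, type) %in% allowed)

# Which values, if any, fall outside the allowed set
unexpected <- setdiff(pull(toy, type), allowed)

all_ok
unexpected
```

An empty result from setdiff() means every value is one of the expected ones.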

2.6

Make a new tibble from BloodType called BloodType_split that splits the type variable into two columns called blood_type and Rhfactor. Note: a period is a special character in regular expressions that matches any character (a wildcard), so we need “\\.” instead of simply “.” as the separating character to tell R that we mean a literal period. Make sure you use quotes around “\\.” and around the new column names, as shown below (not backticks).

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  separate(OLD_COLUMN,
           into = c("NEW_COLUMN1", "NEW_COLUMN2"),
           sep = "SEPARATING_CHARACTER")

BloodType_split <- BloodType %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
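
The effect of escaping the period can be seen on a single string with base R's strsplit(), which also treats its separator as a regular expression by default. The sketch below uses made-up values (toy and the example string are assumptions) and then applies the same separator with separate().

```r
library(dplyr)
library(tidyr)
library(tibble)

# Unescaped "." is a wildcard: every character is treated as a separator,
# so none of the original characters survive the split
wild <- strsplit("AB.-", ".")[[1]]

# Escaped period: split only on the literal "."
escaped <- strsplit("AB.-", "\\.")[[1]]

# fixed = TRUE turns off regular expressions, so "." is taken literally
fixed_split <- strsplit("AB.-", ".", fixed = TRUE)[[1]]

# Same escaped separator used with separate() on hypothetical toy data
toy <- tibble(type = c("A.-", "AB.+", "O.-"))
toy_split <- toy %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")

escaped
toy_split
```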

Practice on Your Own!

P.3

How many observations are there for each Rhfactor in the data object you just made?

count(BloodType_split, Rhfactor)

## # A tibble: 2 × 2
##   Rhfactor     n
##   <chr>    <int>
## 1 +            6
## 2 -            9

P.4

Filtering for patients with type O, how many had the infection?

BloodType_split %>%
  filter(blood_type == "O") %>%
  count(infection)

## # A tibble: 2 × 2
##   infection     n
##   <chr>     <int>
## 1 No            2
## 2 Yes           4

Data Cleaning Lab - Key
