Part 1

Data used

Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms

You can download it as a CSV to your current working directory. Note that it is also available at: http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv

library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(jhur)
library(broom)
# install.packages("naniar")
library(naniar)

Read in the bike data. You can either read it directly from the URL or download the file first.

bike <- read_csv(file = "http://jhudatascience.org/intro_to_r/data/Bike_Lanes.csv")
## Rows: 1631 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.1

Use the is.na() and any() functions to check if the bike dateInstalled variable has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.

# General format
TIBBLE %>%
  pull(COLUMN) %>%
  is.na() %>%
  any()
bike %>%
  pull(dateInstalled) %>%
  is.na() %>%
  any()
## [1] FALSE
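
As a small self-contained illustration of the same pattern, the sketch below uses a made-up tibble (`toy` and its columns `a` and `b` are hypothetical, not part of the bike data). It also shows that base R's `anyNA()` can stand in for `any(is.na(...))`.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data: column `a` is complete, column `b` has one NA
toy <- tibble(a = c(1, 2, 3), b = c("x", NA, "z"))

# Same pipe pattern as above: pull the vector, test each value, then collapse
a_has_na <- toy %>% pull(a) %>% is.na() %>% any()
b_has_na <- toy %>% pull(b) %>% is.na() %>% any()

a_has_na # FALSE
b_has_na # TRUE

# anyNA() is a base R shortcut for any(is.na(x))
b_has_na_short <- anyNA(pull(toy, b))
```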

1.2

Filter the rows of bike so that only rows WITHOUT missing values for the route variable remain, using drop_na(). Assign this to the object have_route.

have_route <- bike %>% drop_na(route)
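
To see what naming a column in drop_na() changes, here is a minimal sketch on a made-up tibble (`toy` and its values are hypothetical). Naming `route` drops only the rows where `route` is missing; calling drop_na() with no columns drops rows with a missing value in any column.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data with missing values in two different columns
toy <- tibble(
  id    = 1:4,
  route = c("NORTH", NA, "SOUTH", NA),
  block = c(NA, "A", "B", "C")
)

# Only `route` is checked, so row 1 (missing `block`) is kept
toy_route <- drop_na(toy, route)

# Every column is checked, so only the fully complete row 3 remains
toy_complete <- drop_na(toy)

nrow(toy_route)    # 2
nrow(toy_complete) # 1
```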

1.3

Use naniar to make a visual of the amount of data missing for each variable of bike (use gg_miss_var()). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/

gg_miss_var(bike)

Practice on Your Own!

P.1

What percentage of the subType variable of bike is complete? Hint: use another naniar function.

pull(bike, subType) %>% pct_complete() # this
## [1] 99.75475
miss_var_summary(bike) # or this
## # A tibble: 9 × 3
##   variable      n_miss pct_miss
##   <chr>          <int>    <num>
## 1 route           1269   77.8  
## 2 block            215   13.2  
## 3 project           74    4.54 
## 4 name              12    0.736
## 5 type               9    0.552
## 6 subType            4    0.245
## 7 numLanes           0    0    
## 8 length             0    0    
## 9 dateInstalled      0    0
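
The two outputs agree with each other. As a quick arithmetic check, the sketch below recomputes both percentages by hand from the numbers printed above (1631 rows, 4 missing subType values). The variable names are made up for this illustration.

```r
# Numbers taken from the output above
n_rows <- 1631 # rows reported by read_csv()
n_miss <- 4    # n_miss for subType in miss_var_summary()

pct_missing_by_hand  <- 100 * n_miss / n_rows
pct_complete_by_hand <- 100 - pct_missing_by_hand

round(pct_missing_by_hand, 3)  # 0.245, matches pct_miss
round(pct_complete_by_hand, 5) # 99.75475, matches pct_complete()
```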

Part 2

New Data set

Now imagine we work in a clinic and we are trying to understand more about blood types of patients.

Let’s say we have the data like so:

BloodType <- tibble(
  weight_loss =
    c(
      "Y", "No", "Yes", "y", "no",
      "n", "No", "N", "yes", "Yes",
      "No", "N", NA, "N", "Other"
    ),
  type = c(
    "A.-", "AB.+", "O.-", "O.+", "AB.-",
    "B.+", "B.-", "o.-", "O.+", "A.-",
    "A.+", "O.-", "B.-", "o.+", "AB.-"
  ),
  infection = c(
    "Yes", "No", "Yes", "No", "No",
    "No", "Yes", "No", "Yes", "No",
    "No", "Yes", "Yes", "Yes", "NotSure"
  )
)

BloodType
## # A tibble: 15 × 3
##    weight_loss type  infection
##    <chr>       <chr> <chr>    
##  1 Y           A.-   Yes      
##  2 No          AB.+  No       
##  3 Yes         O.-   Yes      
##  4 y           O.+   No       
##  5 no          AB.-  No       
##  6 n           B.+   No       
##  7 No          B.-   Yes      
##  8 N           o.-   No       
##  9 yes         O.+   Yes      
## 10 Yes         A.-   No       
## 11 No          A.+   No       
## 12 N           O.-   Yes      
## 13 <NA>        B.-   Yes      
## 14 N           o.+   Yes      
## 15 Other       AB.-  NotSure

There are some issues with this data that we need to figure out!

2.1

Determine how many NA values there are for weight_loss (assume you know that N and n stand for no).

count(BloodType, weight_loss) # the simple way
## # A tibble: 10 × 2
##    weight_loss     n
##    <chr>       <int>
##  1 N               3
##  2 No              3
##  3 Other           1
##  4 Y               1
##  5 Yes             2
##  6 n               1
##  7 no              1
##  8 y               1
##  9 yes             1
## 10 <NA>            1
sum(is.na(pull(BloodType, weight_loss))) # another way
## [1] 1
BloodType %>% # another way
  pull(weight_loss) %>%
  is.na() %>%
  sum()
## [1] 1

2.2

Recode the weight_loss variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  mutate(NEW_COLUMN = case_when(
    OLD_COLUMN %in% c( ... ) ~ ... ,
    OLD_COLUMN %in% c( ... ) ~ ... ,
    TRUE ~ OLD_COLUMN
  ))
BloodType <- BloodType %>%
  mutate(weight_loss = case_when(
    weight_loss %in% c("N", "n", "No", "no") ~ "No",
    weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    TRUE ~ weight_loss # keeps all remaining values ("Other" and NA) as they are; without this line "Other" would become NA
  ))

count(BloodType, weight_loss)
## # A tibble: 4 × 2
##   weight_loss     n
##   <chr>       <int>
## 1 No              8
## 2 Other           1
## 3 Yes             5
## 4 <NA>            1
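
To see why the final `TRUE ~ weight_loss` line matters here, the sketch below recodes a small made-up column twice (`toy`, `with_fallback` and `no_fallback` are hypothetical names). Without the fallback, any value that matches no condition, including "Other", is turned into NA.

```r
library(tibble)
library(dplyr)

# Hypothetical toy column with one unmatched value ("Other") and one NA
toy <- tibble(weight_loss = c("Y", "no", "Other", NA))

recoded <- toy %>%
  mutate(
    # With the fallback: unmatched values are passed through unchanged
    with_fallback = case_when(
      weight_loss %in% c("N", "n", "No", "no") ~ "No",
      weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
      TRUE ~ weight_loss
    ),
    # Without the fallback: unmatched values become NA
    no_fallback = case_when(
      weight_loss %in% c("N", "n", "No", "no") ~ "No",
      weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes"
    )
  )

recoded
```

In `with_fallback` the third value stays "Other"; in `no_fallback` it is NA, so it can no longer be told apart from the value that was missing to begin with.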

2.3

Check to see how many values weight_loss has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.

BloodType %>% count(weight_loss)
## # A tibble: 4 × 2
##   weight_loss     n
##   <chr>       <int>
## 1 No              8
## 2 Other           1
## 3 Yes             5
## 4 <NA>            1

2.4

Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?

BloodType <- BloodType %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-",
    type == "o.+" ~ "O.+",
    TRUE ~ type))
BloodType
## # A tibble: 15 × 3
##    weight_loss type  infection
##    <chr>       <chr> <chr>    
##  1 Yes         A.-   Yes      
##  2 No          AB.+  No       
##  3 Yes         O.-   Yes      
##  4 Yes         O.+   No       
##  5 No          AB.-  No       
##  6 No          B.+   No       
##  7 No          B.-   Yes      
##  8 No          O.-   No       
##  9 Yes         O.+   Yes      
## 10 Yes         A.-   No       
## 11 No          A.+   No       
## 12 No          O.-   Yes      
## 13 <NA>        B.-   Yes      
## 14 No          O.+   Yes      
## 15 Other       AB.-  NotSure

2.5

Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”

BloodType %>% count(type)
## # A tibble: 8 × 2
##   type      n
##   <chr> <int>
## 1 A.+       1
## 2 A.-       2
## 3 AB.+      1
## 4 AB.-      2
## 5 B.+       1
## 6 B.-       2
## 7 O.+       3
## 8 O.-       3
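
Besides reading the count() table, the check can be done programmatically. The sketch below is one possible approach in base R, assuming the allowed values are written with the period separator as they appear in the data. `raw_type` repeats the original type column from the tibble definition above, and `sub()` is used only as a stand-in for the case_when() recode.

```r
# Allowed values, written with the "." separator used in the data
allowed <- c("A.-", "A.+", "AB.-", "AB.+", "B.-", "B.+", "O.-", "O.+")

# The type column as originally entered (before recoding)
raw_type <- c(
  "A.-", "AB.+", "O.-", "O.+", "AB.-",
  "B.+", "B.-", "o.-", "O.+", "A.-",
  "A.+", "O.-", "B.-", "o.+", "AB.-"
)

# Values that are present but not allowed
unexpected_raw <- setdiff(raw_type, allowed)
unexpected_raw # "o.-" "o.+"

# Stand-in for the case_when() recode: replace a leading lower case o
fixed_type <- sub("^o", "O", raw_type)

# TRUE only if every value is one of the allowed values
all_ok <- all(fixed_type %in% allowed)
all_ok # TRUE
```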

2.6

Make a new tibble of BloodType called BloodType_split that splits the type variable into two variables called blood_type and Rhfactor. Note: in a regular expression a period is a special character that acts as a wild card (it matches any character), so we need “\\.” instead of simply “.” as the separating character to tell R to interpret it as a literal period. Make sure you use quotes around “\\.” and the column names as shown below (not backticks).

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  separate(OLD_COLUMN,
           into = c("NEW_COLUMN1", "NEW_COLUMN2"),
           sep = "SEPARATING_CHARACTER")
BloodType_split <- BloodType %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
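
The difference between “.” and “\\.” is easiest to see on a single string. The sketch below uses base R's strsplit(), which treats its split argument as a regular expression by default, and then repeats the separate() call on a two-row made-up tibble (`toy` is hypothetical).

```r
library(tibble)
library(tidyr)

# Unescaped "." matches every character, so nothing useful is left over
wild <- strsplit("AB.+", ".")[[1]]

# Escaped "\\." matches only a literal period
escaped <- strsplit("AB.+", "\\.")[[1]]
escaped # "AB" "+"

# fixed = TRUE turns off regular expressions, so "." is taken literally
literal <- strsplit("AB.+", ".", fixed = TRUE)[[1]]
literal # "AB" "+"

# The same escaped separator inside separate()
toy <- tibble(type = c("AB.+", "O.-"))
toy_split <- toy %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
toy_split
```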

Practice on Your Own!

P.2

How many observations are there for each Rhfactor in the data object you just made?

count(BloodType_split, Rhfactor)
## # A tibble: 2 × 2
##   Rhfactor     n
##   <chr>    <int>
## 1 +            6
## 2 -            9

P.3

Filtering for patients with type O, how many had the infection?

BloodType_split %>%
  filter(blood_type == "O") %>%
  count(infection)
## # A tibble: 2 × 2
##   infection     n
##   <chr>     <int>
## 1 No            2
## 2 Yes           4
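
If only the single number is wanted, both conditions can go into one filter() call. The sketch below is a minimal illustration on a hand-copied subset of the split data (rows 1, 2, 3, 4, 8, 9, 12 and 14 of the recoded table; `subset_rows` is a hypothetical name), not the full BloodType_split object.

```r
library(tibble)
library(dplyr)

# Hand-copied subset: two non-O rows plus all six type O rows
subset_rows <- tibble(
  blood_type = c("A", "AB", "O", "O", "O", "O", "O", "O"),
  infection  = c("Yes", "No", "Yes", "No", "No", "Yes", "Yes", "Yes")
)

# Conditions separated by commas in filter() are combined with AND
n_type_o_infected <- subset_rows %>%
  filter(blood_type == "O", infection == "Yes") %>%
  nrow()

n_type_o_infected # 4
```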