In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
⚠️ MOST IMPORTANT RULE - LOOK 👀 AT YOUR DATA! ⚠️
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
⚠️ MOST IMPORTANT RULE - LOOK 👀 AT YOUR DATA! ⚠️
One of the most important aspects of data cleaning is missing values.
Types of “missing” data:
NA - general missing dataNaN - stands for “Not a Number”, happens when you do 0/0.Inf and -Inf - Infinity, happens when you take a positive number (or negative number) by 0.Each missing data type has a function that returns TRUE if the data is missing:
NA - is.naNaN - is.nanInf and -Inf - is.infiniteis.na - is TRUE if the data is FALSE otherwise! - negation (NOT)
is.na(x) is TRUE, then !is.na(x) is FALSEany will be TRUE if ANY are true
any(is.na(x)) - do we have any NA’s in x?A = c(1, 2, 4, NA) B = c(1, 2, 3, 4) any(is.na(A)) # are there any NAs - YES/TRUE
[1] TRUE
any(is.na(B)) # are there any NAs- NO/FALSE
[1] FALSE
Sometimes you need to look at lots of data though… the naniar package is a good option.
The pct_complete() function shows the percentage that is complete for a given data object.
#install.packages("naniar")
library(naniar)
x = c(0, NA, 2, 3, 4, -0.5, 0.2)
naniar::pct_complete(x)
[1] 85.71429
The airquality dataset comes with R about air quality in New York in 1973.
?airquality # use this to find out more about the data airqual <-tibble(airquality) airqual
# A tibble: 153 × 6 Ozone Solar.R Wind Temp Month Day <int> <int> <dbl> <int> <int> <int> 1 41 190 7.4 67 5 1 2 36 118 8 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 7 23 299 8.6 65 5 7 8 19 99 13.8 59 5 8 9 8 19 20.1 61 5 9 10 NA 194 8.6 69 5 10 # … with 143 more rows
pct_complete()pct_complete(airquality)
[1] 95.20697
The gg_miss_var() function creates a nice plot about the number of missing values for each variable.
naniar::gg_miss_var(airqual)
We can use the facet argument to make more plots about a specific variable.
naniar::gg_miss_var(airqual, facet = Month)
Recall that mathematical operations with NA often result in NAs.
sum(c(1,2,3,NA))
[1] NA
mean(c(2,4,NA))
[1] NA
median(c(1,2,3,NA))
[1] NA
This is also true for logicals. This is a good thing. The NA data could be > 2 or not, we don’t know, so R says there is no TRUE or FALSE, so that is missing.
x = c(0, NA, 2, 3, 4, -0.5, 0.2) x > 2
[1] FALSE NA FALSE TRUE TRUE FALSE FALSE
Be careful with missing data using subsetting:
filter() removes missing values by default. To keep them need to add is.na():
x # looks like the 1st and 3rd element should be TRUE
[1] 0.0 NA 2.0 3.0 4.0 -0.5 0.2
x %in% c(0, 2) # uh oh - not good!
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
x %in% c(0, 2) | is.na(x) # do this
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE
df
# A tibble: 6 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 NA 8
3 2 6
4 3 NA
5 1 2
6 1 NA
df %>% filter(Dog < 3)
# A tibble: 4 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 2 6
3 1 2
4 1 NA
drop_na()Avoid using filter for NA values. Instead use drop_na()
df %>% drop_na(Dog)
# A tibble: 5 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 2 6
3 3 NA
4 1 2
5 1 NA
!NAdoes not work as you might expect because you can't tell if something is not actuallyNA- R doesn't ever assume to know what the value ofNA` is
NA == NA
[1] NA
NA != NA
[1] NA
This function will drop rows with any missing data in any column when used on a df.
df
# A tibble: 6 × 2
Dog Cat
<dbl> <dbl>
1 0 NA
2 NA 8
3 2 6
4 3 NA
5 1 2
6 1 NA
drop_na(df)
# A tibble: 2 × 2
Dog Cat
<dbl> <dbl>
1 2 6
2 1 2
NASometimes removing NA values leads to distorted math - be careful! Think about what your NA means for your data (are you sure ?).
Is an NA for values so low they could not be reported? Or is it this and also if there was a different issue?
NAIf it is something more like a zero then you might want it included in your data like a zero.
Example: - survey reports NA if student has never tried cigarettes - survey reports 0 if student has tried cigarettes but did not smoke that week
You might want to keep the NA values so that you know the original sample size.
Calculating percentages will give you a different result depending on your choice to include NA values.
red_blue
# A tibble: 3 × 2 color col_count <chr> <int> 1 blue 3 2 red 3 3 <NA> 3
red_blue %>% mutate(percent =
col_count/sum(pull(red_blue, col_count)))
# A tibble: 3 × 3 color col_count percent <chr> <int> <dbl> 1 blue 3 0.333 2 red 3 0.333 3 <NA> 3 0.333
red_blue %>% mutate(percent =
col_count/sum(pull(drop_na(red_blue), col_count)))
# A tibble: 3 × 3 color col_count percent <chr> <int> <dbl> 1 blue 3 0.5 2 red 3 0.5 3 <NA> 3 0.5
# Should you be dividing by 9 or 6? It depends on your data # Pay attention to your data and your NAs!
Check the values for your variables, are they what you expect?
count() is a great option because it gives tells you:
Check if rare values make sense
bike <-jhur::read_bike() bike %>% count(subType)
# A tibble: 4 × 2 subType n <chr> <int> 1 STCLN 1 2 STRALY 3 3 STRPRD 1623 4 <NA> 4
Say we have some data about samples in a diet study:
data_diet
# A tibble: 12 × 4 Diet Gender Weight_start Weight_change <chr> <chr> <int> <int> 1 A Male 139 -1 2 B m 105 13 3 B Other 203 -9 4 A F 160 -3 5 B Female 102 12 6 B M 233 0 7 A f 151 -6 8 B O 183 11 9 B Man 118 -4 10 A f 189 1 11 B F 132 10 12 B O 104 15
This needs lots of recoding.
data_diet %>% count(Gender, Diet)
# A tibble: 10 × 3 Gender Diet n <chr> <chr> <int> 1 f A 2 2 F A 1 3 F B 1 4 Female B 1 5 m B 1 6 M B 1 7 Male A 1 8 Man B 1 9 O B 2 10 Other B 1
dplyr can help!Using Excel to find all of the different ways gender has been coded, would be a matter of filtering and changing all by hand or using if statements. This can be hectic!
In dplyr you can use the recode function (need mutate here too!):
# General Format - this is not code!
{data_input} %>%
mutate({variable_to_fix} = {Variable_fixing, {old_value} = {new_value},
{another_old_value} = {new_value})
data_diet %>%
mutate(Gender = recode(Gender, M = "Male",
m = "Male",
Man = "Male",
O = "Other",
f = "Female",
F = "Female")) %>%
count(Gender, Diet)
case_when().The case_when() function of dplyr can help us to do this as well.
# General Format - this is not code!
{data_input} %>%
mutate({variable_to_fix} = case_when{Variable_fixing}condition
~ {value_for_cond}))
Note that automatically values not reassigned explicitly by case_when will be NA.
case_when()data_diet %>% mutate(Gender = case_when(Gender =="M" ~ "Male"))
# A tibble: 12 × 4 Diet Gender Weight_start Weight_change <chr> <chr> <int> <int> 1 A <NA> 139 -1 2 B <NA> 105 13 3 B <NA> 203 -9 4 A <NA> 160 -3 5 B <NA> 102 12 6 B Male 233 0 7 A <NA> 151 -6 8 B <NA> 183 11 9 B <NA> 118 -4 10 A <NA> 189 1 11 B <NA> 132 10 12 B <NA> 104 15
data_diet %>%
mutate(Gender = case_when(
Gender %in% c("M", "male", "Man", "m", "Male") ~ "Male",
Gender %in% c("F", "Female", "f", "female")~ "Female",
Gender %in% c("O", "Other") ~ "Other"))
# A tibble: 12 × 4 Diet Gender Weight_start Weight_change <chr> <chr> <int> <int> 1 A Male 139 -1 2 B Male 105 13 3 B Other 203 -9 4 A Female 160 -3 5 B Female 102 12 6 B Male 233 0 7 A Female 151 -6 8 B Other 183 11 9 B Male 118 -4 10 A Female 189 1 11 B Female 132 10 12 B Other 104 15
case_when()case_when can do very sophisticated comparisons
data_diet <-data_diet %>%
mutate(Effect = case_when(Weight_change > 0 ~ "Increase",
Weight_change == 0 ~ "Same",
Weight_change < 0 ~ "Decrease"))
head(data_diet)
# A tibble: 6 × 5 Diet Gender Weight_start Weight_change Effect <chr> <chr> <int> <int> <chr> 1 A Male 139 -1 Decrease 2 B m 105 13 Increase 3 B Other 203 -9 Decrease 4 A F 160 -3 Decrease 5 B Female 102 12 Increase 6 B M 233 0 Same
# A tibble: 5 × 3 Diet Effect n <chr> <chr> <int> 1 A Decrease 3 2 A Increase 1 3 B Decrease 2 4 B Increase 5 5 B Same 1
diet_comb
# A tibble: 5 × 2 change n <chr> <int> 1 A_Decrease 3 2 A_Increase 1 3 B_Decrease 2 4 B_Increase 5 5 B_Same 1
tidyr, you can split a data set into multiple columns:diet_comb %>%
separate(change, into = c("Diet", "Change"))
# A tibble: 5 × 3 Diet Change n <chr> <chr> <int> 1 A Decrease 3 2 A Increase 1 3 B Decrease 2 4 B Increase 5 5 B Same 1
You can specify the separator with sep.
diet_comb %>%
separate(change, into = c("Diet", "Change"), sep = " ")
# A tibble: 5 × 3 Diet Change n <chr> <chr> <int> 1 A_diet Decrease 3 2 A_diet Increase 1 3 B_diet Decrease 2 4 B_diet Increase 5 5 B_diet Same 1
tidyr, you can unite:df = tibble(id = rep(1:5, 3), visit = rep(1:3, each = 5)) head(df, 4)
# A tibble: 4 × 2
id visit
<int> <int>
1 1 1
2 2 1
3 3 1
4 4 1
df_united <- df %>% unite(col = "unique_id", id, visit, sep = "_") head(df_united, 4)
# A tibble: 4 × 1 unique_id <chr> 1 1_1 2 2_1 3 3_1 4 4_1
stringr packageThe stringr package:
grep or gsub - base R functions
str_*stringrstr_detect, and str_replace search for matches to argument pattern within each element of a character vector (not data frame or tibble!).
str_detect - returns TRUE if pattern is found str_replace - replaces pattern with replacementFrom https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53, from https://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv
Read the CSV into R Sal:
Sal = jhur::read_salaries() # or
head(Sal)
# A tibble: 6 × 7 name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Aaron,Patricia G Faciliti… A03031 OED-Em… 10/24/1… $55314.00 $53626.… 2 Aaron,Petra L ASSISTAN… A29045 States… 09/25/2… $74000.00 $73000.… 3 Abaineh,Yohannes T EPIDEMIO… A65026 HLTH-H… 07/23/2… $64500.00 $64403.… 4 Abbene,Anthony M POLICE O… A99005 Police… 07/24/2… $46309.00 $59620.… 5 Abbey,Emmanuel CONTRACT… A40001 M-R In… 05/01/2… $60060.00 $54059.… 6 Abbott-Cole,Michelle CONTRACT… A90005 TRANS-… 11/28/2… $42702.00 $20250.…
str_detect() function: finding values: stringrSal %>% filter(str_detect(name, "Rawlings"))
# A tibble: 3 × 7 name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Rawlings,Ke… EMERGENCY D… A40302 M-R Info Te… 01/06/2… $48940.00 $73356.… 2 Rawlings,Pa… COMMUNITY A… A04015 R&P-Recreat… 12/10/2… $19802.00 $10443.… 3 Rawlings-Bl… MAYOR A01001 Mayors Offi… 12/07/1… $167449.00 $165249…
str_replace and str_replace_allstr_replace replaces only the first instance.
head(pull(Sal, JobTitle))
[1] "Facilities/Office Services II" "ASSISTANT STATE'S ATTORNEY" [3] "EPIDEMIOLOGIST" "POLICE OFFICER" [5] "CONTRACT SERV SPEC II" "CONTRACT SERV SPEC II"
head(str_replace(pull(Sal, JobTitle), "II", "2"))
[1] "Facilities/Office Services 2" "ASSISTANT STATE'S ATTORNEY" [3] "EPIDEMIOLOGIST" "POLICE OFFICER" [5] "CONTRACT SERV SPEC 2" "CONTRACT SERV SPEC 2"
str_replace replaces all instances.
head(str_replace_all(pull(Sal, name), "a", "j"), 2)
[1] "Ajron,Pjtricij G" "Ajron,Petrj L"
str_split(string, pattern) - splits strings up - returns list!library(stringr)
x <- c("I really like writing R code")
df = tibble(x = c("I really", "like writing", "R code programs"))
y <- unlist(str_split(x, " "))
y
[1] "I" "really" "like" "writing" "R" "code"
length(y)
[1] 6
. matches any single character* means repeat as many (even if 0) more times the last character? makes the last thing optional^ matches start of vector ^a - starts with “a”$ matches end of vector b$ - ends with “b”stringr?modifiers
fixed - match everything exactly ignore_case is an option to not have to use tolowerOne example case is when you want to split on a period “.”. In regular expressions . means ANY character, so we need to specify that we want R to interpret “.” as simply a period.
str_split("I.like.strings", ".")
[[1]] [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed("."))
[[1]] [1] "I" "like" "strings"
str_split("I.like.strings", "\\.")
[[1]] [1] "I" "like" "strings"
paste and paste0Paste can be very useful for joining vectors together:
paste("Visit", 1:5, sep = "_")
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
paste("Visit", 1:5, sep = "_", collapse = "_")
[1] "Visit_1_Visit_2_Visit_3_Visit_4_Visit_5"
# and paste0 can be even simpler see ?paste0
paste0("Visit",1:5) # no space!
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
!– # Before Cleaning - Subsetting with Brackets –>
–>
–> –> –>
head(str_subset( Sal$name, "^Payne.*"), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(str_subset( Sal$name, "Leonard.?S"))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(str_subset( Sal$name, "Spence.*C.*"))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
stringr to base R - not coveredstringr
str_split(string, pattern) - splits strings up - returns list!In stringr, str_split splits a vector on a string into a list
x <- c("I really", "like writing", "R code programs")
y <- stringr::str_split(x, pattern = " ") # returns a list
y
[[1]] [1] "I" "really" [[2]] [1] "like" "writing" [[3]] [1] "R" "code" "programs"
str_extractstr_extract extracts matched strings - \\d searches for DIGITS/numbers
head(Sal$AgencyID)
[1] "A03031" "A29045" "A65026" "A99005" "A40001" "A90005"
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
Base R does not use these functions. Here is a “translator” of the stringr function to base R functions
str_detect - similar to grepl (return logical)grep(value = FALSE) is similar to which(str_detect())str_subset - similar to grep(value = TRUE) - return value of matchedstr_replace - similar to sub - replace one timestr_replace_all - similar to gsub - replace many timesBase R:
(pattern, x)(fixed = TRUE)stringr
(string, pattern) aka (x, pattern)fixed(pattern)These are the indices where the pattern match occurs:
grep("Rawlings",Sal$Name)
Warning: Unknown or uninitialised column: `Name`.
integer(0)
which(grepl("Rawlings", Sal$Name))
Warning: Unknown or uninitialised column: `Name`.
integer(0)
which(str_detect(Sal$Name, "Rawlings"))
Warning: Unknown or uninitialised column: `Name`.
integer(0)
These are the indices where the pattern match occurs:
head(grepl("Rawlings",Sal$Name))
Warning: Unknown or uninitialised column: `Name`.
logical(0)
head(str_detect(Sal$Name, "Rawlings"))
Warning: Unknown or uninitialised column: `Name`.
logical(0)
grep("Rawlings",Sal$Name,value=TRUE)
Warning: Unknown or uninitialised column: `Name`.
character(0)
Sal[grep("Rawlings",Sal$Name),]
Warning: Unknown or uninitialised column: `Name`.
# A tibble: 0 × 7 # … with 7 variables: name <chr>, JobTitle <chr>, AgencyID <chr>, Agency <chr>, # HireDate <chr>, AnnualSalary <chr>, GrossPay <chr>
str_extractstr_extract extracts just the matched string
ss = str_extract(Sal$Name, "Rawling")
Warning: Unknown or uninitialised column: `Name`.
head(ss)
character(0)
ss[ !is.na(ss)]
character(0)
str_extract and str_extract_allstr_extract_all extracts all the matched strings
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "0" "3" "0" "3" "1" [[2]] [1] "2" "9" "0" "4" "5"
head(grep("^Payne.*", x = Sal$name, value = TRUE), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(grep("Leonard.?S", x = Sal$name, value = TRUE))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(grep("Spence.*C.*", x = Sal$name, value = TRUE))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
stringrhead(str_subset( Sal$name, "^Payne.*"), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(str_subset( Sal$name, "Leonard.?S"))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(str_subset( Sal$name, "Spence.*C.*"))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
Let’s say we wanted to sort the data set by Annual Salary:
class(Sal$AnnualSalary)
[1] "character"
sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
So we must change the annual pay into a numeric:
head(Sal$AnnualSalary, 4)
[1] "$55314.00" "$74000.00" "$64500.00" "$46309.00"
head(as.numeric(Sal$AnnualSalary), 4)
Warning in head(as.numeric(Sal$AnnualSalary), 4): NAs introduced by coercion
[1] NA NA NA NA
R didn’t like the $ so it thought turned them all to NA.
sub() and gsub() can do the replacing part in base R.
Now we can replace the $ with nothing (used fixed=TRUE because $ means ending):
Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="",
Sal$AnnualSalary, fixed=TRUE))
Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ]
Sal[1:5, c("name", "AnnualSalary", "JobTitle")]
# A tibble: 5 × 3 name AnnualSalary JobTitle <chr> <dbl> <chr> 1 Mosby,Marilyn J 238772 STATE'S ATTORNEY 2 Batts,Anthony W 211785 Police Commissioner 3 Wen,Leana 200000 Executive Director III 4 Raymond,Henry J 192500 Executive Director III 5 Swift,Michael 187200 CONTRACT SERV SPEC II
stringrWe can do the same thing (with 2 piping operations!) in dplyr
dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate(
AnnualSalary = AnnualSalary %>%
str_replace(
fixed("$"),
"") %>%
as.numeric) %>%
arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)
[1] TRUE
A two-way table. If you pass in 2 vectors, table creates a 2-dimensional table.
tab <- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3),
useNA = "always")
tab
0 1 2 3 4 <NA>
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 0 2 0 2 0
3 0 0 0 4 0 0
<NA> 0 0 0 0 0 0
tab_df = tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
y = c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3))
tab_df %>% count(x, y)
# A tibble: 5 × 3
x y n
<dbl> <dbl> <int>
1 0 0 1
2 1 1 1
3 2 2 2
4 2 4 2
5 3 3 4
tab_df %>% count(x, y) %>% group_by(x) %>% mutate(pct_x = n / sum(n))
# A tibble: 5 × 4
# Groups: x [4]
x y n pct_x
<dbl> <dbl> <int> <dbl>
1 0 0 1 1
2 1 1 1 1
3 2 2 2 0.5
4 2 4 2 0.5
5 3 3 4 1
library(scales) tab_df %>% count(x, y) %>% group_by(x) %>% mutate(pct_x = percent(n / sum(n)))
# A tibble: 5 × 4
# Groups: x [4]
x y n pct_x
<dbl> <dbl> <int> <chr>
1 0 0 1 100%
2 1 1 1 100%
3 2 2 2 50%
4 2 4 2 50%
5 3 3 4 100%