So far we’ve seen many functions, like c(), class(), filter(), dim() …
Why create your own functions?
There may be code that you use multiple times. Creating a function can help cut down on repetitive code (and the chance for copy/paste errors).
data_insights <- function(x, column1, column2) {
x_insight <- x %>%
group_by({{column1}}) %>%
summarize(mean = mean({{column2}}, na.rm = TRUE))
return(x_insight)
}
data_insights(x = mtcars, column1 = cyl, column2 = hp)
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 82.6
2 6 122.
3 8 209.
You may have a similar plot that you want to examine across columns of data.
simple_plots <- function(x, column1, column2) {
box_plot <- ggplot(data = x, aes(x = {{column1}}, y = {{column2}}, group = {{column1}})) +
geom_boxplot()
return(box_plot)
}
simple_plots(x = mtcars, column1 = cyl, column2 = hp)
The general syntax for a function is:
function_name <- function(arg1, arg2, ...) {
<function body>
}
Here we will write a function that multiplies some number x by 2:
times_2 <- function(x) x * 2
When you run the line of code above, you make it ready to use (no output yet!). Let’s test it!
times_2(x = 10)
[1] 20
Adding curly brackets, {}, lets you write functions that span multiple lines:
times_2 <- function(x) {
x * 2
}
times_2(x = 10)
[1] 20
is_even <- function(x) {
x %% 2 == 0
}
is_even(x = 11)
[1] FALSE
is_even(x = times_2(x = 10))
[1] TRUE
If we want something specific for the function’s output, we use return():
times_2_plus_4 <- function(x) {
output_int <- x * 2
output <- output_int + 4
return(output)
}
times_2_plus_4(x = 10)
[1] 24
If return() is not called, the last evaluated expression is returned.
return() should be the last step (any steps after it are skipped).
times_2_plus_4 <- function(x) {
output_int <- x * 2
output <- output_int + 4
print(paste("times2 result = ", output_int))
return(output)
}
result <- times_2_plus_4(x = 10)
[1] "times2 result = 20"
result
[1] 24
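A minimal sketch of the two return behaviors described above. The function names add_one_implicit and early_exit are hypothetical and used only for illustration:

```r
# Without return(), the value of the last evaluated expression is returned
add_one_implicit <- function(x) {
  x + 1
}

# Code placed after return() never runs
early_exit <- function(x) {
  return(x * 2)
  x * 100 # never reached
}

add_one_implicit(x = 1) # 2
early_exit(x = 10)      # 20
```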
Functions can take multiple inputs:
times_2_plus_y <- function(x, y) x * 2 + y
times_2_plus_y(x = 10, y = 3)
[1] 23
Functions can have one returned result with multiple outputs.
x_and_y_plus_2 <- function(x, y) {
output1 <- x + 2
output2 <- y + 2
return(c(output1, output2))
}
result <- x_and_y_plus_2(x = 10, y = 3)
result
[1] 12 5
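c() works when all outputs share a type. As a hedged sketch, a named list is one way to return outputs of different types together; the function x_and_y_summary is hypothetical:

```r
# A list can hold outputs of different types under separate names
x_and_y_summary <- function(x, y) {
  list(sum = x + y, x_is_larger = x > y)
}

result <- x_and_y_summary(x = 10, y = 3)
result$sum         # 13
result$x_is_larger # TRUE
```

With c(), the logical value would be coerced to a number, so the list keeps each output in its original type.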
Functions can have “default” arguments. This lets us use the function without using an argument later:
times_2_plus_y <- function(x = 10, y = 3) x * 2 + y
times_2_plus_y()
[1] 23
times_2_plus_y(x = 11, y = 4)
[1] 26
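Defaults can also be overridden one at a time. A short sketch using the same times_2_plus_y definition:

```r
times_2_plus_y <- function(x = 10, y = 3) x * 2 + y

only_y <- times_2_plus_y(y = 5) # x keeps its default of 10, giving 25
only_x <- times_2_plus_y(x = 1) # y keeps its default of 3, giving 5
by_position <- times_2_plus_y(4) # an unnamed argument is matched to x, giving 11
```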
Let’s write a function, sqdif, that takes x and y with default values of 2 and 3.
Functions can have any kind of input. Here is a function with characters:
loud <- function(word) {
output <- rep(toupper(word), 5)
return(output)
}
loud(word = "hooray!")
[1] "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!"
select(n) will choose column n:
get_index <- function(dat, row, col) {
dat %>%
filter(row_number() == row) %>%
select(all_of(col))
}
get_index(dat = iris, row = 10, col = 5)
  Species
1  setosa
Including default values for arguments:
get_top <- function(dat, row = 1, col = 1) {
dat %>%
filter(row_number() == row) %>%
select(all_of(col))
}
get_top(dat = iris)
  Sepal.Length
1          5.1
You can create a function with an argument that takes a column name for select() or another dplyr operation:
clean_dataset <- function(dataset, col_name) {
my_data_out <- dataset %>% select({{col_name}}) # Note the curly braces {{}}
write_csv(my_data_out, "clean_data.csv")
return(my_data_out)
}
clean_dataset(dataset = mtcars, col_name = "cyl")
                    cyl
Mazda RX4             6
Mazda RX4 Wag         6
Datsun 710            4
Hornet 4 Drive        6
Hornet Sportabout     8
Valiant               6
Duster 360            8
Merc 240D             4
Merc 230              4
Merc 280              6
Merc 280C             6
Merc 450SE            8
Merc 450SL            8
Merc 450SLC           8
Cadillac Fleetwood    8
Lincoln Continental   8
Chrysler Imperial     8
Fiat 128              4
Honda Civic           4
Toyota Corolla        4
Toyota Corona         4
Dodge Challenger      8
AMC Javelin           8
Camaro Z28            8
Pontiac Firebird      8
Fiat X1-9             4
Porsche 914-2         4
Lotus Europa          4
Ford Pantera L        8
Ferrari Dino          6
Maserati Bora         8
Volvo 142E            4
# Another example: get means and missing for a specific column
get_summary <- function(dataset, col_name) {
dataset %>%
summarise(mean = mean({{col_name}}, na.rm = TRUE),
na_count = sum(is.na({{col_name}})))
}
get_summary(mtcars, hp)
      mean na_count
1 146.6875        0
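The same {{ }} pattern works in other dplyr verbs. A sketch with filter(), where filter_above and its cutoff argument are hypothetical names chosen for illustration:

```r
library(dplyr)

# Keep rows where the chosen column is above a cutoff
filter_above <- function(dataset, col_name, cutoff) {
  dataset %>%
    filter({{ col_name }} > cutoff)
}

high_hp <- filter_above(mtcars, hp, cutoff = 200)
nrow(high_hp)
```

The column is passed unquoted (hp, not "hp"), and {{ }} tells dplyr to look for it inside the data frame.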
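Simple functions take the form NEW_FUNCTION <- function(x, y){x + y}
Arguments can have default values: function(x = 1, y = 2){x + y}
return will provide a value as output
print will simply print the value on the screen but not save it
Use {{double curly braces}} to pass column names into tidyverse functions
💻 Lab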
sapply() - a base R function
Now that you’ve made a function… you can “apply” functions easily with sapply()!
These functions take the form:
sapply(<a vector, list, data frame>, some_function)
🚨 There are no parentheses on the functions! 🚨
You can also pipe into your function.
head(iris, n = 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
iris %>% sapply(class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
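sapply() works the same way on plain vectors and lists. A small sketch reusing times_2 from earlier, with made-up input values:

```r
times_2 <- function(x) x * 2

# Pass the function by name, without parentheses
doubled <- sapply(c(1, 2, 3), times_2)
doubled # 2 4 6

# Each list element is handed to the function in turn
totals <- sapply(list(1:3, 4:6), sum)
totals # 6 15
```

Writing times_2() with parentheses would try to call the function immediately, with no input, instead of handing it to sapply().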
cars <- read_csv("https://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv")
select(cars, VehYear:VehicleAge) %>% head()
# A tibble: 6 × 2
VehYear VehicleAge
<dbl> <dbl>
1 2006 3
2 2004 5
3 2005 4
4 2004 5
5 2005 4
6 2004 5
select(cars, VehYear:VehicleAge) %>% sapply(times_2) %>% head()
     VehYear VehicleAge
[1,]    4012          6
[2,]    4008         10
[3,]    4010          8
[4,]    4008         10
[5,]    4010          8
[6,]    4008         10
A function written on the fly, without a name, is also called an “anonymous function”.
select(cars, VehYear:VehicleAge) %>% sapply(function(x) x / 1000) %>% head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005
select(cars, VehYear:VehicleAge) %>% sapply(\(x) x / 1000) %>% head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005
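The three ways of supplying the function give the same result. A sketch on a small made-up vector; divide_by_1000 is a hypothetical name, and the \(x) shorthand assumes R 4.1.0 or later:

```r
values <- c(2006, 2004)

# 1. A named function defined ahead of time
divide_by_1000 <- function(x) x / 1000
named <- sapply(values, divide_by_1000)

# 2. An anonymous function written with function(x)
long_form <- sapply(values, function(x) x / 1000)

# 3. The same anonymous function with the \(x) shorthand
short_form <- sapply(values, \(x) x / 1000)

identical(named, long_form)      # TRUE
identical(long_form, short_form) # TRUE
```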
mutate() and summarize()
We already know how to use functions to modify columns using mutate() or calculate summary statistics using summarize().
cars %>%
mutate(VehOdo_round = round(VehOdo, -3)) %>%
summarize(max_Odo_round = max(VehOdo_round),
max_Odo = max(VehOdo))
# A tibble: 1 × 2
max_Odo_round max_Odo
<dbl> <dbl>
1 116000 115717
across() function
Image by Allison Horst.
across() from dplyr makes it easy to apply the same transformation to multiple columns. It is usually used with summarize() or mutate().
summarize(across(.cols = <columns>, .fns = function))
or
mutate(across(.cols = <columns>, .fns = function))
List the columns first: .cols =
List the function next: .fns =
If the function needs extra arguments (e.g., na.rm = TRUE), the function may need to be modified to an anonymous function, e.g., \(x) mean(x, na.rm = TRUE)
Combining with summarize():
cars_dbl <- cars %>% select(Make, starts_with("Veh"))
cars_dbl %>%
summarize(across(.cols = everything(), .fns = mean)) # no parentheses
# A tibble: 1 × 5
   Make VehYear VehicleAge VehOdo VehBCost
  <dbl>   <dbl>      <dbl>  <dbl>    <dbl>
1    NA   2005.       4.18 71500.    6731.
across() can be used with other tidyverse functions like group_by()!
cars_dbl %>% group_by(Make) %>% summarize(across(.cols = everything(), .fns = mean)) # no parentheses
# A tibble: 33 × 5
   Make      VehYear VehicleAge VehOdo VehBCost
   <chr>       <dbl>      <dbl>  <dbl>    <dbl>
 1 ACURA       2003.       6.52 81732.    9039.
 2 BUICK       2004.       5.65 76238.    6169.
 3 CADILLAC    2004.       5.24 73770.   10958.
 4 CHEVROLET   2006.       3.97 73390.    6835.
 5 CHRYSLER    2006.       3.65 66814.    6507.
 6 DODGE       2006.       3.75 68261.    7047.
 7 FORD        2005.       4.75 76749.    6403.
 8 GMC         2004.       5.61 79273.    8342.
 9 HONDA       2004.       5.33 77877.    8350.
10 HUMMER      2006        3    70809    11920
# ℹ 23 more rows
To add arguments to a function, you may need to use an anonymous function. In this syntax, the shorthand \(x) is equivalent to function(x).
cars_dbl %>% group_by(Make) %>% summarize(across(.cols = everything(), .fns = \(x) mean(x, na.rm = TRUE)))
# A tibble: 33 × 5
   Make      VehYear VehicleAge VehOdo VehBCost
   <chr>       <dbl>      <dbl>  <dbl>    <dbl>
 1 ACURA       2003.       6.52 81732.    9039.
 2 BUICK       2004.       5.65 76238.    6169.
 3 CADILLAC    2004.       5.24 73770.   10958.
 4 CHEVROLET   2006.       3.97 73390.    6835.
 5 CHRYSLER    2006.       3.65 66814.    6507.
 6 DODGE       2006.       3.75 68261.    7047.
 7 FORD        2005.       4.75 76749.    6403.
 8 GMC         2004.       5.61 79273.    8342.
 9 HONDA       2004.       5.33 77877.    8350.
10 HUMMER      2006        3    70809    11920
# ℹ 23 more rows
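To see why the extra argument matters, here is a sketch on the built-in airquality data (chosen for illustration because its Ozone and Solar.R columns contain missing values):

```r
library(dplyr)

# Plain mean: any column containing NA returns NA
with_na <- airquality %>%
  summarize(across(.cols = c(Ozone, Solar.R), .fns = mean))

# Anonymous function: na.rm = TRUE is passed on to mean()
without_na <- airquality %>%
  summarize(across(.cols = c(Ozone, Solar.R), .fns = \(x) mean(x, na.rm = TRUE)))

with_na
without_na
```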
Using different tidyselect options (e.g., starts_with(), ends_with(), contains()):
cars_dbl %>%
group_by(Make) %>%
summarize(across(.cols = starts_with("Veh"), .fns = mean))
# A tibble: 33 × 5
   Make      VehYear VehicleAge VehOdo VehBCost
   <chr>       <dbl>      <dbl>  <dbl>    <dbl>
 1 ACURA       2003.       6.52 81732.    9039.
 2 BUICK       2004.       5.65 76238.    6169.
 3 CADILLAC    2004.       5.24 73770.   10958.
 4 CHEVROLET   2006.       3.97 73390.    6835.
 5 CHRYSLER    2006.       3.65 66814.    6507.
 6 DODGE       2006.       3.75 68261.    7047.
 7 FORD        2005.       4.75 76749.    6403.
 8 GMC         2004.       5.61 79273.    8342.
 9 HONDA       2004.       5.33 77877.    8350.
10 HUMMER      2006        3    70809    11920
# ℹ 23 more rows
Combining with mutate(): rounding to the nearest multiple of a power of 10 (here the nearest 1,000, using a negative digits value)
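cars_dbl %>%
  mutate(across(
    .cols = starts_with("Veh"),
    # digits goes inside an anonymous function; passing extra arguments
    # to across() directly is deprecated as of dplyr 1.1.0
    .fns = \(x) round(x, digits = -3)
  ))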
# A tibble: 72,983 × 5
   Make       VehYear VehicleAge VehOdo VehBCost
   <chr>        <dbl>      <dbl>  <dbl>    <dbl>
 1 MAZDA         2000          0  89000     7000
 2 DODGE         2000          0  94000     8000
 3 DODGE         2000          0  74000     5000
 4 DODGE         2000          0  66000     4000
 5 FORD          2000          0  69000     4000
 6 MITSUBISHI    2000          0  81000     6000
 7 KIA           2000          0  65000     4000
 8 FORD          2000          0  66000     4000
 9 KIA           2000          0  50000     6000
10 FORD          2000          0  85000     8000
# ℹ 72,973 more rows
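How the digits argument of round() behaves, sketched on single numbers. The value 115717 is the maximum odometer reading shown earlier; 4.18 is the overall mean vehicle age:

```r
to_thousands <- round(115717, digits = -3) # nearest 1,000: 116000
to_hundreds  <- round(115717, digits = -2) # nearest 100:   115700
to_tenths    <- round(4.18, digits = 1)    # one decimal place: 4.2
```

Negative digits round to the left of the decimal point, which is why every VehicleAge value (all far below 500) becomes 0 in the output above.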
Combining with mutate(): the replace_na() function
replace_na({data frame}, {list of values}) or replace_na({vector}, {single value})
# Child mortality data
mort <-
read_csv("https://jhudatascience.org/intro_to_r/data/mortality.csv") %>%
rename(country = `...1`)
mort %>%
select(country, starts_with("194")) %>%
  mutate(across(
    .cols = c(`1943`, `1944`, `1945`),
    # replace = 0 goes inside an anonymous function (dplyr 1.1.0 and later)
    .fns = \(x) replace_na(x, replace = 0)
  ))
# A tibble: 197 × 11
   country `1940` `1941` `1942` `1943` `1944` `1945` `1946` `1947` `1948` `1949`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan…     NA     NA     NA      0      0      0     NA     NA     NA     NA
 2 Albania   1.53   1.31   1.48   1.46   1.43   1.40   1.37   1.41   1.37   1.34
 3 Algeria     NA     NA     NA      0      0      0     NA     NA     NA     NA
 4 Angola    4.46   4.46   4.46   4.34   4.34   4.34   4.33   4.22   4.22   4.21
 5 Argent…  0.641  0.603  0.602  0.558  0.551  0.510  0.503  0.496  0.494  0.492
 6 Armenia     NA     NA     NA      0      0      0     NA     NA     NA     NA
 7 Aruba       NA     NA     NA      0      0      0     NA     NA     NA     NA
 8 Austra…  0.263  0.275  0.276  0.299  0.260  0.271  0.295  0.279  0.271  0.271
 9 Austria  0.504  0.474  0.417  0.389  0.360  0.311  0.311  0.312  0.274  0.274
10 Azerba…     NA     NA     NA      0      0      0     NA     NA     NA     NA
# ℹ 187 more rows
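The two forms of replace_na() on their own, as a sketch with small made-up inputs (scores and the values in it are hypothetical):

```r
library(tidyr)

# Vector form: a single replacement value
filled_vector <- replace_na(c(1, NA, 3), 0)
filled_vector # 1 0 3

# Data frame form: a named list with one replacement per column
scores <- data.frame(a = c(1, NA), b = c(NA, "x"))
filled_df <- replace_na(scores, list(a = 0, b = "missing"))
filled_df
```

Inside across(), each column arrives as a vector, so the vector form is the one being used.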
mutate() and across()
If your function needs to span more than one line, it is better to define it first before using it inside mutate() and across().
times1000 <- function(x) x * 1000
airquality %>%
mutate(across(
.cols = everything(),
.fns = times1000
)) %>%
head(n = 2)
  Ozone Solar.R Wind  Temp Month  Day
1 41000  190000 7400 67000  5000 1000
2 36000  118000 8000 72000  5000 2000
airquality %>%
mutate(across(
.cols = everything(),
.fns = function(x) x * 1000
)) %>%
head(n = 2)
  Ozone Solar.R Wind  Temp Month  Day
1 41000  190000 7400 67000  5000 1000
2 36000  118000 8000 72000  5000 2000
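The examples above fit on one line. A sketch of a function that truly needs several lines, defined first and then passed to across(); center_column is a hypothetical helper, not part of dplyr:

```r
library(dplyr)

# Subtract the column mean (ignoring NAs) from every value
center_column <- function(x) {
  column_mean <- mean(x, na.rm = TRUE)
  x - column_mean
}

centered <- airquality %>%
  mutate(across(
    .cols = c(Wind, Temp),
    .fns = center_column
  ))

head(centered, n = 2)
```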
Why use across()?
A. Efficiency - faster and less repetitive
B. Calculate the cross product
C. Connect across datasets
purrr package
Similar to across(), purrr is a package that allows you to apply a function to multiple columns in a data frame or multiple data objects in a list.
A list in R is a generic class of data consisting of an ordered collection of objects. It can include any number of single numeric objects, vectors, or data frames – can be all the same class of objects or all different.
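A small sketch of a list holding objects of different classes; the element names are made up for illustration:

```r
my_list <- list(
  number = 3.14,
  words = c("a", "b"),
  data = head(airquality, n = 2)
)

length(my_list)                         # 3
element_classes <- sapply(my_list, class)
element_classes                         # "numeric" "character" "data.frame"
my_list$words                           # pull out one element by name
```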
While we won’t get into purrr too much in this class, it’s a handy package for you to know about should you get into a situation where you have an irregular list you need to handle!
Lists help us work with multiple data frames
AQ_list <- list(AQ1 = airquality, AQ2 = airquality, AQ3 = airquality)
str(AQ_list)
List of 3
 $ AQ1:'data.frame': 153 obs. of 6 variables:
  ..$ Ozone  : int [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
 $ AQ2:'data.frame': 153 obs. of 6 variables:
  ..$ Ozone  : int [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
 $ AQ3:'data.frame': 153 obs. of 6 variables:
  ..$ Ozone  : int [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
AQ_list %>% sapply(class)
         AQ1          AQ2          AQ3
"data.frame" "data.frame" "data.frame"
AQ_list %>% sapply(nrow)
AQ1 AQ2 AQ3
153 153 153
AQ_list %>% sapply(colMeans, na.rm = TRUE)
               AQ1        AQ2        AQ3
Ozone    42.129310  42.129310  42.129310
Solar.R 185.931507 185.931507 185.931507
Wind      9.957516   9.957516   9.957516
Temp     77.882353  77.882353  77.882353
Month     6.993464   6.993464   6.993464
Day      15.803922  15.803922  15.803922
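For comparison, a hedged sketch of the same kind of work with purrr's map functions, which state the type of result up front. It rebuilds AQ_list so the block runs on its own:

```r
library(purrr)

AQ_list <- list(AQ1 = airquality, AQ2 = airquality, AQ3 = airquality)

# map_int() returns an integer vector: one row count per data frame
row_counts <- map_int(AQ_list, nrow)
row_counts

# map_dbl() returns a numeric vector: mean Ozone per data frame
ozone_means <- map_dbl(AQ_list, \(df) mean(df$Ozone, na.rm = TRUE))
ozone_means

# map() always returns a list: here, the first two rows of each data frame
first_rows <- map(AQ_list, \(df) head(df, n = 2))
```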
Apply your functions with sapply(<a vector or list>, some_function)
Use across() to apply functions across multiple columns of data
Use across within summarize() or mutate()
Use sapply or purrr to work with multiple data frames within lists simultaneously
💻 Lab
Image by Gerd Altmann from Pixabay
Image by Allison Horst.