So far we’ve seen many functions, like c(), class(), filter(), dim(), …
Why create your own functions?
There may be code that you use multiple times. Creating a function can help cut down on repetitive code (and the chance for copy/paste errors).
data_insights <- function(x, column1, column2) {
  x_insight <- x %>%
    group_by({{column1}}) %>%
    summarize(mean = mean({{column2}}, na.rm = TRUE))
  return(x_insight)
}

data_insights(x = mtcars, column1 = cyl, column2 = hp)

# A tibble: 3 × 2
    cyl  mean
  <dbl> <dbl>
1     4  82.6
2     6 122.
3     8 209.
You may have a similar plot that you want to examine across columns of data.
simple_plots <- function(x, column1, column2) {
  box_plot <- ggplot(data = x, aes(x = {{column1}}, y = {{column2}}, group = {{column1}})) +
    geom_boxplot()
  return(box_plot)
}

simple_plots(x = mtcars, column1 = cyl, column2 = hp)
The general syntax for a function is:
function_name <- function(arg1, arg2, ...) { <function body> }
Here we will write a function that multiplies some number x by 2:
times_2 <- function(x) x * 2
When you run the line of code above, you make it ready to use (no output yet!). Let’s test it!
times_2(x = 10)
[1] 20
{ }
Adding curly brackets, {}, allows you to write functions that span multiple lines:
times_2 <- function(x) {
  x * 2
}

times_2(x = 10)
[1] 20
is_even <- function(x) {
  x %% 2 == 0
}

is_even(x = 11)
[1] FALSE
is_even(x = times_2(x = 10))
[1] TRUE
return
If we want something specific for the function’s output, we use return():
times_2_plus_4 <- function(x) {
  output_int <- x * 2
  output <- output_int + 4
  return(output)
}

times_2_plus_4(x = 10)
[1] 24
- If return() is not called, the last evaluated expression is returned.
- return() should be the last step (any steps after it are not run).

times_2_plus_4 <- function(x) {
  output_int <- x * 2
  output <- output_int + 4
  print(paste("times2 result = ", output_int))
  return(output)
}

result <- times_2_plus_4(x = 10)
[1] "times2 result =  20"
result
[1] 24
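To make the two points about return() concrete, here is a minimal sketch. The function names times_2_implicit and times_2_early are made up for this illustration and are not part of the lesson code.

```r
# Without return(), the value of the last evaluated expression is returned.
times_2_implicit <- function(x) {
  x * 2
}

# The function exits as soon as return() is reached,
# so the print() call below it never runs.
times_2_early <- function(x) {
  return(x * 2)
  print("this line is never reached")
}

times_2_implicit(x = 10)
times_2_early(x = 10)
```

Both calls give the same value; the only difference is how the function hands it back.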
Functions can take multiple inputs:
times_2_plus_y <- function(x, y) x * 2 + y
times_2_plus_y(x = 10, y = 3)
[1] 23
Functions can have one returned result with multiple outputs.
x_and_y_plus_2 <- function(x, y) {
  output1 <- x + 2
  output2 <- y + 2
  return(c(output1, output2))
}

result <- x_and_y_plus_2(x = 10, y = 3)
result
[1] 12 5
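c() works well when the outputs are all the same type. When they are not (say, a number and a label), one common option is to return a named list instead. The sketch below assumes a hypothetical function, x_and_label, invented for this example.

```r
# Hypothetical example: return a number and a character label together.
x_and_label <- function(x, y) {
  total <- x + y
  label <- paste("sum of", x, "and", y)
  # A named list keeps each output in its own type
  return(list(total = total, label = label))
}

result <- x_and_label(x = 10, y = 3)
result$total
result$label
```

Each piece can then be pulled out by name with $.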
Functions can have “default” arguments. This lets us use the function without using an argument later:
times_2_plus_y <- function(x = 10, y = 3) x * 2 + y
times_2_plus_y()
[1] 23
times_2_plus_y(x = 11, y = 4)
[1] 26
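Defaults can also be overridden one at a time. A short sketch using the same times_2_plus_y definition (the variable names only_y and only_x are just for illustration):

```r
times_2_plus_y <- function(x = 10, y = 3) x * 2 + y

# Override only y; x keeps its default of 10
only_y <- times_2_plus_y(y = 5)

# An unnamed argument is matched by position, so 4 is used for x
only_x <- times_2_plus_y(4)

only_y
only_x
```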
Let’s write a function, sqdif, that takes two arguments, x and y, with default values of 2 and 3.

Functions can have any kind of input. Here is a function with characters:
loud <- function(word) {
  output <- rep(toupper(word), 5)
  return(output)
}

loud(word = "hooray!")
[1] "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!"
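Going back to the sqdif exercise: the slide only states the argument names and defaults, so the body below is an assumption based on the function's name (the squared difference of x and y). Treat it as one possible answer, not the official solution.

```r
# Assumed behavior: square the difference between x and y.
# Only the names x, y and the defaults 2 and 3 come from the exercise text.
sqdif <- function(x = 2, y = 3) {
  difference <- x - y
  return(difference^2)
}

sqdif()                # uses the defaults
sqdif(x = 10, y = 5)   # overrides both defaults
```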
select(n) will choose column n:
get_index <- function(dat, row, col) {
  dat %>%
    filter(row_number() == row) %>%
    select(all_of(col))
}

get_index(dat = iris, row = 10, col = 5)

  Species
1  setosa
Including default values for arguments:
get_top <- function(dat, row = 1, col = 1) {
  dat %>%
    filter(row_number() == row) %>%
    select(all_of(col))
}

get_top(dat = iris)

  Sepal.Length
1          5.1
You can create a function with an argument that takes a column name for select() or another dplyr operation:
clean_dataset <- function(dataset, col_name) {
  my_data_out <- dataset %>% select({{col_name}}) # Note the curly braces {{}}
  write_csv(my_data_out, "clean_data.csv")
  return(my_data_out)
}

clean_dataset(dataset = mtcars, col_name = "cyl")

                    cyl
Mazda RX4             6
Mazda RX4 Wag         6
Datsun 710            4
Hornet 4 Drive        6
Hornet Sportabout     8
Valiant               6
Duster 360            8
Merc 240D             4
Merc 230              4
Merc 280              6
Merc 280C             6
Merc 450SE            8
Merc 450SL            8
Merc 450SLC           8
Cadillac Fleetwood    8
Lincoln Continental   8
Chrysler Imperial     8
Fiat 128              4
Honda Civic           4
Toyota Corolla        4
Toyota Corona         4
Dodge Challenger      8
AMC Javelin           8
Camaro Z28            8
Pontiac Firebird      8
Fiat X1-9             4
Porsche 914-2         4
Lotus Europa          4
Ford Pantera L        8
Ferrari Dino          6
Maserati Bora         8
Volvo 142E            4
# Another example: get means and missing for a specific column
get_summary <- function(dataset, col_name) {
  dataset %>%
    summarise(
      mean = mean({{col_name}}, na.rm = TRUE),
      na_count = sum(is.na({{col_name}}))
    )
}

get_summary(mtcars, hp)

      mean na_count
1 146.6875        0
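The {{ }} version above expects an unquoted column name such as hp. If the column name arrives as a character string (for example, from a vector of names), one option is the .data pronoun that dplyr provides. The function name get_summary_chr below is made up for this sketch.

```r
library(dplyr)

# Variant of get_summary() that takes the column name as a string.
# .data[[col_name]] looks the column up by its name inside the data frame.
get_summary_chr <- function(dataset, col_name) {
  dataset %>%
    summarise(
      mean = mean(.data[[col_name]], na.rm = TRUE),
      na_count = sum(is.na(.data[[col_name]]))
    )
}

hp_summary <- get_summary_chr(mtcars, "hp")
hp_summary
```

This should give the same summary as get_summary(mtcars, hp); which style to use depends on whether callers will type column names directly or pass them around as strings.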
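- Simple functions take the form: NEW_FUNCTION <- function(x, y){x + y}
- Defaults can be specified, e.g., function(x = 1, y = 2){x + y}
- return() will provide a value as output
- print() will simply print the value on the screen but not save it
- Use {{double curly braces}} to pass a column name as an argument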
💻 Lab
sapply() - a base R function
Now that you’ve made a function… you can “apply” functions easily with sapply()!
These functions take the form:
sapply(<a vector, list, data frame>, some_function)
🚨 There are no parentheses on the functions! 🚨
You can also pipe into your function.
head(iris, n = 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
iris %>% sapply(class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
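sapply() works the same way with functions you wrote yourself. A small sketch on a plain vector, reusing the times_2 definition from earlier (the input values are arbitrary):

```r
times_2 <- function(x) x * 2

# Pass the function by name: no parentheses after times_2
doubled <- sapply(c(1, 5, 10), times_2)
doubled

# If the input has names, the result keeps them
named_doubled <- sapply(c(a = 1, b = 5), times_2)
named_doubled
```

Writing times_2() with parentheses would try to call the function right away, with no x, instead of handing it to sapply().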
cars <- read_csv("https://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv")
select(cars, VehYear:VehicleAge) %>% head()

# A tibble: 6 × 2
  VehYear VehicleAge
    <dbl>      <dbl>
1    2006          3
2    2004          5
3    2005          4
4    2004          5
5    2005          4
6    2004          5
select(cars, VehYear:VehicleAge) %>% sapply(times_2) %>% head()
     VehYear VehicleAge
[1,]    4012          6
[2,]    4008         10
[3,]    4010          8
[4,]    4008         10
[5,]    4010          8
[6,]    4008         10
You can also define a function “on the fly” inside sapply(), without naming it first. This is also called an “anonymous function”.
select(cars, VehYear:VehicleAge) %>% sapply(function(x) x / 1000) %>% head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005
select(cars, VehYear:VehicleAge) %>% sapply(\(x) x / 1000) %>% head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005
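To see that a named function, function(x), and the \(x) shorthand all do the same job, here is a quick check on a small vector. The object names are made up for this sketch, and the \(x) form assumes R 4.1 or newer.

```r
values <- c(2006, 2004, 2005)

# 1. A named function defined ahead of time
divide_1000 <- function(x) x / 1000
res_named <- sapply(values, divide_1000)

# 2. An anonymous function written in place
res_anon <- sapply(values, function(x) x / 1000)

# 3. The backslash shorthand (assumes R >= 4.1)
res_short <- sapply(values, \(x) x / 1000)

# All three give the same result
identical(res_named, res_anon)
identical(res_anon, res_short)
```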
mutate() and summarize()
You already know how to use functions to modify columns using mutate() or calculate summary statistics using summarize().
cars %>%
  mutate(VehOdo_round = round(VehOdo, -3)) %>%
  summarize(max_Odo_round = max(VehOdo_round), max_Odo = max(VehOdo))

# A tibble: 1 × 2
  max_Odo_round max_Odo
          <dbl>   <dbl>
1        116000  115717
across() function
Image by Allison Horst.
across from dplyr
across() makes it easy to apply the same transformation to multiple columns. It is usually used with summarize() or mutate().
summarize(across( .cols = <columns>, .fns = function))
or
mutate(across(.cols = <columns>, .fns = function))
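- .cols = lists the columns to work on
- .fns = gives the function to apply to each of those columns
- If the function needs additional arguments (e.g., na.rm = TRUE), the function may need to be modified to an anonymous function, e.g., \(x) mean(x, na.rm = TRUE)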
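As a small illustration of why the anonymous function matters, the sketch below uses the built-in airquality data (chosen here only because it needs no download). Ozone and Solar.R both contain missing values.

```r
library(dplyr)

# Passing mean by name: columns with missing values come back as NA
aq_raw <- airquality %>%
  summarize(across(.cols = c(Ozone, Solar.R), .fns = mean))

# Wrapping mean() in an anonymous function lets us pass na.rm = TRUE
aq_means <- airquality %>%
  summarize(across(
    .cols = c(Ozone, Solar.R),
    .fns = \(x) mean(x, na.rm = TRUE)
  ))

aq_raw
aq_means
```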
Combining with summarize()
cars_dbl <- cars %>% select(Make, starts_with("Veh"))

cars_dbl %>%
  summarize(across(.cols = everything(), .fns = mean)) # no parentheses

# A tibble: 1 × 5
   Make VehYear VehicleAge VehOdo VehBCost
  <dbl>   <dbl>      <dbl>  <dbl>    <dbl>
1    NA   2005.       4.18 71500.    6731.
You can use across() with other tidyverse functions like group_by()!
cars_dbl %>%
  group_by(Make) %>%
  summarize(across(.cols = everything(), .fns = mean)) # no parentheses

# A tibble: 33 × 5
   Make      VehYear VehicleAge VehOdo VehBCost
   <chr>       <dbl>      <dbl>  <dbl>    <dbl>
 1 ACURA       2003.       6.52 81732.    9039.
 2 BUICK       2004.       5.65 76238.    6169.
 3 CADILLAC    2004.       5.24 73770.   10958.
 4 CHEVROLET   2006.       3.97 73390.    6835.
 5 CHRYSLER    2006.       3.65 66814.    6507.
 6 DODGE       2006.       3.75 68261.    7047.
 7 FORD        2005.       4.75 76749.    6403.
 8 GMC         2004.       5.61 79273.    8342.
 9 HONDA       2004.       5.33 77877.    8350.
10 HUMMER      2006        3    70809    11920
# ℹ 23 more rows
To add arguments to a function, you may need to use an anonymous function. In this syntax, the shorthand \(x) is equivalent to function(x).
cars_dbl %>%
  group_by(Make) %>%
  summarize(across(.cols = everything(), .fns = \(x) mean(x, na.rm = TRUE)))

# A tibble: 33 × 5
   Make      VehYear VehicleAge VehOdo VehBCost
   <chr>       <dbl>      <dbl>  <dbl>    <dbl>
 1 ACURA       2003.       6.52 81732.    9039.
 2 BUICK       2004.       5.65 76238.    6169.
 3 CADILLAC    2004.       5.24 73770.   10958.
 4 CHEVROLET   2006.       3.97 73390.    6835.
 5 CHRYSLER    2006.       3.65 66814.    6507.
 6 DODGE       2006.       3.75 68261.    7047.
 7 FORD        2005.       4.75 76749.    6403.
 8 GMC         2004.       5.61 79273.    8342.
 9 HONDA       2004.       5.33 77877.    8350.
10 HUMMER      2006        3    70809    11920
# ℹ 23 more rows
Using different tidyselect options (e.g., starts_with(), ends_with(), contains())
cars_dbl %>%
  group_by(Make) %>%
  summarize(across(.cols = starts_with("Veh"), .fns = mean))

# A tibble: 33 × 5
   Make      VehYear VehicleAge VehOdo VehBCost
   <chr>       <dbl>      <dbl>  <dbl>    <dbl>
 1 ACURA       2003.       6.52 81732.    9039.
 2 BUICK       2004.       5.65 76238.    6169.
 3 CADILLAC    2004.       5.24 73770.   10958.
 4 CHEVROLET   2006.       3.97 73390.    6835.
 5 CHRYSLER    2006.       3.65 66814.    6507.
 6 DODGE       2006.       3.75 68261.    7047.
 7 FORD        2005.       4.75 76749.    6403.
 8 GMC         2004.       5.61 79273.    8342.
 9 HONDA       2004.       5.33 77877.    8350.
10 HUMMER      2006        3    70809    11920
# ℹ 23 more rows
Combining with mutate(): rounding to the nearest 1,000 (a negative digits value rounds to a power of 10)
cars_dbl %>%
  mutate(across(
    .cols = starts_with("Veh"),
    .fns = round,
    digits = -3
  ))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(.cols = starts_with("Veh"), .fns = round, digits = -3)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

# A tibble: 72,983 × 5
   Make       VehYear VehicleAge VehOdo VehBCost
   <chr>        <dbl>      <dbl>  <dbl>    <dbl>
 1 MAZDA         2000          0  89000     7000
 2 DODGE         2000          0  94000     8000
 3 DODGE         2000          0  74000     5000
 4 DODGE         2000          0  66000     4000
 5 FORD          2000          0  69000     4000
 6 MITSUBISHI    2000          0  81000     6000
 7 KIA           2000          0  65000     4000
 8 FORD          2000          0  66000     4000
 9 KIA           2000          0  50000     6000
10 FORD          2000          0  85000     8000
# ℹ 72,973 more rows
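The warning asks for the extra argument to move inside an anonymous function. A minimal sketch of that form follows; toy_cars is a made-up two-row data frame standing in for cars_dbl so the example runs without downloading the data.

```r
library(dplyr)

# Toy odometer and cost values, invented for this illustration
toy_cars <- data.frame(
  VehOdo = c(89046, 93593),
  VehBCost = c(7100, 7600)
)

# digits = -3 is now passed inside the anonymous function,
# not as an extra argument to across()
toy_rounded <- toy_cars %>%
  mutate(across(
    .cols = starts_with("Veh"),
    .fns = \(x) round(x, digits = -3)
  ))

toy_rounded
```

Written this way, the call should not trigger the deprecation warning in dplyr 1.1.0 and later.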
Combining with mutate() - the replace_na function
replace_na({data frame}, {list of values}) or replace_na({vector}, {single value})
# Child mortality data
mort <- read_csv("https://jhudatascience.org/intro_to_r/data/mortality.csv") %>%
  rename(country = `...1`)

mort %>%
  select(country, starts_with("194")) %>%
  mutate(across(
    .cols = c(`1943`, `1944`, `1945`),
    .fns = replace_na,
    replace = 0
  ))

# A tibble: 197 × 11
   country `1940` `1941` `1942` `1943` `1944` `1945` `1946` `1947` `1948` `1949`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan…     NA     NA     NA      0      0      0     NA     NA     NA     NA
 2 Albania   1.53   1.31   1.48   1.46   1.43   1.40   1.37   1.41   1.37   1.34
 3 Algeria     NA     NA     NA      0      0      0     NA     NA     NA     NA
 4 Angola    4.46   4.46   4.46   4.34   4.34   4.34   4.33   4.22   4.22   4.21
 5 Argent…  0.641  0.603  0.602  0.558  0.551  0.510  0.503  0.496  0.494  0.492
 6 Armenia     NA     NA     NA      0      0      0     NA     NA     NA     NA
 7 Aruba       NA     NA     NA      0      0      0     NA     NA     NA     NA
 8 Austra…  0.263  0.275  0.276  0.299  0.260  0.271  0.295  0.279  0.271  0.271
 9 Austria  0.504  0.474  0.417  0.389  0.360  0.311  0.311  0.312  0.274  0.274
10 Azerba…     NA     NA     NA      0      0      0     NA     NA     NA     NA
# ℹ 187 more rows
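The same replace_na() step can be written with an anonymous function, which is the form dplyr recommends since version 1.1.0. The sketch below uses toy_mort, a tiny made-up data frame with the same kind of year-named columns, so it runs without reading the mortality file.

```r
library(dplyr)
library(tidyr)

# Made-up values; check.names = FALSE keeps the year column names as-is
toy_mort <- data.frame(
  country = c("A", "B"),
  `1943` = c(NA, 1.46),
  `1944` = c(NA, 1.43),
  check.names = FALSE
)

# replace_na(x, 0) is applied to each selected column in turn
toy_filled <- toy_mort %>%
  mutate(across(
    .cols = c(`1943`, `1944`),
    .fns = \(x) replace_na(x, 0)
  ))

toy_filled
```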
mutate and across
If your function needs to span more than one line, it is better to define it first before using it inside mutate() and across().
times1000 <- function(x) x * 1000

airquality %>%
  mutate(across(
    .cols = everything(),
    .fns = times1000
  )) %>%
  head(n = 2)

  Ozone Solar.R Wind  Temp Month  Day
1 41000  190000 7400 67000  5000 1000
2 36000  118000 8000 72000  5000 2000

airquality %>%
  mutate(across(
    .cols = everything(),
    .fns = function(x) x * 1000
  )) %>%
  head(n = 2)

  Ozone Solar.R Wind  Temp Month  Day
1 41000  190000 7400 67000  5000 1000
2 36000  118000 8000 72000  5000 2000
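Here is what "define it first" can look like for a function that really does need several lines. rescale_01 is a hypothetical helper written for this sketch; it rescales a column so its smallest value becomes 0 and its largest becomes 1.

```r
library(dplyr)

# Hypothetical multi-line helper, defined before it is used
rescale_01 <- function(x) {
  x_min <- min(x, na.rm = TRUE)
  x_max <- max(x, na.rm = TRUE)
  (x - x_min) / (x_max - x_min)
}

# Pass the helper by name inside across()
aq_scaled <- airquality %>%
  mutate(across(
    .cols = c(Wind, Temp),
    .fns = rescale_01
  ))

range(aq_scaled$Wind)
range(aq_scaled$Temp)
```

Keeping the helper separate makes the mutate() call easier to read and lets you test the function on its own.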
Why use across()?
A. Efficiency - faster and less repetitive
B. Calculate the cross product
C. Connect across datasets
purrr package
Similar to across(), purrr is a package that allows you to apply a function to multiple columns in a data frame or multiple data objects in a list.
A list in R is a generic class of data consisting of an ordered collection of objects. It can include any number of single numeric objects, vectors, or data frames. The objects can all be the same class or all different.
While we won’t get into purrr too much in this class, it’s a handy package for you to know about should you get into a situation where you have an irregular list you need to handle!
Lists help us work with multiple data frames
AQ_list <- list(AQ1 = airquality, AQ2 = airquality, AQ3 = airquality)
str(AQ_list)

List of 3
 $ AQ1:'data.frame': 153 obs. of 6 variables:
  ..$ Ozone  : int [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
 $ AQ2:'data.frame': 153 obs. of 6 variables:
  ..$ Ozone  : int [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
 $ AQ3:'data.frame': 153 obs. of 6 variables:
  ..$ Ozone  : int [1:153] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:153] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:153] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:153] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:153] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:153] 1 2 3 4 5 6 7 8 9 10 ...
sapply
AQ_list %>% sapply(class)
         AQ1          AQ2          AQ3
"data.frame" "data.frame" "data.frame"
AQ_list %>% sapply(nrow)
AQ1 AQ2 AQ3
153 153 153
AQ_list %>% sapply(colMeans, na.rm = TRUE)
               AQ1        AQ2        AQ3
Ozone    42.129310  42.129310  42.129310
Solar.R 185.931507 185.931507 185.931507
Wind      9.957516   9.957516   9.957516
Temp     77.882353  77.882353  77.882353
Month     6.993464   6.993464   6.993464
Day      15.803922  15.803922  15.803922
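For comparison, here is roughly how the same list could be handled with purrr, which was mentioned above but not demonstrated. This is only a sketch: map() always returns a list, while map_int() returns an integer vector.

```r
library(purrr)

AQ_list <- list(AQ1 = airquality, AQ2 = airquality, AQ3 = airquality)

# map_int() expects each result to be a single integer
n_rows <- map_int(AQ_list, nrow)
n_rows

# map() returns a list with one element per data frame
col_means <- map(AQ_list, \(df) colMeans(df, na.rm = TRUE))
col_means$AQ1
```

Unlike sapply(), the purrr functions state the output type in their name, which can make results more predictable when list elements are irregular.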
- Apply your functions with sapply(<a vector or list>, some_function)
- Use across() to apply functions across multiple columns of data
- Use across() within summarize() or mutate()
- Use sapply() or purrr to work with multiple data frames within lists simultaneously

💻 Lab
Image by Gerd Altmann from Pixabay