So far we’ve seen many functions, like c(), class(), filter(), dim(), …
Why create your own functions?
There may be code that you use multiple times. Creating a function can help cut down on repetitive code (and the chance for copy/paste errors).
library(tidyverse)  # provides %>%, group_by(), summarize(), ggplot(), read_csv(), write_csv()

data_insights <- function(x, column1, column2) {
  x_insight <- x %>%
    group_by({{column1}}) %>%
    summarize(mean = mean({{column2}}, na.rm = TRUE))
  return(x_insight)
}
data_insights(x = mtcars, column1 = cyl, column2 = hp)
# A tibble: 3 × 2
    cyl  mean
  <dbl> <dbl>
1     4  82.6
2     6 122.
3     8 209.
You may have a similar plot that you want to examine across columns of data.
simple_plots <- function(x, column1, column2) {
  box_plot <- ggplot(data = x, aes(x = {{column1}}, y = {{column2}}, group = {{column1}})) +
    geom_boxplot()
  return(box_plot)
}
simple_plots(x = mtcars, column1 = cyl, column2 = hp)
The general syntax for a function is:
function_name <- function(arg1, arg2, ...) { <function body> }
Here we will write a function that multiplies some number x by 2:
times_2 <- function(x) x * 2
Running the line of code above defines the function and makes it ready to use (no output yet!). Let’s test it!
times_2(x = 10)
[1] 20
{ }
Adding the curly brackets, {}, allows you to write functions spanning multiple lines:
times_2 <- function(x) {
  x * 2
}
times_2(x = 10)
[1] 20
is_even <- function(x) {
  x %% 2 == 0
}
is_even(x = 11)
[1] FALSE
is_even(x = times_2(x = 10))
[1] TRUE
return
If we want something specific for the function’s output, we use return():
times_2_plus_4 <- function(x) {
  output_int <- x * 2
  output <- output_int + 4
  return(output)
}
times_2_plus_4(x = 10)
[1] 24
If return() is not called, the last evaluated expression is returned.
return() should be the last step (any steps after it are not run).
times_2_plus_4 <- function(x) {
  output_int <- x * 2
  output <- output_int + 4
  print(paste("times2 result = ", output_int))
  return(output)
}
result <- times_2_plus_4(x = 10)
[1] "times2 result =  20"
result
[1] 24
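As a small illustration of why return() should be the last step, the sketch below uses a hypothetical function, early_exit() (not part of the original material), with a line placed after return(). That line is never run.

```r
# Hypothetical example: code placed after return() is never evaluated.
early_exit <- function(x) {
  output <- x * 2
  return(output)
  print("this line is never reached")  # skipped: the function has already returned
}

result <- early_exit(x = 10)
result
```

Nothing is printed by the call itself, and result holds the value handed back by return().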
Functions can take multiple inputs:
times_2_plus_y <- function(x, y) x * 2 + y
times_2_plus_y(x = 10, y = 3)
[1] 23
A function returns a single result, but that result can hold multiple outputs (here, a vector of two values).
x_and_y_plus_2 <- function(x, y) {
  output1 <- x + 2
  output2 <- y + 2
  return(c(output1, output2))
}
result <- x_and_y_plus_2(x = 10, y = 3)
result
[1] 12 5
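When the outputs need to be told apart, one option is to return a named list instead of a vector. The sketch below is a hypothetical variant, x_and_y_plus_2_list(), with illustrative element names that do not come from the original slides.

```r
# Hypothetical variant: return a named list so each output can be pulled out by name.
x_and_y_plus_2_list <- function(x, y) {
  output1 <- x + 2
  output2 <- y + 2
  return(list(x_plus_2 = output1, y_plus_2 = output2))
}

result <- x_and_y_plus_2_list(x = 10, y = 3)
result$x_plus_2  # the first output
result$y_plus_2  # the second output
```

A list is still one returned object, so this does not change the rule that a function returns a single result.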
Functions can have “default” arguments. This lets us call the function later without supplying those arguments:
times_2_plus_y <- function(x = 10, y = 3) x * 2 + y
times_2_plus_y()
[1] 23
times_2_plus_y(x = 11, y = 4)
[1] 26
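Defaults can also be overridden one at a time. The short sketch below reuses times_2_plus_y() as defined above and shows named and positional matching. The specific input values are only for illustration.

```r
times_2_plus_y <- function(x = 10, y = 3) x * 2 + y

only_y <- times_2_plus_y(y = 4)   # x keeps its default of 10
only_x <- times_2_plus_y(x = 11)  # y keeps its default of 3
by_pos <- times_2_plus_y(11, 4)   # unnamed arguments are matched by position

c(only_y, only_x, by_pos)
```

Naming the arguments, as in the first two calls, tends to be easier to read than relying on position.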
Let’s write a function, sqdif, that takes x and y, with default values of 2 and 3.
Functions can have any kind of input. Here is a function with characters:
loud <- function(word) {
  output <- rep(toupper(word), 5)
  return(output)
}
loud(word = "hooray!")
[1] "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!"
select(n) will choose column n:
get_index <- function(dat, row, col) {
  dat %>%
    filter(row_number() == row) %>%
    select(all_of(col))
}
get_index(dat = iris, row = 10, col = 5)
  Species
1  setosa
Including default values for arguments:
get_top <- function(dat, row = 1, col = 1) {
  dat %>%
    filter(row_number() == row) %>%
    select(all_of(col))
}
get_top(dat = iris)
  Sepal.Length
1          5.1
You can create a function with an argument that accepts a column name for select() or another dplyr operation:
clean_dataset <- function(dataset, col_name) {
  my_data_out <- dataset %>% select({{col_name}}) # Note the curly braces {{}}
  write_csv(my_data_out, "clean_data.csv")
  return(my_data_out)
}
clean_dataset(dataset = mtcars, col_name = "cyl")
                    cyl
Mazda RX4             6
Mazda RX4 Wag         6
Datsun 710            4
Hornet 4 Drive        6
Hornet Sportabout     8
Valiant               6
Duster 360            8
Merc 240D             4
Merc 230              4
Merc 280              6
Merc 280C             6
Merc 450SE            8
Merc 450SL            8
Merc 450SLC           8
Cadillac Fleetwood    8
Lincoln Continental   8
Chrysler Imperial     8
Fiat 128              4
Honda Civic           4
Toyota Corolla        4
Toyota Corona         4
Dodge Challenger      8
AMC Javelin           8
Camaro Z28            8
Pontiac Firebird      8
Fiat X1-9             4
Porsche 914-2         4
Lotus Europa          4
Ford Pantera L        8
Ferrari Dino          6
Maserati Bora         8
Volvo 142E            4
# Another example: get means and missing for a specific column
get_summary <- function(dataset, col_name) {
  dataset %>%
    summarise(mean = mean({{col_name}}, na.rm = TRUE),
              na_count = sum(is.na({{col_name}})))
}
get_summary(mtcars, hp)
      mean na_count
1 146.6875        0
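The same {{ }} pattern works in other dplyr verbs. As a further sketch, the hypothetical function count_by() below (not from the original slides) passes an unquoted column name through to group_by(). It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical helper: count rows per group, with the grouping column
# supplied by the caller as an unquoted name.
count_by <- function(dataset, group_col) {
  dataset %>%
    group_by({{ group_col }}) %>%   # {{ }} forwards the column the caller named
    summarize(n = n())
}

cyl_counts <- count_by(mtcars, cyl)
cyl_counts
```

Without the double curly braces, group_by() would look for a column literally named group_col rather than the column the caller asked for.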
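Simple functions take the form: NEW_FUNCTION <- function(x, y){x + y}
Defaults can be specified: function(x = 1, y = 2){x + y}
return() will provide a value as output.
print() will simply print the value on the screen but not save it.
{{double curly braces}} let an argument carry a column name into dplyr functions.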
💻 Lab
sapply() - a base R function
Now that you’ve made a function… you can “apply” functions easily with sapply()!
These functions take the form:
sapply(<a vector, list, data frame>, some_function)
sapply()
🚨 There are no parentheses on the functions! 🚨
You can also pipe into your function.
head(iris, n = 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
iris %>% sapply(class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
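The "no parentheses" rule applies to your own functions too. The sketch below redefines times_2() and a two-argument times_2_plus_y() so the block runs on its own. The input vector is only an illustration.

```r
times_2 <- function(x) x * 2
times_2_plus_y <- function(x, y) x * 2 + y

# Pass the function by name (no parentheses); sapply() calls it on each element.
doubled <- sapply(c(1, 2, 3), times_2)
doubled

# Arguments listed after the function are passed along to it on every call.
doubled_plus_1 <- sapply(c(1, 2, 3), times_2_plus_y, y = 1)
doubled_plus_1
```

Writing sapply(c(1, 2, 3), times_2()) instead would try to call times_2() with no input before sapply() even starts, which fails because x is missing.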
sapply()
cars <- read_csv("https://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv")
select(cars, VehYear:VehicleAge) %>%
  head()
# A tibble: 6 × 2
  VehYear VehicleAge
    <dbl>      <dbl>
1    2006          3
2    2004          5
3    2005          4
4    2004          5
5    2005          4
6    2004          5
select(cars, VehYear:VehicleAge) %>%
  sapply(times_2) %>%
  head()
     VehYear VehicleAge
[1,]    4012          6
[2,]    4008         10
[3,]    4010          8
[4,]    4008         10
[5,]    4010          8
[6,]    4008         10
A function can also be defined on the fly, without a name, inside the call to sapply(). This is also called an “anonymous function”.
select(cars, VehYear:VehicleAge) %>%
  sapply(function(x) x / 1000) %>%
  head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005
select(cars, VehYear:VehicleAge) %>%
  sapply(\(x) x / 1000) %>%
  head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005
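To make the equivalence concrete, the sketch below compares a named function, function(x), and the \(x) shorthand on a tiny stand-in data frame. Both small_cars and divide_by_1000() are hypothetical, made up so the example runs without downloading the cars data. The \(x) shorthand needs R 4.1 or later.

```r
# Hypothetical two-row stand-in for the cars columns used above.
small_cars <- data.frame(VehYear = c(2006, 2004), VehicleAge = c(3, 5))

divide_by_1000 <- function(x) x / 1000

named     <- sapply(small_cars, divide_by_1000)        # named function
anonymous <- sapply(small_cars, function(x) x / 1000)  # anonymous function
shorthand <- sapply(small_cars, \(x) x / 1000)         # backslash shorthand (R >= 4.1)

identical(named, anonymous)
identical(named, shorthand)
```

All three produce the same matrix, so the choice is mainly about readability: a named function is worth defining when it will be reused.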
mutate() and summarize()
We already know how to use functions to modify columns using mutate() or calculate summary statistics using summarize().
cars %>%
  mutate(VehOdo_round = round(VehOdo, -3)) %>%
  summarize(max_Odo_round = max(VehOdo_round),
            max_Odo = max(VehOdo))
# A tibble: 1 × 2
  max_Odo_round max_Odo
          <dbl>   <dbl>
1        116000  115717
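Your own functions can be used inside mutate() and summarize() in the same way as built-in ones such as round() and max(). The sketch below uses times_2() from earlier on the built-in mtcars data instead of cars, so that it runs without a download. The new column names are illustrative, and dplyr is assumed to be installed.

```r
library(dplyr)

times_2 <- function(x) x * 2

hp_summary <- mtcars %>%
  mutate(hp_doubled = times_2(hp)) %>%           # custom function on a column
  summarize(max_hp = max(hp),
            max_hp_doubled = max(hp_doubled))
hp_summary
```

This works because times_2() is vectorized: it takes a whole column and returns a column of the same length, which is what mutate() expects.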
across() function