An advanced subject: functions

Writing your own functions

So far we’ve seen many functions, like c(), class(), filter(), and dim().

Why create your own functions?

  • Cut down on repetitive code (easier to fix things!)
  • Organize code into manageable chunks
  • Avoid running code unintentionally
  • Use names that make sense to you

A practical example: summarization

There may be code that you use multiple times. Creating a function can help cut down on repetitive code (and the chance for copy/paste errors).

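library(dplyr) # provides %>%, group_by(), and summarize()

# Group x by column1, then take the mean of column2 within each group
data_insights <- function(x, column1, column2) {
    x_insight <- x %>%
      group_by({{column1}}) %>%
      summarize(mean = mean({{column2}}, na.rm = TRUE))
    return(x_insight)
}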

data_insights(x = mtcars, column1 = cyl, column2 = hp)
# A tibble: 3 × 2
    cyl  mean
  <dbl> <dbl>
1     4  82.6
2     6 122. 
3     8 209. 

A practical example: plotting

You may have a similar plot that you want to examine across columns of data.

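library(ggplot2) # provides ggplot(), aes(), and geom_boxplot()

# Draw one box of column2 values for each value of column1
simple_plots <- function(x, column1, column2) {
    box_plot <- ggplot(data = x, aes(x = {{column1}}, y = {{column2}}, group = {{column1}})) +
      geom_boxplot()
    return(box_plot)
}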

simple_plots(x = mtcars, column1 = cyl, column2 = hp)

Writing your own functions

The general syntax for a function is:

function_name <- function(arg1, arg2, ...) {
 <function body>
}

Writing your own functions

Here we will write a function that multiplies some number x by 2:

times_2 <- function(x) x * 2

When you run the line of code above, you make the function ready to use (no output yet!). Let’s test it!

times_2(x = 10)
[1] 20

Writing your own functions: { }

Adding the curly brackets - {} - lets you write a function body that spans multiple lines:

times_2 <- function(x) {
  x * 2
}
times_2(x = 10)
[1] 20
is_even <- function(x) {
  x %% 2 == 0
}
is_even(x = 11)
[1] FALSE
is_even(x = times_2(x = 10))
[1] TRUE

Writing your own functions: return

If we want something specific for the function’s output, we use return():

times_2_plus_4 <- function(x) {
  output_int <- x * 2
  output <- output_int + 4
  return(output)
}
times_2_plus_4(x = 10)
[1] 24

Writing your own functions: print intermediate steps

  • Printed results are not saved, but they can show what a function is doing.
  • Returned results can be saved to an object.
  • A function can return only one result but can print many.
  • If return() is not called, the value of the last evaluated expression is returned.
  • return() should be the last step (any steps after it are skipped).
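
The points above can be checked with a small sketch. The function names demo_return and demo_implicit are hypothetical and used only for illustration:

```r
# Hypothetical helper, for illustration only
demo_return <- function(x) {
  print("before return")  # shown on the console, not stored in the result
  return(x * 2)           # the function exits here
  print("after return")   # never runs
}

# No explicit return(): the value of the last expression is returned
demo_implicit <- function(x) {
  x * 2
}

result <- demo_return(x = 10)  # prints "before return" only
result                         # 20
demo_implicit(x = 10)          # 20
```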

Adding print

times_2_plus_4 <- function(x) {
  output_int <- x * 2
  output <- output_int + 4
  print(paste("times2 result = ", output_int))
  return(output)
}

result <- times_2_plus_4(x = 10)
[1] "times2 result =  20"
result
[1] 24

Writing your own functions: multiple inputs

Functions can take multiple inputs:

times_2_plus_y <- function(x, y) x * 2 + y
times_2_plus_y(x = 10, y = 3)
[1] 23

Writing your own functions: multiple outputs

A function returns a single object, but that object can hold multiple values (for example, a vector).

x_and_y_plus_2 <- function(x, y) {
  output1 <- x + 2
  output2 <- y + 2

  return(c(output1, output2))
}
result <- x_and_y_plus_2(x = 10, y = 3)
result
[1] 12  5
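
When the outputs have different types, one possible approach is to return a named list instead of a vector. This is a sketch only; the function name x_and_label is hypothetical:

```r
# Hypothetical function name, for illustration only
x_and_label <- function(x, y) {
  total <- x + y
  label <- paste("sum of", x, "and", y)
  # A list keeps a number and a character string together in one object
  return(list(total = total, label = label))
}

result <- x_and_label(x = 10, y = 3)
result$total  # 13
result$label  # "sum of 10 and 3"
```

Each element can then be pulled out by name with $.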

Writing your own functions: defaults

Functions can have “default” arguments. This lets us call the function later without supplying those arguments:

times_2_plus_y <- function(x = 10, y = 3) x * 2 + y
times_2_plus_y()
[1] 23
times_2_plus_y(x = 11, y = 4)
[1] 26
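
Defaults can also be overridden one at a time. A short sketch that redefines the same function so it runs on its own:

```r
times_2_plus_y <- function(x = 10, y = 3) x * 2 + y

times_2_plus_y(y = 5)  # x keeps its default of 10: 10 * 2 + 5 = 25
times_2_plus_y(x = 1)  # y keeps its default of 3: 1 * 2 + 3 = 5
times_2_plus_y(4, 1)   # unnamed arguments are matched by position: 4 * 2 + 1 = 9
```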

Writing another simple function

Let’s write a function, sqdif, that:

  1. takes two numbers x and y with default values of 2 and 3.
  2. takes the difference
  3. squares this difference
  4. then returns the final value
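
One possible solution, as a sketch. It assumes "the difference" means x minus y; the order does not affect the result because the difference is squared:

```r
sqdif <- function(x = 2, y = 3) {
  difference <- x - y       # step 2: take the difference
  output <- difference^2    # step 3: square it
  return(output)            # step 4: return the final value
}

sqdif()                # (2 - 3)^2 = 1
sqdif(x = 10, y = 5)   # (10 - 5)^2 = 25
```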

Writing your own functions: characters

Functions can take any kind of input. Here is a function that takes a character string:

loud <- function(word) {
  output <- rep(toupper(word), 5)
  return(output)
}
loud(word = "hooray!")
[1] "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!"

Functions for tibbles

select() with a number n chooses the nth column (here the number is passed through all_of(col)):

get_index <- function(dat, row, col) {
  dat %>%
    filter(row_number() == row) %>%
    select(all_of(col))
}

get_index(dat = iris, row = 10, col = 5)
  Species
1  setosa

Functions for tibbles

Including default values for arguments:

get_top <- function(dat, row = 1, col = 1) {
  dat %>%
    filter(row_number() == row) %>%
    select(all_of(col))
}

get_top(dat = iris)
  Sepal.Length
1          5.1

Functions for tibbles - curly braces

We can create a function with an argument that takes a column name for select() or another dplyr operation:

clean_dataset <- function(dataset, col_name) {
  my_data_out <- dataset %>% select({{col_name}}) # Note the curly braces {{}}
  write_csv(my_data_out, "clean_data.csv")
  return(my_data_out)
}

clean_dataset(dataset = mtcars, col_name = "cyl")
                    cyl
Mazda RX4             6
Mazda RX4 Wag         6
Datsun 710            4
Hornet 4 Drive        6
Hornet Sportabout     8
Valiant               6
Duster 360            8
Merc 240D             4
Merc 230              4
Merc 280              6
Merc 280C             6
Merc 450SE            8
Merc 450SL            8
Merc 450SLC           8
Cadillac Fleetwood    8
Lincoln Continental   8
Chrysler Imperial     8
Fiat 128              4
Honda Civic           4
Toyota Corolla        4
Toyota Corona         4
Dodge Challenger      8
AMC Javelin           8
Camaro Z28            8
Pontiac Firebird      8
Fiat X1-9             4
Porsche 914-2         4
Lotus Europa          4
Ford Pantera L        8
Ferrari Dino          6
Maserati Bora         8
Volvo 142E            4

Functions for tibbles - curly braces

# Another example: get means and missing for a specific column
get_summary <- function(dataset, col_name) {
    dataset %>%  
    summarise(mean = mean({{col_name}}, na.rm = TRUE),
              na_count = sum(is.na({{col_name}})))
}

get_summary(mtcars, hp)
      mean na_count
1 146.6875        0

Summary

  • Simple functions take the form:
    • NEW_FUNCTION <- function(x, y){x + y}
    • Can specify defaults like function(x = 1, y = 2){x + y}
    • return will provide a value as output
    • print will simply print the value on the screen but not save it
  • Specify a column (from a tibble) inside a function using {{double curly braces}}

Lab Part 1

Functions on multiple columns

Using your custom functions: sapply(), a base R function

Now that you’ve made a function… you can “apply” functions easily with sapply()!

A call to sapply() takes the form:

sapply(<a vector, list, data frame>, some_function)

Using your custom functions: sapply()

🚨 There are no parentheses after the function name inside sapply()! 🚨

You can also pipe your data into sapply().

head(iris, n = 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sapply(iris, class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 
iris %>% sapply(class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

Using your custom functions: sapply()

cars <- read_csv("https://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv")
select(cars, VehYear:VehicleAge) %>% head()
# A tibble: 6 × 2
  VehYear VehicleAge
    <dbl>      <dbl>
1    2006          3
2    2004          5
3    2005          4
4    2004          5
5    2005          4
6    2004          5
select(cars, VehYear:VehicleAge) %>%
  sapply(times_2) %>%
  head()
     VehYear VehicleAge
[1,]    4012          6
[2,]    4008         10
[3,]    4010          8
[4,]    4008         10
[5,]    4010          8
[6,]    4008         10
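
Note that the output above is a matrix (rows labelled [1,], [2,], ...), not a tibble. If a data frame is wanted instead, one option is lapply() followed by as.data.frame(). The sketch below uses a small made-up data frame named toy in place of the cars data:

```r
times_2 <- function(x) x * 2

# Small made-up data frame standing in for the two cars columns
toy <- data.frame(VehYear = c(2006, 2004), VehicleAge = c(3, 5))

# sapply() simplifies equal-length results into a matrix
doubled_matrix <- sapply(toy, times_2)
is.matrix(doubled_matrix)  # TRUE

# lapply() always returns a list, which converts cleanly to a data frame
doubled_df <- as.data.frame(lapply(toy, times_2))
is.data.frame(doubled_df)  # TRUE
doubled_df
```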

Using your custom functions “on the fly” to iterate

Also called an “anonymous function”.

select(cars, VehYear:VehicleAge) %>%
  sapply(function(x) x / 1000) %>%
  head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005

Anonymous functions: alternative syntax

The \(x) shorthand for function(x) is available in R 4.1.0 and later.

select(cars, VehYear:VehicleAge) %>%
  sapply(\(x) x / 1000) %>%
  head()
     VehYear VehicleAge
[1,]   2.006      0.003
[2,]   2.004      0.005
[3,]   2.005      0.004
[4,]   2.004      0.005
[5,]   2.005      0.004
[6,]   2.004      0.005

across

Using functions in mutate() and summarize()

We already know how to use functions to modify columns with mutate() or to calculate summary statistics with summarize().

cars %>%
  mutate(VehOdo_round = round(VehOdo, -3)) %>%
  summarize(max_Odo_round = max(VehOdo_round),
            max_Odo = max(VehOdo))
# A tibble: 1 × 2
  max_Odo_round max_Odo
          <dbl>   <dbl>
1        116000  115717
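
A custom function can be used inside mutate() in the same way as a built-in one such as round(). A minimal sketch, using a small made-up tibble named toy (the values are illustrative, not taken from the cars data):

```r
library(dplyr)

times_2 <- function(x) x * 2

# Small made-up tibble standing in for the cars data
toy <- tibble(VehOdo = c(89046, 93593, 73807))

toy_out <- toy %>%
  mutate(VehOdo_double = times_2(VehOdo)) %>%   # custom function on a column
  summarize(max_double = max(VehOdo_double))    # built-in function on the result
toy_out
```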

The across() function