Part 1

Load all the libraries we will use in this lab.

library(readr)
library(dplyr)
library(ggplot2)

1.1

Create a function that takes one argument, a vector, and returns the square of the sum of its elements. Call it “sum_squared”. Test your function on the vector c(2,7,21,30,90) - you should get the answer 22500.

# General format
NEW_FUNCTION <- function(x, y) x + y 

or

# General format
NEW_FUNCTION <- function(x, y){
result <- x + y 
return(result)
}
nums <- c(2, 7, 21, 30, 90)

sum_squared <- function(x) sum(x)^2
sum_squared(x = nums)
## [1] 22500
# Equivalent version with braces and an explicit return()
sum_squared <- function(x) {
  out <- sum(x)^2
  return(out)
}
sum_squared(x = nums)
## [1] 22500
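
The order of operations matters here: sum(x)^2 is the square of the sum, which is not the same as sum(x^2), the sum of the squares. The sketch below contrasts the two; sum_of_squares is a hypothetical name introduced only for comparison and is not part of the lab.

```r
nums <- c(2, 7, 21, 30, 90)

# Square of the sum: add the elements first, then square the total
sum_squared <- function(x) sum(x)^2

# Hypothetical comparison function: square each element, then add them up
sum_of_squares <- function(x) sum(x^2)

sum_squared(nums)     # (2 + 7 + 21 + 30 + 90)^2 = 150^2 = 22500
sum_of_squares(nums)  # 4 + 49 + 441 + 900 + 8100 = 9494
```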

1.2

Create a function that takes two arguments, (1) a vector and (2) a numeric value. This function tests whether the number (2) is contained within the vector (1). Hint: use %in%. Call it has_n. Test your function on the vector c(2,7,21,30,90) and number 21 - you should get the answer TRUE.

nums <- c(2, 7, 21, 30, 90)
a_num <- 21

has_n <- function(x, n) n %in% x
has_n(x = nums, n = a_num)
## [1] TRUE
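
Because %in% is vectorized over its left-hand side, has_n also accepts several numbers at once and returns one TRUE/FALSE per number. The lab does not ask for this; it is shown only as a sketch of how the function behaves with a longer n.

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n) n %in% x

# One result per element of n
has_n(x = nums, n = c(21, 11, 90))   # TRUE FALSE TRUE

# Collapse to a single answer with any() or all()
any(has_n(x = nums, n = c(21, 11)))  # TRUE: at least one value is in nums
all(has_n(x = nums, n = c(21, 11)))  # FALSE: 11 is not in nums
```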

1.3

Amend the function has_n from question 1.2 so that it takes a default value of 21 for the numeric argument.

nums <- c(2, 7, 21, 30, 90)
a_num <- 21

has_n <- function(x, n = 21) n %in% x
has_n(x = nums)
## [1] TRUE

1.4

Create a new number b_num that is not contained within nums. Use your updated has_n function with the default value and add b_num as the n argument when calling the function. What is the outcome?

b_num <- 11
has_n(x = nums, n = b_num)
## [1] FALSE
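
A default value is used only when the caller leaves that argument out; any value supplied in the call replaces it. The short sketch below shows the three common ways of calling has_n, using the same definition as in 1.3.

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n = 21) n %in% x

has_n(nums)          # n is omitted, so the default 21 is used: TRUE
has_n(nums, 11)      # the second positional argument is matched to n: FALSE
has_n(n = 90, nums)  # named arguments are matched first, so nums becomes x: TRUE
```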

Part 2

2.1

Read in the SARS-CoV-2 Vaccination data from http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv. Assign the data the name “vacc”.

vacc <- read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv")
# If downloaded
# vacc <- read_csv("USA_covid19_vaccinations.csv")

2.2

We want to get some summary statistics on the Moderna vaccines. Use across() inside summarize() to get the total number of vaccine doses for every variable whose name contains the word “Moderna” AND starts with “Total”. Hint: use contains() AND starts_with() to select the right columns inside across(). Keep in mind that the data include a row for the United States as a whole, so these sums are not totally accurate! Remember that NA values can influence calculations.

# General format
data %>%
  summarize(across(
    .cols = {vector or tidyselect},
    .fns = {some function},
    {additional arguments}
  ))
# First attempt: without na.rm = TRUE, any column containing an NA sums to NA
vacc %>%
  summarize(across(
    .cols = contains("Moderna") & starts_with("Total"),
    .fns = sum
  ))
## # A tibble: 1 × 4
##   Total Number of Moderna doses …¹ Total Number of Mode…² Total Count People w…³
##                              <dbl>                  <dbl>                  <dbl>
## 1                        482227080              403816391                     NA
## # ℹ abbreviated names: ¹​`Total Number of Moderna doses delivered`,
## #   ²​`Total Number of Moderna doses administered`,
## #   ³​`Total Count People w/Booster Primary Moderna Minus TX`
## # ℹ 1 more variable:
## #   `Total Count People w/Booster Booster Moderna Minus TX` <dbl>
# Pass na.rm = TRUE through an anonymous function.
# Supplying it via the `...` argument of across() is deprecated as of dplyr 1.1.0.
vacc %>%
  summarize(across(
    .cols = contains("Moderna") & starts_with("Total"),
    .fns = function(x) sum(x, na.rm = TRUE)
  ))
## # A tibble: 1 × 4
##   Total Number of Moderna doses …¹ Total Number of Mode…² Total Count People w…³
##                              <dbl>                  <dbl>                  <dbl>
## 1                        482227080              403816391               29736587
## # ℹ abbreviated names: ¹​`Total Number of Moderna doses delivered`,
## #   ²​`Total Number of Moderna doses administered`,
## #   ³​`Total Count People w/Booster Primary Moderna Minus TX`
## # ℹ 1 more variable:
## #   `Total Count People w/Booster Booster Moderna Minus TX` <dbl>
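
To see how the two selection helpers combine, it can help to try them on a tiny table first. The sketch below uses a made-up tibble named toy; its column names only imitate the style of the vacc columns and are assumptions, not real variables from the data set.

```r
library(dplyr)

# Made-up data: only the first column both contains "Moderna" and starts with "Total"
toy <- tibble(
  `Total Moderna delivered` = c(10, 20, NA),
  `Total Pfizer delivered` = c(5, 5, 5),
  `Moderna per 100k` = c(1, 2, 3)
)

toy_sums <- toy %>%
  summarize(across(
    .cols = contains("Moderna") & starts_with("Total"),
    .fns = function(x) sum(x, na.rm = TRUE)  # the NA is dropped before summing
  ))

toy_sums  # a single column, `Total Moderna delivered`, with the value 30
```

Using & keeps only columns matched by both helpers; swapping it for | would keep columns matched by either one.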

2.3

Use across and mutate to convert all columns containing the word “Percent” into proportions (i.e., divide that value by 100). Hint: use contains() to select the right columns within across(). Use a “function on the fly” to divide by 100. It will also be easier to check your work if you select() columns that match “Percent”.

vacc %>%
  mutate(across(
    .cols = contains("Percent"),
    .fns = function(x) x / 100
  )) %>%
  select(contains("Percent"))
## # A tibble: 64 × 34
##    Percent of Total Pop with at …¹ Percent of 18+ Pop w…² Percent of Total Pop…³
##                              <dbl>                  <dbl>                  <dbl>
##  1                           0.746                  0.866                  0.627
##  2                           0.657                  0.776                  0.568
##  3                           0.593                  0.711                  0.482
##  4                           0.636                  0.751                  0.518
##  5                           0.876                  0.95                   0.76 
##  6                           0.684                  0.792                  0.577
##  7                          NA                     NA                     NA    
##  8                           0.843                  0.95                   0.67 
##  9                           0.754                  0.869                  0.668
## 10                           0.905                  0.95                   0.755
## # ℹ 54 more rows
## # ℹ abbreviated names:
## #   ¹​`Percent of Total Pop with at least One Dose by State of Residence`,
## #   ²​`Percent of 18+ Pop with at least One Dose by State of Residence`,
## #   ³​`Percent of Total Pop Fully Vaccinated by State of Residence`
## # ℹ 31 more variables:
## #   `Percent of 18+ Pop Fully Vaccinated by State of Residence` <dbl>, …
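
By default, mutate(across(...)) overwrites the selected columns. If you would rather keep the original percentages next to the new proportions, across() has a .names argument for naming the outputs. The sketch below is one possible variation on a made-up tibble; the column names and the " (proportion)" suffix are illustrative assumptions.

```r
library(dplyr)

# Made-up data in the style of the vacc "Percent" columns
toy <- tibble(
  state = c("A", "B"),
  `Percent one dose` = c(74.6, 65.7),
  `Percent fully vaccinated` = c(62.7, 56.8)
)

toy_prop <- toy %>%
  mutate(across(
    .cols = contains("Percent"),
    .fns = function(x) x / 100,
    .names = "{.col} (proportion)"  # write results to new columns instead of overwriting
  ))

names(toy_prop)  # the three original columns plus two new "(proportion)" columns
```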

2.4

Use across and mutate to convert all columns starting with the word “Total” into a binary variable: TRUE if the value is greater than 10,000,000 and FALSE if less than or equal to 10,000,000. Hint: use starts_with() to select the columns starting with “Total”. Use a “function on the fly” to do a logical test if the value is greater than 10,000,000.

vacc %>%
  mutate(across(
    .cols = starts_with("Total"),
    .fns = function(x) x > 10000000
  ))
## # A tibble: 64 × 125
##    State/Territory/Federal Entit…¹ Total Doses Delivere…² Doses Delivered per …³
##    <chr>                           <lgl>                                   <dbl>
##  1 United States                   TRUE                                   194167
##  2 Alaska                          FALSE                                  185553
##  3 Alabama                         FALSE                                  175845
##  4 Arkansas                        FALSE                                  177298
##  5 American Samoa                  FALSE                                  179165
##  6 Arizona                         TRUE                                   180044
##  7 Bureau of Prisons               FALSE                                      NA
##  8 California                      TRUE                                   201694
##  9 Colorado                        TRUE                                   194586
## 10 Connecticut                     FALSE                                  218700
## # ℹ 54 more rows
## # ℹ abbreviated names: ¹​`State/Territory/Federal Entity`,
## #   ²​`Total Doses Delivered`, ³​`Doses Delivered per 100K`
## # ℹ 122 more variables: `18+ Doses Delivered per 100K` <dbl>,
## #   `Total Doses Administered by State where Administered` <lgl>,
## #   `Doses Administered per 100k by State where Administered` <dbl>,
## #   `18+ Doses Administered by State where Administered` <dbl>, …
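
Two edge cases of the logical test are worth checking on a small example: a value of exactly 10,000,000 is FALSE because the comparison is strict, and a missing value stays NA rather than becoming FALSE. The tibble toy below is made up for illustration.

```r
library(dplyr)

# Made-up data: one value above the cutoff, one exactly at it, one missing
toy <- tibble(
  entity = c("A", "B", "C"),
  `Total Doses Delivered` = c(25000000, 10000000, NA)
)

toy_flag <- toy %>%
  mutate(across(
    .cols = starts_with("Total"),
    .fns = function(x) x > 10000000
  ))

toy_flag$`Total Doses Delivered`  # TRUE FALSE NA
```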

Practice on Your Own!

P.1

Take your code from question 2.4 and assign it to the variable vacc_dat.

  • Use filter() to drop any rows where “United States” appears in State/Territory/Federal Entity. Make sure to reassign this to vacc_dat.
  • Create a ggplot boxplot (geom_boxplot()) where (1) the x-axis is Total Doses Delivered and (2) the y-axis is Percent of fully vaccinated people with booster doses.
  • Change the labs() layer so that the x-axis label is “Total Doses Delivered: Greater than 10,000,000”.

vacc_dat <-
  vacc %>%
  mutate(across(
    .cols = starts_with("Total"),
    .fns = function(x) x > 10000000
  )) %>%
  filter(`State/Territory/Federal Entity` != "United States")

vacc_boxplot <- function(df) {
  ggplot(df) +
    geom_boxplot(aes(
      x = `Total Doses Delivered`,
      y = `Percent of fully vaccinated people with booster doses`
    )) +
    labs(x = "Total Doses Delivered: Greater than 10,000,000")
}
vacc_boxplot(vacc_dat)
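
Since the x variable is now logical, any rows where it is NA may show up as their own NA category on the x-axis. If that is not wanted, one option is to filter those rows out before plotting. The sketch below uses a made-up tibble; the names toy, flag, and pct are hypothetical stand-ins for the vacc_dat columns.

```r
library(dplyr)
library(ggplot2)

# Made-up data: one row has a missing TRUE/FALSE flag
toy <- tibble(
  flag = c(TRUE, FALSE, NA),
  pct = c(40.1, 35.5, 38.0)
)

# Keep only rows where the flag is known
toy_complete <- toy %>%
  filter(!is.na(flag))

# Same plot structure as vacc_boxplot(), built on the filtered data
p <- ggplot(toy_complete) +
  geom_boxplot(aes(x = flag, y = pct)) +
  labs(x = "Total Doses Delivered: Greater than 10,000,000")
```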