Load all the libraries we will use in this lab.
library(readr)
library(dplyr)
library(ggplot2)
Create a function that takes one argument, a vector, and returns the
square of the sum of the vector. Call it “sum_squared”. Test your
function on the vector c(2,7,21,30,90); you should get the answer 22500.
# General format
NEW_FUNCTION <- function(x, y) x + y
or
# General format
NEW_FUNCTION <- function(x, y) {
  result <- x + y
  return(result)
}
nums <- c(2, 7, 21, 30, 90)
sum_squared <- function(x) sum(x)^2
sum_squared(x = nums)
## [1] 22500
sum_squared <- function(x) {
  out <- sum(x)^2
  return(out)
}
sum_squared(x = nums)
## [1] 22500
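The order of operations matters here: sum_squared() sums first and squares second, which is not the same as summing the squared values. A minimal sketch of the difference, reusing the same vector:

```r
nums <- c(2, 7, 21, 30, 90)

# Sum first, then square: this is what sum_squared() computes.
sum(nums)^2   # 150^2 = 22500

# Square each element first, then sum: a different quantity.
sum(nums^2)   # 4 + 49 + 441 + 900 + 8100 = 9494
```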
Create a function that takes two arguments, (1) a vector and (2) a
numeric value. This function tests whether the number (2) is contained
within the vector (1). Hint: use %in%. Call it has_n. Test your function
on the vector c(2,7,21,30,90) and number 21; you should get the answer
TRUE.
nums <- c(2, 7, 21, 30, 90)
a_num <- 21
has_n <- function(x, n) n %in% x
has_n(x = nums, n = a_num)
## [1] TRUE
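Because %in% is vectorized over its left-hand side, has_n() returns one logical value per element when n holds several numbers. A small sketch of that behavior; collapsing the result with any() or all() is only one possible way to use it and is not part of the lab:

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n) n %in% x

has_n(x = nums, n = c(21, 11))        # TRUE FALSE
any(has_n(x = nums, n = c(21, 11)))   # TRUE: at least one value is present
all(has_n(x = nums, n = c(21, 11)))   # FALSE: 11 is missing
```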
Amend the function has_n from question 1.2 so that it takes a default
value of 21 for the numeric argument.
nums <- c(2, 7, 21, 30, 90)
a_num <- 21
has_n <- function(x, n = 21) n %in% x
has_n(x = nums)
## [1] TRUE
Create a new number b_num that is not contained in nums. Use your
updated has_n function with the default value and supply b_num as the n
argument when calling the function. What is the outcome?
b_num <- 11
has_n(x = nums, n = b_num)
## [1] FALSE
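An explicitly supplied n always overrides the default. As a rough illustration, base R's formals() can be used to inspect the default, and arguments can also be matched by position rather than by name:

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n = 21) n %in% x

formals(has_n)$n   # the default value stored in the function, 21
has_n(nums)        # TRUE: n falls back to the default of 21
has_n(nums, 11)    # FALSE: the positional 11 replaces the default
```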
Read in the SARS-CoV-2 Vaccination data from http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv. Assign the data the name “vacc”.
vacc <- read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv")
# If downloaded
# vacc <- read_csv("USA_covid19_vaccinations.csv")
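The two options above (URL or downloaded file) can be combined. The sketch below is one possible approach, not part of the lab: read_local_or_url() is a hypothetical helper name, and show_col_types assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical helper (not part of the lab): prefer a local copy of the
# file and fall back to the URL only if the local file does not exist.
read_local_or_url <- function(path, url) {
  src <- if (file.exists(path)) path else url
  read_csv(src, show_col_types = FALSE)
}

vacc_url <- "http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv"
# vacc <- read_local_or_url("USA_covid19_vaccinations.csv", vacc_url)
```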
We want to get some summary statistics on the Moderna vaccines. Use
across inside summarize to get the total number of vaccine doses for
every variable containing the word “Moderna” AND starting with “Total”.
Hint: use contains() AND starts_with() to select the right columns
inside across. Keep in mind that the data include a row for the United
States as a whole, so these sums are not totally accurate! Remember that
NA values can influence calculations.
# General format
data %>%
  summarize(across(
    .cols = {vector or tidyselect},
    .fns = {some function},
    {additional arguments}
  ))
vacc %>%
  summarize(across(
    .cols = contains("Moderna") & starts_with("Total"),
    .fns = sum
  ))
## # A tibble: 1 × 4
## Total Number of Moderna doses …¹ Total Number of Mode…² Total Count People w…³
## <dbl> <dbl> <dbl>
## 1 482227080 403816391 NA
## # ℹ abbreviated names: ¹`Total Number of Moderna doses delivered`,
## # ²`Total Number of Moderna doses administered`,
## # ³`Total Count People w/Booster Primary Moderna Minus TX`
## # ℹ 1 more variable:
## # `Total Count People w/Booster Booster Moderna Minus TX` <dbl>
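The NA in the third column appears because sum() returns NA as soon as any element is missing. A tiny sketch with made-up numbers shows the effect and the usual remedy:

```r
x <- c(10, NA, 5)

sum(x)                 # NA: the missing value propagates
sum(x, na.rm = TRUE)   # 15: missing values are dropped before summing
```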
vacc %>%
  summarize(across(
    .cols = contains("Moderna") & starts_with("Total"),
    # Pass na.rm through an anonymous function; supplying it via `...`
    # is deprecated as of dplyr 1.1.0.
    .fns = function(x) sum(x, na.rm = TRUE)
  ))
## # A tibble: 1 × 4
## Total Number of Moderna doses …¹ Total Number of Mode…² Total Count People w…³
## <dbl> <dbl> <dbl>
## 1 482227080 403816391 29736587
## # ℹ abbreviated names: ¹`Total Number of Moderna doses delivered`,
## # ²`Total Number of Moderna doses administered`,
## # ³`Total Count People w/Booster Primary Moderna Minus TX`
## # ℹ 1 more variable:
## # `Total Count People w/Booster Booster Moderna Minus TX` <dbl>
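These totals still include the “United States” row. One way to avoid counting the national figure alongside the individual entries is to filter that row out before summarizing. The sketch below uses made-up toy data (toy and toy_total are hypothetical names), not the real vacc values:

```r
library(dplyr)

# Made-up toy data: the first row is the national total of the other two.
toy <- tibble(
  `State/Territory/Federal Entity` = c("United States", "Alaska", "Alabama"),
  `Total Number of Moderna doses delivered` = c(30, 10, 20)
)

toy_total <- toy %>%
  filter(`State/Territory/Federal Entity` != "United States") %>%
  summarize(across(
    .cols = contains("Moderna") & starts_with("Total"),
    .fns = function(x) sum(x, na.rm = TRUE)
  ))

toy_total   # 30, rather than the 60 obtained when the national row is kept
```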
Use across and mutate to convert all columns containing the word
“Percent” into proportions (i.e., divide that value by 100). Hint: use
contains() to select the right columns within across(). Use a “function
on the fly” to divide by 100. It will also be easier to check your work
if you select() columns that match “Percent”.
vacc %>%
  mutate(across(
    .cols = contains("Percent"),
    .fns = function(x) x / 100
  )) %>%
  select(contains("Percent"))
## # A tibble: 64 × 34
## Percent of Total Pop with at …¹ Percent of 18+ Pop w…² Percent of Total Pop…³
## <dbl> <dbl> <dbl>
## 1 0.746 0.866 0.627
## 2 0.657 0.776 0.568
## 3 0.593 0.711 0.482
## 4 0.636 0.751 0.518
## 5 0.876 0.95 0.76
## 6 0.684 0.792 0.577
## 7 NA NA NA
## 8 0.843 0.95 0.67
## 9 0.754 0.869 0.668
## 10 0.905 0.95 0.755
## # ℹ 54 more rows
## # ℹ abbreviated names:
## # ¹`Percent of Total Pop with at least One Dose by State of Residence`,
## # ²`Percent of 18+ Pop with at least One Dose by State of Residence`,
## # ³`Percent of Total Pop Fully Vaccinated by State of Residence`
## # ℹ 31 more variables:
## # `Percent of 18+ Pop Fully Vaccinated by State of Residence` <dbl>, …
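Note that mutate(across()) overwrites the original columns, so the names still say “Percent” even though the values are now proportions. If keeping both versions is useful, the .names argument of across() can write the results to new columns instead. A sketch on made-up toy data (toy, toy_prop, and the naming pattern are assumptions for illustration):

```r
library(dplyr)

# Made-up toy data standing in for vacc.
toy <- tibble(
  state = c("A", "B"),
  `Percent vaccinated` = c(74.6, 65.7)
)

toy_prop <- toy %>%
  mutate(across(
    .cols = contains("Percent"),
    .fns = function(x) x / 100,
    # {.col} is replaced by the original column name
    .names = "{.col} (proportion)"
  ))

names(toy_prop)
```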
Use across and mutate to convert all columns starting with the word
“Total” into a binary variable: TRUE if the value is greater than
10,000,000 and FALSE if less than or equal to 10,000,000. Hint: use
starts_with() to select the columns starting with “Total”. Use a
“function on the fly” to do a logical test if the value is greater than
10,000,000.
vacc %>%
  mutate(across(
    .cols = starts_with("Total"),
    .fns = function(x) x > 10000000
  ))
## # A tibble: 64 × 125
## State/Territory/Federal Entit…¹ Total Doses Delivere…² Doses Delivered per …³
## <chr> <lgl> <dbl>
## 1 United States TRUE 194167
## 2 Alaska FALSE 185553
## 3 Alabama FALSE 175845
## 4 Arkansas FALSE 177298
## 5 American Samoa FALSE 179165
## 6 Arizona TRUE 180044
## 7 Bureau of Prisons FALSE NA
## 8 California TRUE 201694
## 9 Colorado TRUE 194586
## 10 Connecticut FALSE 218700
## # ℹ 54 more rows
## # ℹ abbreviated names: ¹`State/Territory/Federal Entity`,
## # ²`Total Doses Delivered`, ³`Doses Delivered per 100K`
## # ℹ 122 more variables: `18+ Doses Delivered per 100K` <dbl>,
## # `Total Doses Administered by State where Administered` <lgl>,
## # `Doses Administered per 100k by State where Administered` <dbl>,
## # `18+ Doses Administered by State where Administered` <dbl>, …
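A comparison against a missing value yields NA rather than FALSE, so the converted columns can hold three states: TRUE, FALSE, and NA. A short sketch on made-up toy data (toy and toy_flag are hypothetical names):

```r
library(dplyr)

# Made-up toy data: one large value, one small value, one missing value.
toy <- tibble(
  `Total Doses Delivered` = c(25000000, 900000, NA)
)

toy_flag <- toy %>%
  mutate(across(
    .cols = starts_with("Total"),
    .fns = function(x) x > 10000000
  ))

toy_flag$`Total Doses Delivered`                      # TRUE FALSE NA
sum(toy_flag$`Total Doses Delivered`, na.rm = TRUE)   # 1 row above the cutoff
```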
Take your code from question 2.4 and assign it to the variable vacc_dat.
Use filter() to drop any rows where “United States” appears in
State/Territory/Federal Entity. Make sure to reassign this to vacc_dat.
Create a boxplot (geom_boxplot()) where (1) the x-axis is
Total Doses Delivered and (2) the y-axis is
Percent of fully vaccinated people with booster doses.
Add a labs() layer so that the x-axis is
“Total Doses Delivered: Greater than 10,000,000”.
vacc_dat <-
  vacc %>%
  mutate(across(
    .cols = starts_with("Total"),
    .fns = function(x) x > 10000000
  )) %>%
  filter(`State/Territory/Federal Entity` != "United States")
vacc_boxplot <- function(df) {
  ggplot(df) +
    geom_boxplot(aes(
      x = `Total Doses Delivered`,
      y = `Percent of fully vaccinated people with booster doses`
    )) +
    labs(x = "Total Doses Delivered: Greater than 10,000,000")
}
vacc_boxplot(vacc_dat)
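Wrapping the plot in a function also makes it possible to reuse the default-argument idea from earlier in the lab. The sketch below is a hypothetical variant on made-up toy data: toy_boxplot(), its x_label argument, and the columns big and pct are illustrative names, not part of the lab or the vacc data.

```r
library(ggplot2)

# Made-up toy data: a logical grouping column and a numeric outcome.
toy <- data.frame(
  big = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  pct = c(48.1, 52.3, 40.2, 44.7, 39.9)
)

# Hypothetical variant: the x-axis label is an argument with a default,
# so callers can override it without editing the function body.
toy_boxplot <- function(df,
                        x_label = "Total Doses Delivered: Greater than 10,000,000") {
  ggplot(df) +
    geom_boxplot(aes(x = big, y = pct)) +
    labs(x = x_label)
}

toy_boxplot(toy)                           # uses the default label
toy_boxplot(toy, x_label = "Large state")  # overrides the label
```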