Data in this lab comes from the CDC (https://covid.cdc.gov/covid-data-tracker/#vaccinations_vacc-total-admin-rate-total - snapshot from January 12, 2022) and the Bureau of Economic Analysis (https://www.bea.gov/data/income-saving/personal-income-by-state).
library(tidyverse)
Read in the SARS-CoV-2 Vaccination data from http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv. You can use the url or download the data. Assign the data the name “vacc”. We will be reviewing new concepts here and incorporating some from week 1.
Use read_csv() from the readr package rather than read.csv().
vacc <- read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv")
## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# If downloaded
# vacc <- read_csv("USA_covid19_vaccinations.csv")
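The column-type message printed above can be silenced with the `show_col_types = FALSE` argument that the message itself mentions. A minimal sketch, assuming readr 2.0 or later (where literal text can be passed with `I()`); the inline CSV and the names `toy_csv` and `toy` are made up for illustration and stand in for the real file:

```r
library(readr)

# Made-up two-row CSV standing in for the real vaccination file
toy_csv <- "State/Territory/Federal Entity,Total Doses Delivered\nMaryland,100\nVirginia,200"

# I() marks the string as literal data rather than a file path (readr >= 2.0)
toy <- read_csv(I(toy_csv), show_col_types = FALSE)

dim(toy)       # number of rows and columns
colnames(toy)  # names are kept as written, spaces and slashes included
```

Because read_csv() keeps column names as written, names with spaces or slashes have to be wrapped in backticks when used later.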
Look at the column names using colnames. Do you notice any patterns?
colnames(vacc)
## [1] "State/Territory/Federal Entity"
## [2] "Total Doses Delivered"
## [3] "Doses Delivered per 100K"
## [4] "18+ Doses Delivered per 100K"
## [5] "Total Doses Administered by State where Administered"
## [6] "Doses Administered per 100k by State where Administered"
## [7] "18+ Doses Administered by State where Administered"
## [8] "18+ Doses Administered per 100K by State where Administered"
## [9] "People with at least One Dose by State of Residence"
## [10] "Percent of Total Pop with at least One Dose by State of Residence"
## [11] "People 18+ with at least One Dose by State of Residence"
## [12] "Percent of 18+ Pop with at least One Dose by State of Residence"
## [13] "People Fully Vaccinated by State of Residence"
## [14] "Percent of Total Pop Fully Vaccinated by State of Residence"
## [15] "People 18+ Fully Vaccinated by State of Residence"
## [16] "Percent of 18+ Pop Fully Vaccinated by State of Residence"
## [17] "Total Number of Pfizer doses delivered"
## [18] "Total Number of Moderna doses delivered"
## [19] "Total Number of Janssen doses delivered"
## [20] "Total Number of doses from Other manufacturer delivered"
## [21] "Total Number of Janssen doses administered"
## [22] "Total Number of Moderna doses administered"
## [23] "Total Number of Pfizer doses adminstered"
## [24] "Total Number of doses from Other manufacturer administered"
## [25] "People Fully Vaccinated Moderna Resident"
## [26] "People Fully Vaccinated Pfizer Resident"
## [27] "People Fully Vaccinated Janssen Resident"
## [28] "People Fully Vaccinated Other 2-dose manufacturer Resident"
## [29] "People 18+ Fully Vaccinated Moderna Resident"
## [30] "People 18+ Fully Vaccinated Pfizer Resident"
## [31] "People 18+ Fully Vaccinated Janssen Resident"
## [32] "People 18+ Fully Vaccinated Other 2-dose manufacturer Resident"
## [33] "People with 2 Doses by State of Residence"
## [34] "Percent of Total Pop with 1+ Doses by State of Residence"
## [35] "People 18+ with 1+ Doses by State of Residence"
## [36] "Percent of 18+ Pop with 1+ Doses by State of Residence"
## [37] "Percent of Total Pop with 2 Doses by State of Residence"
## [38] "People 18+ with 2 Doses by State of Residence"
## [39] "Percent of 18+ Pop with 2 Doses by State of Residence"
## [40] "People with 1+ Doses by State of Residence"
## [41] "People 65+ with at least One Dose by State of Residence"
## [42] "Percent of 65+ Pop with at least One Dose by State of Residence"
## [43] "People 65+ Fully Vaccinated by State of Residence"
## [44] "Percent of 65+ Pop Fully Vaccinated by State of Residence"
## [45] "People 65+ Fully Vaccinated_Moderna_Resident"
## [46] "People 65+ Fully Vaccinated_Pfizer_Resident"
## [47] "People 65+ Fully Vaccinated_Janssen_Resident"
## [48] "People 65+ Fully Vaccinated_Other 2-dose Manuf_Resident"
## [49] "65+ Doses Administered by State where Administered"
## [50] "Doses Administered per 100k of 65+ pop by State where Administered"
## [51] "Doses Delivered per 100k of 65+ pop"
## [52] "People 12+ with at least One Dose by State of Residence"
## [53] "Percent of 12+ Pop with at least One Dose by State of Residence"
## [54] "People 12+ Fully Vaccinated by State of Residence"
## [55] "Percent of 12+ Pop Fully Vaccinated by State of Residence"
## [56] "People 12+ Fully Vaccinated_Moderna_Resident"
## [57] "People 12+ Fully Vaccinated_Pfizer_Resident"
## [58] "People 12+ Fully Vaccinated_Janssen_Resident"
## [59] "People 12+ Fully Vaccinated_Other 2-dose Manuf_Resident"
## [60] "12+ Doses Administered by State where Administered"
## [61] "Doses Administered per 100k of 12+ pop by State where Administered"
## [62] "Doses Delivered per 100k of 12+ pop"
## [63] "People 5+ with at least One Dose by State of Residence"
## [64] "Percent of 5+ Pop with at least One Dose by State of Residence"
## [65] "People 5+ Fully Vaccinated by State of Residence"
## [66] "Percent of 5+ Pop Fully Vaccinated by State of Residence"
## [67] "People 5+ Fully Vaccinated_Moderna_Resident"
## [68] "People 5+ Fully Vaccinated_Pfizer_Resident"
## [69] "People 5+ Fully Vaccinated_Janssen_Resident"
## [70] "People 5+ Fully Vaccinated_Other 2-dose Manuf_Resident"
## [71] "5+ Doses Administered by State where Administered"
## [72] "Doses Administered per 100k of 5+ pop by State where Administered"
## [73] "Doses Delivered per 100k of 5+ pop"
## [74] "People who have received a booster dose"
## [75] "Percent of fully vaccinated people with booster doses"
## [76] "People 18+ who have received a booster dose"
## [77] "Percent of fully vaccinated people 18+ with booster doses"
## [78] "People 50+ who have received a booster dose"
## [79] "Percent of fully vaccinated people 50+ with booster doses"
## [80] "People 65+ who have received a booster dose"
## [81] "Percent of fully vaccinated people 65+ with booster doses"
## [82] "People with Moderna booster dose"
## [83] "People with Pfizer booster dose"
## [84] "People with Janssen booster dose"
## [85] "People with booster dose of an Other manufacturer"
## [86] "Total Count People w/Booster Primary Pfizer Minus TX"
## [87] "Total Count People w/Booster Primary Moderna Minus TX"
## [88] "Total Count People w/Booster Primary J&J Minus TX"
## [89] "Total Count People w/Booster Primary Other Minus TX"
## [90] "Total Count People w/Booster Booster Pfizer Minus TX"
## [91] "Total Count People w/Booster Booster Moderna Minus TX"
## [92] "Total Count People w/Booster Booster J&J Minus TX"
## [93] "Total Count People w/Booster Booster Other Minus TX"
## [94] "Count People Primary Pfizer Booster Pfizer"
## [95] "Count People Primary Pfizer Booster Moderna"
## [96] "Count People Primary Pfizer Booster J&J"
## [97] "Count People Primary Pfizer Booster Uknown"
## [98] "Count People Primary Moderna Booster Pfizer"
## [99] "Count People Primary Moderna Booster Moderna"
## [100] "Count People Primary Moderna Booster J&J"
## [101] "Count People Primary Moderna Booster Uknown"
## [102] "Count People Primary J&J Booster Pfizer"
## [103] "Count People Primary J&J Booster Moderna"
## [104] "Count People Primary J&J Booster J&J"
## [105] "Count People Primary J&J Booster Other"
## [106] "Count People Primary Other Booster Pfizer"
## [107] "Count People Primary Other Booster Moderna"
## [108] "Count People Primary Other Booster J&J"
## [109] "Count People Primary Other Booster Other"
## [110] "Percent People Primary Pfizer Booster Pfizer"
## [111] "Percent People Primary Pfizer Booster Moderna"
## [112] "Percent People Primary Pfizer Booster J&J"
## [113] "Percent People Primary Pfizer Booster Other"
## [114] "Percent People Primary Moderna Booster Pfizer"
## [115] "Percent People Primary Moderna Booster Moderna"
## [116] "Percent People Primary Moderna Booster J&J"
## [117] "Percent People Primary Moderna Booster Other"
## [118] "Percent People Primary J&J Booster Pfizer"
## [119] "Percent People Primary J&J Booster Moderna"
## [120] "Percent People Primary J&J Booster J&J"
## [121] "Percent People Primary J&J Booster Other"
## [122] "Percent People Primary Other Booster Pfizer"
## [123] "Percent People Primary Other Booster Moderna"
## [124] "Percent People Primary Other Booster J&J"
## [125] "Percent People Primary Other Booster Uknown"
# Looks like many start with "Percent" and some start with "Total" - this indicates there are different units of measure for these different variables!
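One way to check the pattern by counting rather than by eye is to tabulate the first word of each column name. This is only a sketch in base R; the vector `nms` is a made-up four-name subset, and with the real data it could be replaced by `colnames(vacc)`:

```r
# Made-up subset of the column names, for illustration only
nms <- c(
  "Total Doses Delivered",
  "Percent of Total Pop Fully Vaccinated by State of Residence",
  "Percent of 18+ Pop Fully Vaccinated by State of Residence",
  "People with Moderna booster dose"
)

# Drop everything from the first space onward, leaving the first word
first_word <- sub(" .*", "", nms)

# Count how often each first word occurs
prefix_counts <- table(first_word)
prefix_counts
```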
Let’s rename the column “State/Territory/Federal Entity” in “vacc” to “Entity” using rename. Make sure to reassign to vacc here and in subsequent steps.
# General format
new_data <- old_data %>% rename(newname = oldname)
vacc <- vacc %>% rename(Entity = `State/Territory/Federal Entity`)
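The backticks matter because the old name contains slashes and so is not a syntactic R name. A small sketch on a made-up tibble (`toy` and its values are illustrative, not the lab data):

```r
library(dplyr)
library(tibble)

# Made-up data with one awkward column name
toy <- tibble(
  `State/Territory/Federal Entity` = c("Maryland", "Virginia"),
  `Percent A` = c(81.6, 80.2)
)

# rename() takes new_name = old_name; the old name needs backticks
renamed <- toy %>% rename(Entity = `State/Territory/Federal Entity`)

colnames(renamed)
```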
Select only the column “Entity” and the columns that start with “Percent”. Use select and starts_with("Percent").
# General format
new_data <- old_data %>% select(colname1, colname2, ...)
vacc <- vacc %>% select(Entity, starts_with("Percent"))
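As a quick illustration of how starts_with() picks columns by prefix, here is a sketch on a made-up tibble (`toy` and its values are assumptions for the example). Note that starts_with() ignores case by default:

```r
library(dplyr)
library(tibble)

# Made-up data: two "Percent" columns and one "Total" column
toy <- tibble(
  Entity = c("Maryland", "Virginia"),
  `Percent of Total Pop` = c(71.1, 68.7),
  `Total Doses Delivered` = c(100, 200),
  `Percent of 18+ Pop` = c(81.6, 78.5)
)

# Keep Entity plus every column whose name begins with "Percent"
kept <- toy %>% select(Entity, starts_with("Percent"))

colnames(kept)
```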
Create a new dataset “vacc_long” by applying pivot_longer() to all columns except “Entity”. Remember that !Entity means all columns except “Entity”.
# General format
new_data <- old_data %>% pivot_longer(cols = colname(s))
vacc_long <- vacc %>% pivot_longer(cols = !Entity)
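To see what pivot_longer() produces when no names are supplied, the following sketch reshapes a made-up two-row tibble (`toy` and its values are illustrative). The old column names land in a column called `name` and the cell contents in `value`, which is why those names appear in later steps:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up wide data: one row per entity, one column per measure
toy <- tibble(
  Entity = c("Maryland", "Virginia"),
  `Percent A` = c(81.6, 80.2),
  `Percent B` = c(92.9, 91.3)
)

# Every column except Entity is stacked into name/value pairs
toy_long <- toy %>% pivot_longer(cols = !Entity)

toy_long
```

Two entities times two measures gives four rows in the long form.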
Using vacc_long, filter the “Entity” column so it only includes values in the following list: “Maryland”, “Virginia”, “Mississippi”, “Massachusetts”, “United States”. Hint: use filter and %in%.
# General format
new_data <- old_data %>% filter(colname %in% c(1, 2, 3, ...))
vacc_long <- vacc_long %>%
  filter(Entity %in% c("Maryland", "Virginia", "Mississippi", "Massachusetts", "United States"))
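The %in% operator returns one TRUE or FALSE per element on its left-hand side, which is what filter() needs. A sketch with made-up rows (`toy_long`, `keep`, and the values are assumptions for the example):

```r
library(dplyr)
library(tibble)

# Made-up long data with four entities
toy_long <- tibble(
  Entity = c("Maryland", "Alaska", "Virginia", "Kansas"),
  name = "Percent A",
  value = c(81.6, 70.0, 80.2, 72.0)
)

keep <- c("Maryland", "Virginia")

# One logical value per row: is this Entity in the keep list?
in_list <- toy_long$Entity %in% keep

# filter() keeps the rows where the condition is TRUE
filtered <- toy_long %>% filter(Entity %in% keep)

filtered
```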
Use pivot_wider to reshape “vacc_long”. Use “Entity” for the names_from argument. Use “value” for the values_from argument. Call this new data vacc_wide. Look at the data. How do these states compare to one another?
# General format
new_data <- old_data %>% pivot_wider(names_from = column1, values_from = column2)
vacc_wide <- vacc_long %>%
  pivot_wider(
    names_from = Entity,
    values_from = value
  )
vacc_wide
## # A tibble: 34 × 6
## name `United States` Massachusetts Maryland Mississippi Virginia
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Percent of Total… 74.6 92.4 81.6 56.4 80.2
## 2 Percent of 18+ P… 86.6 95 92.9 68.1 91.3
## 3 Percent of Total… 62.7 75.3 71.1 48.8 68.7
## 4 Percent of 18+ P… 73.4 84 81.6 59.5 78.5
## 5 Percent of Total… 74.6 92.4 81.6 56.4 80.2
## 6 Percent of 18+ P… 86.6 95 92.9 68.1 91.3
## 7 Percent of Total… 59.1 71.9 68 46.7 65.4
## 8 Percent of 18+ P… 68.7 79.7 77.6 56.7 74.3
## 9 Percent of 65+ P… 95 95 95 89.8 95
## 10 Percent of 65+ P… 88 93 92.6 82.1 90.1
## # ℹ 24 more rows
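The reshaping above turns each entity into its own column, with one row per measure. The same step on a made-up long tibble (`toy_long` and its values are illustrative) makes the mechanics easier to follow:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up long data: two entities, two measures each
toy_long <- tibble(
  Entity = c("Maryland", "Maryland", "Virginia", "Virginia"),
  name = c("Percent A", "Percent B", "Percent A", "Percent B"),
  value = c(81.6, 92.9, 80.2, 91.3)
)

# Each distinct Entity becomes a column; the cells are filled from value
toy_wide <- toy_long %>%
  pivot_wider(names_from = Entity, values_from = value)

toy_wide
```

The new columns appear in the order in which the entities first occur in the long data.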
Take the code from Questions 1.1 and 1.3-1.7. Chain all of this code together using the pipe %>%. Call your data vacc_compare.
vacc_compare <-
  read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv") %>%
  rename(Entity = `State/Territory/Federal Entity`) %>%
  select(Entity, starts_with("Percent")) %>%
  pivot_longer(cols = !Entity) %>%
  filter(Entity %in% c("Maryland", "Virginia", "Mississippi", "Massachusetts", "United States")) %>%
  pivot_wider(names_from = Entity, values_from = value)
## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vacc_compare
## # A tibble: 34 × 6
## name `United States` Massachusetts Maryland Mississippi Virginia
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Percent of Total… 74.6 92.4 81.6 56.4 80.2
## 2 Percent of 18+ P… 86.6 95 92.9 68.1 91.3
## 3 Percent of Total… 62.7 75.3 71.1 48.8 68.7
## 4 Percent of 18+ P… 73.4 84 81.6 59.5 78.5
## 5 Percent of Total… 74.6 92.4 81.6 56.4 80.2
## 6 Percent of 18+ P… 86.6 95 92.9 68.1 91.3
## 7 Percent of Total… 59.1 71.9 68 46.7 65.4
## 8 Percent of 18+ P… 68.7 79.7 77.6 56.7 74.3
## 9 Percent of 65+ P… 95 95 95 89.8 95
## 10 Percent of 65+ P… 88 93 92.6 82.1 90.1
## # ℹ 24 more rows
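Modify the code from Question P.1: select the columns that start with “Total” instead of “Percent”, filter for “Alaska”, “Kansas”, “California”, and “United States”, and call your data vacc_compare2.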
vacc_compare2 <-
  read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv") %>%
  rename(Entity = `State/Territory/Federal Entity`) %>%
  select(Entity, starts_with("Total")) %>%
  pivot_longer(cols = !Entity) %>%
  filter(Entity %in% c("Alaska", "Kansas", "California", "United States")) %>%
  pivot_wider(names_from = Entity, values_from = value)
## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vacc_compare2
## # A tibble: 18 × 5
## name `United States` Alaska California Kansas
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Total Doses Delivered 644652095 1357405 79693945 5368725
## 2 Total Doses Administered by State… 522482674 1043804 67991446 4182762
## 3 Total Number of Pfizer doses deli… 377143375 747845 47476225 3066565
## 4 Total Number of Moderna doses del… 237709320 526160 28618920 2051060
## 5 Total Number of Janssen doses del… 29799400 83400 3598800 251100
## 6 Total Number of doses from Other … 0 0 0 0
## 7 Total Number of Janssen doses adm… 17863666 41559 2230377 129839
## 8 Total Number of Moderna doses adm… 198923979 405736 25742821 1607286
## 9 Total Number of Pfizer doses admi… 305145563 595640 40002071 2442144
## 10 Total Number of doses from Other … 549466 869 16177 3493
## 11 Total Count People w/Booster Prim… 37991785 NA NA NA
## 12 Total Count People w/Booster Prim… 29736587 NA NA NA
## 13 Total Count People w/Booster Prim… 4007501 NA NA NA
## 14 Total Count People w/Booster Prim… NA NA NA NA
## 15 Total Count People w/Booster Boos… 38875578 NA NA NA
## 16 Total Count People w/Booster Boos… 31851324 NA NA NA
## 17 Total Count People w/Booster Boos… 1069038 NA NA NA
## 18 Total Count People w/Booster Boos… 16852 NA NA NA
Read in the GDP and Personal Income Data from http://jhudatascience.org/intro_to_r/data/gdp_personal_income.csv. You can use the url or download the data. Call it “gdp”.
gdp <- read_csv("http://jhudatascience.org/intro_to_r/data/gdp_personal_income.csv")
## Rows: 180 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): GeoName, Description
## dbl (1): 2020
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# If downloaded
# gdp <- read_csv("gdp_personal_income.csv")
Use pivot_wider to reshape “gdp”. Use “Description” for the names_from argument. Use “2020” for the values_from argument. Reassign this data to “gdp”. You will need backticks (``) around 2020.
# General format
new_data <- old_data %>% pivot_wider(names_from = column1, values_from = column2)
gdp <- gdp %>%
  pivot_wider(
    names_from = Description,
    values_from = `2020`
  )
Join the data. Keep only data that is found in both “vacc” and “gdp”. Try the join without the by argument - what happens? Then use by = c("Entity" = "GeoName").
# General format
new_data <- inner_join(x, y, by = c("colname1" = "colname2"))
# merged <- inner_join(vacc, gdp) does not work!
merged <- inner_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(merged)
## [1] 51
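To illustrate why the join needs a by argument here, the sketch below uses two made-up tibbles (`x`, `y`, and their values are assumptions) whose key columns have different names. With no shared column names, dplyr raises an error; tryCatch() is used only so that the example keeps running:

```r
library(dplyr)
library(tibble)

# Made-up tables whose key columns have different names
x <- tibble(Entity = c("Maryland", "Virginia"), pct = c(81.6, 80.2))
y <- tibble(GeoName = c("Maryland", "Plains"), income = c(1, 2))

# No common column names, so the join errors; capture that as a string
no_by <- tryCatch(
  inner_join(x, y),
  error = function(e) "error: no common columns"
)

# Naming the key pair explicitly works and keeps the name from x
with_by <- inner_join(x, y, by = c("Entity" = "GeoName"))

no_by
with_by
```

Only "Maryland" appears in both tables, so the inner join returns a single row.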
Change your code from Question 10 to do a full_join. Call the output “full”. How many observations (rows) are there?
# General format
new_data <- full_join(x, y, by = c("colname1" = "colname2"))
full <- full_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(full)
## [1] 73
Do a left join of “vacc” and “gdp”. Call the output “left”. How many observations are there?
left <- left_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(left)
## [1] 64
Copy your code from Question P.3 and change it to a right_join with the same order of the arguments. Call the output “right”. How many observations are there?
right <- right_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(right)
## [1] 60
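The row counts of the four joins are related. A sketch on made-up tables (`x`, `y`, and their keys are illustrative, and each key is assumed to occur only once per table) shows the bookkeeping:

```r
library(dplyr)
library(tibble)

# Made-up tables: keys B and C match, A is only in x, D and E only in y
x <- tibble(Entity = c("A", "B", "C"), pct = 1:3)
y <- tibble(GeoName = c("B", "C", "D", "E"), income = 1:4)
by_cols <- c("Entity" = "GeoName")

n_inner <- nrow(inner_join(x, y, by = by_cols))  # matched keys only
n_left  <- nrow(left_join(x, y, by = by_cols))   # every row of x
n_right <- nrow(right_join(x, y, by = by_cols))  # every row of y
n_full  <- nrow(full_join(x, y, by = by_cols))   # every key from either table

# Rows with no partner on the other side
n_x_only <- nrow(anti_join(x, y, by = by_cols))
n_y_only <- nrow(anti_join(y, x, by = c("GeoName" = "Entity")))

c(inner = n_inner, left = n_left, right = n_right, full = n_full)
```

With unique keys, the full join has as many rows as the inner join plus the unmatched rows from each side. The lab's numbers follow the same pattern: 51 + 13 + 9 = 73.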
Perform two anti_join operations on “vacc” and “gdp” to determine which entities are missing from the GDP data and which are missing from the vaccine data.
# General format
anti_join(L, R, by = c("name_L" = "name_R")) %>% select(name_L)
in_vacc_only <- anti_join(vacc, gdp, by = c("Entity" = "GeoName")) %>% select(Entity)
in_gdp_only <- anti_join(gdp, vacc, by = c("GeoName" = "Entity")) %>% select(GeoName)
in_vacc_only
## # A tibble: 13 × 1
## Entity
## <chr>
## 1 American Samoa
## 2 Bureau of Prisons
## 3 Dept of Defense
## 4 Federated States of Micronesia
## 5 Guam
## 6 Indian Health Svc
## 7 Marshall Islands
## 8 Northern Mariana Islands
## 9 New York State
## 10 Puerto Rico
## 11 Republic of Palau
## 12 Veterans Health
## 13 Virgin Islands
in_gdp_only
## # A tibble: 9 × 1
## GeoName
## <chr>
## 1 New York
## 2 New England
## 3 Mideast
## 4 Great Lakes
## 5 Plains
## 6 Southeast
## 7 Southwest
## 8 Rocky Mountain
## 9 Far West
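The two lists suggest that “New York State” in the vaccine data and “New York” in the GDP data fail to match only because of the label. Assuming they refer to the same state, one possible fix is to recode the label before joining. This is a sketch on made-up tibbles (`vacc_toy`, `gdp_toy`, and their values are hypothetical), not part of the lab's answer:

```r
library(dplyr)
library(tibble)

# Made-up tables reproducing the label mismatch
vacc_toy <- tibble(Entity = c("New York State", "Maryland"), pct = c(1, 2))
gdp_toy  <- tibble(GeoName = c("New York", "Maryland"), income = c(10, 20))

# Before recoding, one vaccine row has no partner in the GDP table
unmatched_before <- anti_join(vacc_toy, gdp_toy, by = c("Entity" = "GeoName"))

# Recode the label (assumption: both labels mean the same state)
vacc_fixed <- vacc_toy %>%
  mutate(Entity = if_else(Entity == "New York State", "New York", Entity))

# After recoding, every vaccine row finds a partner
unmatched_after <- anti_join(vacc_fixed, gdp_toy, by = c("Entity" = "GeoName"))

nrow(unmatched_before)
nrow(unmatched_after)
```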