Data in this lab comes from the CDC (https://covid.cdc.gov/covid-data-tracker/#vaccinations_vacc-total-admin-rate-total - snapshot from January 12, 2022) and the Bureau of Economic Analysis (https://www.bea.gov/data/income-saving/personal-income-by-state).

library(readr)
library(dplyr)
library(tidyr)

Part 1

1.1

Read in the SARS-CoV-2 Vaccination data from http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv. You can use the URL or download the data. Assign the data the name “vacc”. This lab introduces new concepts and revisits some from week 1.

  • Remember to use read_csv() from the readr package.
  • Do NOT use read.csv().
vacc <- read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv")
## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl   (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# If downloaded
# vacc <- read_csv("USA_covid19_vaccinations.csv")
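
As the message above notes, the column specification printout can be turned off. The sketch below assumes readr 2.0 or later (needed for literal input via `I()`) and uses a tiny made-up CSV string rather than the lab file, so it runs without a download. The names `toy_csv` and `toy` are placeholders.

```r
library(readr)

# Hypothetical two-row CSV supplied as literal text via I(); not the lab data
toy_csv <- I("Entity,Percent Vaccinated\nMaryland,71.1\nVirginia,68.7\n")

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)

# spec() still retrieves the full column specification on demand
spec(toy)
```

The same argument can be added to the `read_csv()` call for the vaccination file if the message is not wanted.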

1.2

Look at the column names using colnames - do you notice any patterns?

colnames(vacc)
##   [1] "State/Territory/Federal Entity"                                    
##   [2] "Total Doses Delivered"                                             
##   [3] "Doses Delivered per 100K"                                          
##   [4] "18+ Doses Delivered per 100K"                                      
##   [5] "Total Doses Administered by State where Administered"              
##   [6] "Doses Administered per 100k by State where Administered"           
##   [7] "18+ Doses Administered by State where Administered"                
##   [8] "18+ Doses Administered per 100K by State where Administered"       
##   [9] "People with at least One Dose by State of Residence"               
##  [10] "Percent of Total Pop with at least One Dose by State of Residence" 
##  [11] "People 18+ with at least One Dose by State of Residence"           
##  [12] "Percent of 18+ Pop with at least One Dose by State of Residence"   
##  [13] "People Fully Vaccinated by State of Residence"                     
##  [14] "Percent of Total Pop Fully Vaccinated by State of Residence"       
##  [15] "People 18+ Fully Vaccinated by State of Residence"                 
##  [16] "Percent of 18+ Pop Fully Vaccinated by State of Residence"         
##  [17] "Total Number of Pfizer doses delivered"                            
##  [18] "Total Number of Moderna doses delivered"                           
##  [19] "Total Number of Janssen doses delivered"                           
##  [20] "Total Number of doses from Other manufacturer delivered"           
##  [21] "Total Number of Janssen doses administered"                        
##  [22] "Total Number of Moderna doses administered"                        
##  [23] "Total Number of Pfizer doses adminstered"                          
##  [24] "Total Number of doses from Other manufacturer administered"        
##  [25] "People Fully Vaccinated Moderna Resident"                          
##  [26] "People Fully Vaccinated Pfizer Resident"                           
##  [27] "People Fully Vaccinated Janssen Resident"                          
##  [28] "People Fully Vaccinated Other 2-dose manufacturer Resident"        
##  [29] "People 18+ Fully Vaccinated Moderna Resident"                      
##  [30] "People 18+ Fully Vaccinated Pfizer Resident"                       
##  [31] "People 18+ Fully Vaccinated Janssen Resident"                      
##  [32] "People 18+ Fully Vaccinated Other 2-dose manufacturer Resident"    
##  [33] "People with 2 Doses by State of Residence"                         
##  [34] "Percent of Total Pop with 1+ Doses by State of Residence"          
##  [35] "People 18+ with 1+ Doses by State of Residence"                    
##  [36] "Percent of 18+ Pop with 1+ Doses by State of Residence"            
##  [37] "Percent of Total Pop with 2 Doses by State of Residence"           
##  [38] "People 18+ with 2 Doses by State of Residence"                     
##  [39] "Percent of 18+ Pop with 2 Doses by State of Residence"             
##  [40] "People with 1+ Doses by State of Residence"                        
##  [41] "People 65+ with at least One Dose by State of Residence"           
##  [42] "Percent of 65+ Pop with at least One Dose by State of Residence"   
##  [43] "People 65+ Fully Vaccinated by State of Residence"                 
##  [44] "Percent of 65+ Pop Fully Vaccinated by State of Residence"         
##  [45] "People 65+ Fully Vaccinated_Moderna_Resident"                      
##  [46] "People 65+ Fully Vaccinated_Pfizer_Resident"                       
##  [47] "People 65+ Fully Vaccinated_Janssen_Resident"                      
##  [48] "People 65+ Fully Vaccinated_Other 2-dose Manuf_Resident"           
##  [49] "65+ Doses Administered by State where Administered"                
##  [50] "Doses Administered per 100k of 65+ pop by State where Administered"
##  [51] "Doses Delivered per 100k of 65+ pop"                               
##  [52] "People 12+ with at least One Dose by State of Residence"           
##  [53] "Percent of 12+ Pop with at least One Dose by State of Residence"   
##  [54] "People 12+ Fully Vaccinated by State of Residence"                 
##  [55] "Percent of 12+ Pop Fully Vaccinated by State of Residence"         
##  [56] "People 12+ Fully Vaccinated_Moderna_Resident"                      
##  [57] "People 12+ Fully Vaccinated_Pfizer_Resident"                       
##  [58] "People 12+ Fully Vaccinated_Janssen_Resident"                      
##  [59] "People 12+ Fully Vaccinated_Other 2-dose Manuf_Resident"           
##  [60] "12+ Doses Administered by State where Administered"                
##  [61] "Doses Administered per 100k of 12+ pop by State where Administered"
##  [62] "Doses Delivered per 100k of 12+ pop"                               
##  [63] "People 5+ with at least One Dose by State of Residence"            
##  [64] "Percent of 5+ Pop with at least One Dose by State of Residence"    
##  [65] "People 5+ Fully Vaccinated by State of Residence"                  
##  [66] "Percent of 5+ Pop Fully Vaccinated by State of Residence"          
##  [67] "People 5+ Fully Vaccinated_Moderna_Resident"                       
##  [68] "People 5+ Fully Vaccinated_Pfizer_Resident"                        
##  [69] "People 5+ Fully Vaccinated_Janssen_Resident"                       
##  [70] "People 5+ Fully Vaccinated_Other 2-dose Manuf_Resident"            
##  [71] "5+ Doses Administered by State where Administered"                 
##  [72] "Doses Administered per 100k of 5+ pop  by State where Administered"
##  [73] "Doses Delivered per 100k of 5+ pop"                                
##  [74] "People who have received a booster dose"                           
##  [75] "Percent of fully vaccinated people with booster doses"             
##  [76] "People 18+ who have received a booster dose"                       
##  [77] "Percent of fully vaccinated people 18+ with booster doses"         
##  [78] "People 50+ who have received a booster dose"                       
##  [79] "Percent of fully vaccinated people 50+ with booster doses"         
##  [80] "People 65+ who have received a booster dose"                       
##  [81] "Percent of fully vaccinated people 65+ with booster doses"         
##  [82] "People with Moderna booster dose"                                  
##  [83] "People with Pfizer booster dose"                                   
##  [84] "People with Janssen booster dose"                                  
##  [85] "People with booster dose of an Other manufacturer"                 
##  [86] "Total Count People w/Booster Primary Pfizer Minus TX"              
##  [87] "Total Count People w/Booster Primary Moderna Minus TX"             
##  [88] "Total Count People w/Booster Primary J&J Minus TX"                 
##  [89] "Total Count People w/Booster Primary Other Minus TX"               
##  [90] "Total Count People w/Booster Booster Pfizer Minus TX"              
##  [91] "Total Count People w/Booster Booster Moderna Minus TX"             
##  [92] "Total Count People w/Booster Booster J&J Minus TX"                 
##  [93] "Total Count People w/Booster Booster Other Minus TX"               
##  [94] "Count People Primary Pfizer Booster Pfizer"                        
##  [95] "Count People Primary Pfizer Booster Moderna"                       
##  [96] "Count People Primary Pfizer Booster J&J"                           
##  [97] "Count People Primary Pfizer Booster Uknown"                        
##  [98] "Count People Primary Moderna Booster Pfizer"                       
##  [99] "Count People Primary Moderna Booster Moderna"                      
## [100] "Count People Primary Moderna Booster J&J"                          
## [101] "Count People Primary Moderna Booster Uknown"                       
## [102] "Count People Primary J&J Booster Pfizer"                           
## [103] "Count People Primary J&J Booster Moderna"                          
## [104] "Count People Primary J&J Booster J&J"                              
## [105] "Count People Primary J&J Booster Other"                            
## [106] "Count People Primary Other Booster Pfizer"                         
## [107] "Count People Primary Other Booster Moderna"                        
## [108] "Count People Primary Other Booster J&J"                            
## [109] "Count People Primary Other Booster Other"                          
## [110] "Percent People Primary Pfizer Booster Pfizer"                      
## [111] "Percent People Primary Pfizer Booster Moderna"                     
## [112] "Percent People Primary Pfizer Booster J&J"                         
## [113] "Percent People Primary Pfizer Booster Other"                       
## [114] "Percent People Primary Moderna Booster Pfizer"                     
## [115] "Percent People Primary Moderna Booster Moderna"                    
## [116] "Percent People Primary Moderna Booster J&J"                        
## [117] "Percent People Primary Moderna Booster Other"                      
## [118] "Percent People Primary J&J Booster Pfizer"                         
## [119] "Percent People Primary J&J Booster Moderna"                        
## [120] "Percent People Primary J&J Booster J&J"                            
## [121] "Percent People Primary J&J Booster Other"                          
## [122] "Percent People Primary Other Booster Pfizer"                       
## [123] "Percent People Primary Other Booster Moderna"                      
## [124] "Percent People Primary Other Booster J&J"                          
## [125] "Percent People Primary Other Booster Uknown"
# Looks like many start with "Percent" and some start with "Total" - this indicates there are different units of measure for these different variables!
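
One way to check the pattern rather than eyeball it is to tabulate the first word of each column name. This is only a sketch: `nms` is a short stand-in vector built from a few names copied out of the output above, and with the real data it could be replaced by `colnames(vacc)`.

```r
# A few column names copied from the output above, as a small stand-in vector
nms <- c(
  "Total Doses Delivered",
  "Doses Delivered per 100K",
  "Percent of Total Pop Fully Vaccinated by State of Residence",
  "Percent of 18+ Pop Fully Vaccinated by State of Residence",
  "People with Moderna booster dose"
)

# Drop everything from the first space onward, leaving the first word
first_word <- sub(" .*", "", nms)

# Count how many names begin with each word
table(first_word)
```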

1.3

Let’s rename the column “State/Territory/Federal Entity” in “vacc” to “Entity” using rename. Make sure to reassign to vacc here and in subsequent steps.

# General format
new_data <- old_data %>% rename(newname = oldname)
vacc <- vacc %>% rename(Entity = `State/Territory/Federal Entity`)
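
The backticks matter because the old name contains slashes and spaces, which R would otherwise try to parse as operators and separate symbols. A minimal sketch on a made-up two-row tibble (`toy` and `Percent A` are placeholders, not part of the lab data):

```r
library(dplyr)

# Toy tibble with a non-syntactic column name (contains "/" and spaces)
toy <- tibble(
  `State/Territory/Federal Entity` = c("Maryland", "Virginia"),
  `Percent A` = c(81.6, 80.2)
)

# Backticks make R read the whole name as one symbol.
# In rename(), the new name goes on the left and the old name on the right.
toy <- toy %>% rename(Entity = `State/Territory/Federal Entity`)

colnames(toy)
```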

1.4

Select only the column “Entity” and the columns that start with “Percent”. Use select and starts_with("Percent").

# General format
new_data <- old_data %>% select(colname1, colname2, ...)
vacc <- vacc %>% select(Entity, starts_with("Percent"))
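
A detail worth knowing is that `starts_with()` ignores case by default. The sketch below shows this on a made-up tibble; the lowercase column and its values are invented for illustration only.

```r
library(dplyr)

toy <- tibble(
  Entity = c("Maryland", "Virginia"),
  `Percent of Total Pop Fully Vaccinated` = c(71.1, 68.7),
  `Total Doses Delivered` = c(100, 200),   # made-up values
  `percent lower case` = c(1, 2)           # made-up column
)

# Default: case-insensitive, so "Percent..." and "percent..." both match
loose <- toy %>% select(Entity, starts_with("Percent")) %>% colnames()

# ignore.case = FALSE restricts the match to the exact capitalization
strict <- toy %>%
  select(Entity, starts_with("Percent", ignore.case = FALSE)) %>%
  colnames()

loose
strict
```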

1.5

Create a new dataset “vacc_long” that does pivot_longer() on all columns except “Entity”. Remember that !Entity means all columns except “Entity”.

# General format
new_data <- old_data %>% pivot_longer(cols = colname(s))
vacc_long <- vacc %>% pivot_longer(cols = !Entity)
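
To see what the reshaping does, here is a sketch on a made-up tibble with two entities and two percent columns. By default `pivot_longer()` puts the old column names in a column called `name` and the cell values in a column called `value`. `toy` and its column names are placeholders.

```r
library(dplyr)
library(tidyr)

toy <- tibble(
  Entity = c("Maryland", "Virginia"),
  `Percent A` = c(81.6, 80.2),
  `Percent B` = c(71.1, 68.7)
)

# Pivot every column except Entity: 2 rows x 2 pivoted columns gives 4 rows
toy_long <- toy %>% pivot_longer(cols = !Entity)

toy_long
```

The arguments `names_to` and `values_to` can be used to choose different names for the two new columns.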

1.6

Using vacc_long, filter the “Entity” column so it only includes values in the following list: “Maryland”, “Virginia”, “Mississippi”, “Massachusetts”, “United States”. Hint: use filter and %in%.

# General format
new_data <- old_data %>% filter(colname %in% c(1, 2, 3, ...))
vacc_long <- vacc_long %>%
  filter(Entity %in% c("Maryland", "Virginia", "Mississippi", "Massachusetts", "United States"))
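
The hint points to `%in%` rather than `==` because `==` compares element by element and silently recycles the shorter vector. A small sketch with made-up values shows the difference; `toy_long` and `keep` are placeholders.

```r
library(dplyr)

toy_long <- tibble(
  Entity = c("Maryland", "Virginia", "Texas", "Maryland"),
  value  = c(81.6, 80.2, 50, 71.1)   # illustrative values only
)
keep <- c("Maryland", "Virginia")

# %in% asks, for each row, whether Entity appears anywhere in keep
in_rows <- toy_long %>% filter(Entity %in% keep)

# == recycles keep to length 4 and compares position by position,
# so the second "Maryland" is compared with "Virginia" and dropped
eq_rows <- toy_long %>% filter(Entity == keep)

nrow(in_rows)  # 3
nrow(eq_rows)  # 2
```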

1.7

Use pivot_wider to reshape “vacc_long”. Use “Entity” for the names_from argument. Use “value” for the values_from argument. Call this new data vacc_wide. Look at the data. How do these states compare to one another?

# General format
new_data <- old_data %>% pivot_wider(names_from = column1, values_from = column2)
vacc_wide <- vacc_long %>%
  pivot_wider(
    names_from = Entity,
    values_from = value
  )
vacc_wide
## # A tibble: 34 × 6
##    name              `United States` Massachusetts Maryland Mississippi Virginia
##    <chr>                       <dbl>         <dbl>    <dbl>       <dbl>    <dbl>
##  1 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  2 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  3 Percent of Total…            62.7          75.3     71.1        48.8     68.7
##  4 Percent of 18+ P…            73.4          84       81.6        59.5     78.5
##  5 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  6 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  7 Percent of Total…            59.1          71.9     68          46.7     65.4
##  8 Percent of 18+ P…            68.7          79.7     77.6        56.7     74.3
##  9 Percent of 65+ P…            95            95       95          89.8     95  
## 10 Percent of 65+ P…            88            93       92.6        82.1     90.1
## # ℹ 24 more rows
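
The reshaping can be followed on a small scale. The sketch below uses a made-up long table with two entities and two measures; after `pivot_wider()` each entity becomes its own column and each measure becomes one row. All names are placeholders.

```r
library(dplyr)
library(tidyr)

toy_long <- tibble(
  Entity = c("Maryland", "Maryland", "Virginia", "Virginia"),
  name   = c("Percent A", "Percent B", "Percent A", "Percent B"),
  value  = c(81.6, 71.1, 80.2, 68.7)
)

# One new column per distinct Entity, filled from value
toy_wide <- toy_long %>%
  pivot_wider(names_from = Entity, values_from = value)

toy_wide
```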

Practice on Your Own!

P.1

Take the code from Questions 1.1 and 1.3-1.7. Chain all of this code together using the pipe %>%. Call your data vacc_compare.

vacc_compare <-
  read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv") %>%
  rename(Entity = `State/Territory/Federal Entity`) %>%
  select(Entity, starts_with("Percent")) %>%
  pivot_longer(cols = !Entity) %>%
  filter(Entity %in% c("Maryland", "Virginia", "Mississippi", "Massachusetts", "United States")) %>%
  pivot_wider(names_from = Entity, values_from = value)
## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl   (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vacc_compare
## # A tibble: 34 × 6
##    name              `United States` Massachusetts Maryland Mississippi Virginia
##    <chr>                       <dbl>         <dbl>    <dbl>       <dbl>    <dbl>
##  1 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  2 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  3 Percent of Total…            62.7          75.3     71.1        48.8     68.7
##  4 Percent of 18+ P…            73.4          84       81.6        59.5     78.5
##  5 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  6 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  7 Percent of Total…            59.1          71.9     68          46.7     65.4
##  8 Percent of 18+ P…            68.7          79.7     77.6        56.7     74.3
##  9 Percent of 65+ P…            95            95       95          89.8     95  
## 10 Percent of 65+ P…            88            93       92.6        82.1     90.1
## # ℹ 24 more rows

P.2

Modify the code from Question P.1:

  • Look for columns that start with “Total” (instead of “Percent”)
  • Select different states/Entities to compare
  • Call your data vacc_compare2
vacc_compare2 <-
  read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv") %>%
  rename(Entity = `State/Territory/Federal Entity`) %>%
  select(Entity, starts_with("Total")) %>%
  pivot_longer(cols = !Entity) %>%
  filter(Entity %in% c("Alaska", "Kansas", "California", "United States")) %>%
  pivot_wider(names_from = Entity, values_from = value)
## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl   (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vacc_compare2
## # A tibble: 18 × 5
##    name                               `United States`  Alaska California  Kansas
##    <chr>                                        <dbl>   <dbl>      <dbl>   <dbl>
##  1 Total Doses Delivered                    644652095 1357405   79693945 5368725
##  2 Total Doses Administered by State…       522482674 1043804   67991446 4182762
##  3 Total Number of Pfizer doses deli…       377143375  747845   47476225 3066565
##  4 Total Number of Moderna doses del…       237709320  526160   28618920 2051060
##  5 Total Number of Janssen doses del…        29799400   83400    3598800  251100
##  6 Total Number of doses from Other …               0       0          0       0
##  7 Total Number of Janssen doses adm…        17863666   41559    2230377  129839
##  8 Total Number of Moderna doses adm…       198923979  405736   25742821 1607286
##  9 Total Number of Pfizer doses admi…       305145563  595640   40002071 2442144
## 10 Total Number of doses from Other …          549466     869      16177    3493
## 11 Total Count People w/Booster Prim…        37991785      NA         NA      NA
## 12 Total Count People w/Booster Prim…        29736587      NA         NA      NA
## 13 Total Count People w/Booster Prim…         4007501      NA         NA      NA
## 14 Total Count People w/Booster Prim…              NA      NA         NA      NA
## 15 Total Count People w/Booster Boos…        38875578      NA         NA      NA
## 16 Total Count People w/Booster Boos…        31851324      NA         NA      NA
## 17 Total Count People w/Booster Boos…         1069038      NA         NA      NA
## 18 Total Count People w/Booster Boos…           16852      NA         NA      NA

Part 2

2.1

Read in the GDP and Personal Income Data from http://jhudatascience.org/intro_to_r/data/gdp_personal_income.csv. You can use the URL or download the data. Call it “gdp”.

gdp <- read_csv("http://jhudatascience.org/intro_to_r/data/gdp_personal_income.csv")
## Rows: 180 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): GeoName, Description
## dbl (1): 2020
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# If downloaded
# gdp <- read_csv("gdp_personal_income.csv")

2.2

Use pivot_wider to reshape “gdp”. Use “Description” for the names_from argument. Use “2020” for the values_from argument. Reassign this data to “gdp”.

# General format
new_data <- old_data %>% pivot_wider(names_from = column1, values_from = column2)
gdp <- gdp %>%
  pivot_wider(
    names_from = Description,
    values_from = `2020`
  )
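
The column named 2020 needs backticks because a bare 2020 would be read as a number rather than a column name. A sketch on a made-up table shaped like “gdp” follows; the Description values “Measure A” and “Measure B” and all numbers are invented.

```r
library(dplyr)
library(tidyr)

# Made-up long table: one row per GeoName and Description
gdp_toy <- tibble(
  GeoName = c("Maryland", "Maryland", "Virginia", "Virginia"),
  Description = c("Measure A", "Measure B", "Measure A", "Measure B"),
  `2020` = c(10, 20, 30, 40)
)

# Each distinct Description becomes a column, leaving one row per GeoName
gdp_toy_wide <- gdp_toy %>%
  pivot_wider(names_from = Description, values_from = `2020`)

gdp_toy_wide
```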

2.3

Join the data. Keep only data that is found in both “vacc” and “gdp”.

  • First, try joining without using the by argument - what happens?
  • Next, try joining using by = c("Entity" = "GeoName").
  • Call the output “merged”. How many observations (rows) are there?
# General format
new_data <- inner_join(x, y, by = c("colname1" = "colname2"))
# merged <- inner_join(vacc, gdp) does not work! The two datasets share no column names, so `by` is needed.
merged <- inner_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(merged)
## [1] 51
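
Both behaviors can be reproduced on a small scale. This is a sketch with made-up tables (`vacc_toy`, `gdp_toy`, and their values are placeholders); `tryCatch()` is used only so that the failing call does not stop the script.

```r
library(dplyr)

vacc_toy <- tibble(Entity  = c("Maryland", "Virginia", "Guam"),
                   pct     = c(71.1, 68.7, 50))
gdp_toy  <- tibble(GeoName = c("Maryland", "Virginia", "Plains"),
                   income  = c(1, 2, 3))

# Without `by`, dplyr looks for shared column names; there are none, so it errors
msg <- tryCatch(
  inner_join(vacc_toy, gdp_toy),
  error = function(e) conditionMessage(e)
)
msg

# Naming the key on each side lets the join proceed; only matching rows are kept
merged_toy <- inner_join(vacc_toy, gdp_toy, by = c("Entity" = "GeoName"))
merged_toy
```

The joined table keeps the key under the name used in the first table, here `Entity`.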

2.4

Change your code from Question 2.3 to do a full_join. Call the output “full”. How many observations (rows) are there?

# General format
new_data <- full_join(x, y, by = c("colname1" = "colname2"))
full <- full_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(full)
## [1] 73

Practice on Your Own!

P.3

Do a left join of “vacc” and “gdp”. Call the output “left”. How many observations are there?

left <- left_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(left)
## [1] 64

P.4

Copy your code from Question P.3 and change it to a right_join with the same order of the arguments. Call the output “right”. How many observations are there?

right <- right_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(right)
## [1] 60
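
The four row counts above are related: when each key value appears at most once in each table, the full join has as many rows as the left join plus the right join minus the inner join, and here 64 + 60 - 51 = 73. The sketch below checks the same identity on made-up tables; all names and values are placeholders.

```r
library(dplyr)

vacc_toy <- tibble(Entity  = c("Maryland", "Virginia", "Guam"),
                   pct     = c(71.1, 68.7, 50))
gdp_toy  <- tibble(GeoName = c("Maryland", "Virginia", "Plains"),
                   income  = c(1, 2, 3))
key <- c("Entity" = "GeoName")

inner_n <- nrow(inner_join(vacc_toy, gdp_toy, by = key))  # rows found in both
left_n  <- nrow(left_join(vacc_toy, gdp_toy, by = key))   # every row of vacc_toy
right_n <- nrow(right_join(vacc_toy, gdp_toy, by = key))  # every row of gdp_toy
full_n  <- nrow(full_join(vacc_toy, gdp_toy, by = key))   # every row of either

c(inner = inner_n, left = left_n, right = right_n, full = full_n)

# Holds because keys are unique within each toy table
full_n == left_n + right_n - inner_n
```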

P.5

Perform two anti_join operations on “vacc” and “gdp” to determine which Entities are missing from the GDP data and which are missing from the vaccine data.

# General format
anti_join(L, R, by = c("name_L" = "name_R")) %>% select(name_L)
in_vacc_only <- anti_join(vacc, gdp, by = c("Entity" = "GeoName")) %>% select(Entity)
in_gdp_only <- anti_join(gdp, vacc, by = c("GeoName" = "Entity")) %>% select(GeoName)
in_vacc_only
## # A tibble: 13 × 1
##    Entity                        
##    <chr>                         
##  1 American Samoa                
##  2 Bureau of Prisons             
##  3 Dept of Defense               
##  4 Federated States of Micronesia
##  5 Guam                          
##  6 Indian Health Svc             
##  7 Marshall Islands              
##  8 Northern Mariana Islands      
##  9 New York State                
## 10 Puerto Rico                   
## 11 Republic of Palau             
## 12 Veterans Health               
## 13 Virgin Islands
in_gdp_only
## # A tibble: 9 × 1
##   GeoName       
##   <chr>         
## 1 New York      
## 2 New England   
## 3 Mideast       
## 4 Great Lakes   
## 5 Plains        
## 6 Southeast     
## 7 Southwest     
## 8 Rocky Mountain
## 9 Far West
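
The two lists suggest that “New York State” in the vaccine data and “New York” in the GDP data refer to the same place under different labels. One possible way to handle such a mismatch, sketched on made-up tables, is to recode the label before joining. This goes beyond what the lab asks for, and the tables and values below are placeholders.

```r
library(dplyr)

vacc_toy <- tibble(Entity  = c("Maryland", "New York State"),
                   pct     = c(71.1, 70))
gdp_toy  <- tibble(GeoName = c("Maryland", "New York"),
                   income  = c(1, 2))

# Before recoding, "New York State" has no partner in gdp_toy
before <- anti_join(vacc_toy, gdp_toy, by = c("Entity" = "GeoName"))

# Recode the one mismatched label, leaving all other values unchanged
vacc_fixed <- vacc_toy %>%
  mutate(Entity = if_else(Entity == "New York State", "New York", Entity))

# After recoding, nothing in vacc_fixed is unmatched
after <- anti_join(vacc_fixed, gdp_toy, by = c("Entity" = "GeoName"))

nrow(before)
nrow(after)
```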