Data in this lab comes from the CDC (https://covid.cdc.gov/covid-data-tracker/#vaccinations_vacc-total-admin-rate-total - snapshot from January 12, 2022) and the Bureau of Economic Analysis (https://www.bea.gov/data/income-saving/personal-income-by-state).

library(tidyverse)

Part 1

1.1

Read in the SARS-CoV-2 Vaccination data from http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv. You can use the url or download the data. Assign the data the name “vacc”. We will be reviewing new concepts here and incorporating some from week 1.

Remember to use read_csv() from the readr package.
Do NOT use read.csv().

vacc <- read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv")

## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl   (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# If downloaded
# vacc <- read_csv("USA_covid19_vaccinations.csv")
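
As an offline illustration of why read_csv() is preferred here, the sketch below parses a made-up two-row CSV from a literal string. The data are invented, and wrapping the text in I() assumes readr 2.0 or later. It suggests two things: read_csv() keeps column names exactly as written (spaces and slashes included), and show_col_types = FALSE quiets the column specification message shown above.

```r
library(readr)

# Invented CSV text standing in for the real file (assumption: readr >= 2.0 for I())
csv_text <- I("State/Territory/Federal Entity,Total Doses Delivered\nMaryland,100\nVirginia,200\n")

# show_col_types = FALSE suppresses the "Column specification" message
toy <- read_csv(csv_text, show_col_types = FALSE)

# Names are kept verbatim; base read.csv() would rewrite them by default
colnames(toy)
```

Because the names are kept verbatim, later steps need backticks around any name that contains spaces or slashes.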

1.2

Look at the column names using colnames - do you notice any patterns?

colnames(vacc)

##   [1] "State/Territory/Federal Entity"                                    
##   [2] "Total Doses Delivered"                                             
##   [3] "Doses Delivered per 100K"                                          
##   [4] "18+ Doses Delivered per 100K"                                      
##   [5] "Total Doses Administered by State where Administered"              
##   [6] "Doses Administered per 100k by State where Administered"           
##   [7] "18+ Doses Administered by State where Administered"                
##   [8] "18+ Doses Administered per 100K by State where Administered"       
##   [9] "People with at least One Dose by State of Residence"               
##  [10] "Percent of Total Pop with at least One Dose by State of Residence" 
##  [11] "People 18+ with at least One Dose by State of Residence"           
##  [12] "Percent of 18+ Pop with at least One Dose by State of Residence"   
##  [13] "People Fully Vaccinated by State of Residence"                     
##  [14] "Percent of Total Pop Fully Vaccinated by State of Residence"       
##  [15] "People 18+ Fully Vaccinated by State of Residence"                 
##  [16] "Percent of 18+ Pop Fully Vaccinated by State of Residence"         
##  [17] "Total Number of Pfizer doses delivered"                            
##  [18] "Total Number of Moderna doses delivered"                           
##  [19] "Total Number of Janssen doses delivered"                           
##  [20] "Total Number of doses from Other manufacturer delivered"           
##  [21] "Total Number of Janssen doses administered"                        
##  [22] "Total Number of Moderna doses administered"                        
##  [23] "Total Number of Pfizer doses adminstered"                          
##  [24] "Total Number of doses from Other manufacturer administered"        
##  [25] "People Fully Vaccinated Moderna Resident"                          
##  [26] "People Fully Vaccinated Pfizer Resident"                           
##  [27] "People Fully Vaccinated Janssen Resident"                          
##  [28] "People Fully Vaccinated Other 2-dose manufacturer Resident"        
##  [29] "People 18+ Fully Vaccinated Moderna Resident"                      
##  [30] "People 18+ Fully Vaccinated Pfizer Resident"                       
##  [31] "People 18+ Fully Vaccinated Janssen Resident"                      
##  [32] "People 18+ Fully Vaccinated Other 2-dose manufacturer Resident"    
##  [33] "People with 2 Doses by State of Residence"                         
##  [34] "Percent of Total Pop with 1+ Doses by State of Residence"          
##  [35] "People 18+ with 1+ Doses by State of Residence"                    
##  [36] "Percent of 18+ Pop with 1+ Doses by State of Residence"            
##  [37] "Percent of Total Pop with 2 Doses by State of Residence"           
##  [38] "People 18+ with 2 Doses by State of Residence"                     
##  [39] "Percent of 18+ Pop with 2 Doses by State of Residence"             
##  [40] "People with 1+ Doses by State of Residence"                        
##  [41] "People 65+ with at least One Dose by State of Residence"           
##  [42] "Percent of 65+ Pop with at least One Dose by State of Residence"   
##  [43] "People 65+ Fully Vaccinated by State of Residence"                 
##  [44] "Percent of 65+ Pop Fully Vaccinated by State of Residence"         
##  [45] "People 65+ Fully Vaccinated_Moderna_Resident"                      
##  [46] "People 65+ Fully Vaccinated_Pfizer_Resident"                       
##  [47] "People 65+ Fully Vaccinated_Janssen_Resident"                      
##  [48] "People 65+ Fully Vaccinated_Other 2-dose Manuf_Resident"           
##  [49] "65+ Doses Administered by State where Administered"                
##  [50] "Doses Administered per 100k of 65+ pop by State where Administered"
##  [51] "Doses Delivered per 100k of 65+ pop"                               
##  [52] "People 12+ with at least One Dose by State of Residence"           
##  [53] "Percent of 12+ Pop with at least One Dose by State of Residence"   
##  [54] "People 12+ Fully Vaccinated by State of Residence"                 
##  [55] "Percent of 12+ Pop Fully Vaccinated by State of Residence"         
##  [56] "People 12+ Fully Vaccinated_Moderna_Resident"                      
##  [57] "People 12+ Fully Vaccinated_Pfizer_Resident"                       
##  [58] "People 12+ Fully Vaccinated_Janssen_Resident"                      
##  [59] "People 12+ Fully Vaccinated_Other 2-dose Manuf_Resident"           
##  [60] "12+ Doses Administered by State where Administered"                
##  [61] "Doses Administered per 100k of 12+ pop by State where Administered"
##  [62] "Doses Delivered per 100k of 12+ pop"                               
##  [63] "People 5+ with at least One Dose by State of Residence"            
##  [64] "Percent of 5+ Pop with at least One Dose by State of Residence"    
##  [65] "People 5+ Fully Vaccinated by State of Residence"                  
##  [66] "Percent of 5+ Pop Fully Vaccinated by State of Residence"          
##  [67] "People 5+ Fully Vaccinated_Moderna_Resident"                       
##  [68] "People 5+ Fully Vaccinated_Pfizer_Resident"                        
##  [69] "People 5+ Fully Vaccinated_Janssen_Resident"                       
##  [70] "People 5+ Fully Vaccinated_Other 2-dose Manuf_Resident"            
##  [71] "5+ Doses Administered by State where Administered"                 
##  [72] "Doses Administered per 100k of 5+ pop  by State where Administered"
##  [73] "Doses Delivered per 100k of 5+ pop"                                
##  [74] "People who have received a booster dose"                           
##  [75] "Percent of fully vaccinated people with booster doses"             
##  [76] "People 18+ who have received a booster dose"                       
##  [77] "Percent of fully vaccinated people 18+ with booster doses"         
##  [78] "People 50+ who have received a booster dose"                       
##  [79] "Percent of fully vaccinated people 50+ with booster doses"         
##  [80] "People 65+ who have received a booster dose"                       
##  [81] "Percent of fully vaccinated people 65+ with booster doses"         
##  [82] "People with Moderna booster dose"                                  
##  [83] "People with Pfizer booster dose"                                   
##  [84] "People with Janssen booster dose"                                  
##  [85] "People with booster dose of an Other manufacturer"                 
##  [86] "Total Count People w/Booster Primary Pfizer Minus TX"              
##  [87] "Total Count People w/Booster Primary Moderna Minus TX"             
##  [88] "Total Count People w/Booster Primary J&J Minus TX"                 
##  [89] "Total Count People w/Booster Primary Other Minus TX"               
##  [90] "Total Count People w/Booster Booster Pfizer Minus TX"              
##  [91] "Total Count People w/Booster Booster Moderna Minus TX"             
##  [92] "Total Count People w/Booster Booster J&J Minus TX"                 
##  [93] "Total Count People w/Booster Booster Other Minus TX"               
##  [94] "Count People Primary Pfizer Booster Pfizer"                        
##  [95] "Count People Primary Pfizer Booster Moderna"                       
##  [96] "Count People Primary Pfizer Booster J&J"                           
##  [97] "Count People Primary Pfizer Booster Uknown"                        
##  [98] "Count People Primary Moderna Booster Pfizer"                       
##  [99] "Count People Primary Moderna Booster Moderna"                      
## [100] "Count People Primary Moderna Booster J&J"                          
## [101] "Count People Primary Moderna Booster Uknown"                       
## [102] "Count People Primary J&J Booster Pfizer"                           
## [103] "Count People Primary J&J Booster Moderna"                          
## [104] "Count People Primary J&J Booster J&J"                              
## [105] "Count People Primary J&J Booster Other"                            
## [106] "Count People Primary Other Booster Pfizer"                         
## [107] "Count People Primary Other Booster Moderna"                        
## [108] "Count People Primary Other Booster J&J"                            
## [109] "Count People Primary Other Booster Other"                          
## [110] "Percent People Primary Pfizer Booster Pfizer"                      
## [111] "Percent People Primary Pfizer Booster Moderna"                     
## [112] "Percent People Primary Pfizer Booster J&J"                         
## [113] "Percent People Primary Pfizer Booster Other"                       
## [114] "Percent People Primary Moderna Booster Pfizer"                     
## [115] "Percent People Primary Moderna Booster Moderna"                    
## [116] "Percent People Primary Moderna Booster J&J"                        
## [117] "Percent People Primary Moderna Booster Other"                      
## [118] "Percent People Primary J&J Booster Pfizer"                         
## [119] "Percent People Primary J&J Booster Moderna"                        
## [120] "Percent People Primary J&J Booster J&J"                            
## [121] "Percent People Primary J&J Booster Other"                          
## [122] "Percent People Primary Other Booster Pfizer"                       
## [123] "Percent People Primary Other Booster Moderna"                      
## [124] "Percent People Primary Other Booster J&J"                          
## [125] "Percent People Primary Other Booster Uknown"

# Many names start with "Percent" and some start with "Total" - so these variables are measured in different units (percentages versus counts)!

1.3

Let’s rename the column “State/Territory/Federal Entity” in “vacc” to “Entity” using rename. Make sure to reassign to vacc here and in subsequent steps.

# General format
new_data <- old_data %>% rename(newname = oldname)

vacc <- vacc %>% rename(Entity = `State/Territory/Federal Entity`)
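
The backticks matter because the old name contains spaces and slashes, so R would not parse it as a plain name. A minimal sketch on an invented two-row tibble (toy and its values are made up for illustration):

```r
library(dplyr)

# Invented toy data with a non-syntactic column name
toy <- tibble(
  `State/Territory/Federal Entity` = c("Maryland", "Virginia"),
  `Percent A` = c(1, 2)
)

# rename(new = old): backticks wrap the old name because it has spaces and slashes
toy <- toy %>% rename(Entity = `State/Territory/Federal Entity`)

colnames(toy)
```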

1.4

Select only the columns “Entity”, and those that start with “Percent”. Use select and starts_with("Percent").

# General format
new_data <- old_data %>% select(colname1, colname2, ...)

vacc <- vacc %>% select(Entity, starts_with("Percent"))
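
To see what starts_with() keeps and drops, here is a small sketch on invented data. Only names that begin with the prefix are matched; a name that merely contains "Percent" is dropped. Note that starts_with() ignores case by default (its ignore.case argument is TRUE).

```r
library(dplyr)

# Invented one-row tibble; column names are made up for illustration
toy <- tibble(
  Entity = "Maryland",
  `Percent of Total Pop` = 81.6,
  `Total Doses Delivered` = 100,
  `Doses as Percent` = 5
)

# Keeps Entity plus every column whose name begins with "Percent"
kept <- toy %>% select(Entity, starts_with("Percent"))

colnames(kept)
```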

1.5

Create a new dataset “vacc_long” that does pivot_longer() on all columns except “Entity”. Remember that !Entity means all columns except “Entity”.

# General format
new_data <- old_data %>% pivot_longer(cols = colname(s))

vacc_long <- vacc %>% pivot_longer(cols = !Entity)
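
A sketch of what this reshaping does, using an invented tibble with two entities and two measures. With no names_to or values_to supplied, pivot_longer() puts the old column names in a column called name and the cell values in a column called value, which is why those two names appear in the later steps.

```r
library(dplyr)
library(tidyr)

# Invented toy data: 2 entities x 2 percent columns
toy <- tibble(
  Entity = c("Maryland", "Virginia"),
  `Percent A` = c(81.6, 80.2),
  `Percent B` = c(71.1, 68.7)
)

# Every column except Entity is stacked into name/value pairs
toy_long <- toy %>% pivot_longer(cols = !Entity)

dim(toy_long)       # 2 entities * 2 measures = 4 rows, 3 columns
colnames(toy_long)
```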

1.6

Using vacc_long, filter the “Entity” column so it only includes values in the following list: “Maryland”, “Virginia”, “Mississippi”, “Massachusetts”, “United States”. Hint: use filter and %in%.

# General format
new_data <- old_data %>% filter(colname %in% c(1, 2, 3, ...))

vacc_long <- vacc_long %>%
  filter(Entity %in% c("Maryland", "Virginia", "Mississippi", "Massachusetts", "United States"))
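
The hint points to %in% rather than == for a reason: == compares element by element and silently recycles the shorter vector, so it can miss matches. A small sketch on invented data (toy_long here is a made-up stand-in, not the lab's object):

```r
library(dplyr)

# Invented long-format data
toy_long <- tibble(
  Entity = c("Maryland", "Virginia", "Ohio", "Maryland"),
  value = 1:4
)
keep <- c("Maryland", "Virginia")

# %in% asks "is each Entity a member of keep?"
with_in <- toy_long %>% filter(Entity %in% keep)

# == recycles keep along Entity, so the last "Maryland" is compared to "Virginia"
with_eq <- toy_long %>% filter(Entity == keep)

nrow(with_in)
nrow(with_eq)
```

In this toy case %in% returns three rows while == returns only two.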

1.7

Use pivot_wider to reshape “vacc_long”. Use “Entity” for the names_from argument. Use “value” for the values_from argument. Call this new data vacc_wide. Look at the data. How do these states compare to one another?

# General format
new_data <- old_data %>% pivot_wider(names_from = column1, values_from = column2)

vacc_wide <- vacc_long %>%
  pivot_wider(
    names_from = Entity,
    values_from = value
  )
vacc_wide

## # A tibble: 34 × 6
##    name              `United States` Massachusetts Maryland Mississippi Virginia
##    <chr>                       <dbl>         <dbl>    <dbl>       <dbl>    <dbl>
##  1 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  2 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  3 Percent of Total…            62.7          75.3     71.1        48.8     68.7
##  4 Percent of 18+ P…            73.4          84       81.6        59.5     78.5
##  5 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  6 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  7 Percent of Total…            59.1          71.9     68          46.7     65.4
##  8 Percent of 18+ P…            68.7          79.7     77.6        56.7     74.3
##  9 Percent of 65+ P…            95            95       95          89.8     95  
## 10 Percent of 65+ P…            88            93       92.6        82.1     90.1
## # ℹ 24 more rows
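
The shape of the result follows from the arguments: one row per distinct name and one column per Entity, plus the name column itself. That matches the 34 × 6 tibble above, which has five entities. A sketch on invented long data:

```r
library(dplyr)
library(tidyr)

# Invented long data: 2 entities x 2 measures
toy_long <- tibble(
  Entity = rep(c("Maryland", "Virginia"), each = 2),
  name = rep(c("Percent A", "Percent B"), times = 2),
  value = c(81.6, 71.1, 80.2, 68.7)
)

# Each Entity becomes its own column; each measure becomes one row
toy_wide <- toy_long %>%
  pivot_wider(names_from = Entity, values_from = value)

dim(toy_wide)
colnames(toy_wide)
```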

Practice on Your Own!

P.1

Take the code from Questions 1.1 and 1.3-1.7. Chain all of this code together using the pipe %>%. Call your data vacc_compare.

vacc_compare <-
  read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv") %>%
  rename(Entity = `State/Territory/Federal Entity`) %>%
  select(Entity, starts_with("Percent")) %>%
  pivot_longer(cols = !Entity) %>%
  filter(Entity %in% c("Maryland", "Virginia", "Mississippi", "Massachusetts", "United States")) %>%
  pivot_wider(names_from = Entity, values_from = value)

## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl   (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

vacc_compare

## # A tibble: 34 × 6
##    name              `United States` Massachusetts Maryland Mississippi Virginia
##    <chr>                       <dbl>         <dbl>    <dbl>       <dbl>    <dbl>
##  1 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  2 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  3 Percent of Total…            62.7          75.3     71.1        48.8     68.7
##  4 Percent of 18+ P…            73.4          84       81.6        59.5     78.5
##  5 Percent of Total…            74.6          92.4     81.6        56.4     80.2
##  6 Percent of 18+ P…            86.6          95       92.9        68.1     91.3
##  7 Percent of Total…            59.1          71.9     68          46.7     65.4
##  8 Percent of 18+ P…            68.7          79.7     77.6        56.7     74.3
##  9 Percent of 65+ P…            95            95       95          89.8     95  
## 10 Percent of 65+ P…            88            93       92.6        82.1     90.1
## # ℹ 24 more rows

P.2

Modify the code from Question P.1:

Look for columns that start with “Total” (instead of “Percent”) and
Select different states/Entities to compare
Call your data vacc_compare2

vacc_compare2 <-
  read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv") %>%
  rename(Entity = `State/Territory/Federal Entity`) %>%
  select(Entity, starts_with("Total")) %>%
  pivot_longer(cols = !Entity) %>%
  filter(Entity %in% c("Alaska", "Kansas", "California", "United States")) %>%
  pivot_wider(names_from = Entity, values_from = value)

## Rows: 64 Columns: 125
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): State/Territory/Federal Entity
## dbl (121): Total Doses Delivered, Doses Delivered per 100K, 18+ Doses Delive...
## lgl   (3): People with 2 Doses by State of Residence, People 18+ with 1+ Dos...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

vacc_compare2

## # A tibble: 18 × 5
##    name                               `United States`  Alaska California  Kansas
##    <chr>                                        <dbl>   <dbl>      <dbl>   <dbl>
##  1 Total Doses Delivered                    644652095 1357405   79693945 5368725
##  2 Total Doses Administered by State…       522482674 1043804   67991446 4182762
##  3 Total Number of Pfizer doses deli…       377143375  747845   47476225 3066565
##  4 Total Number of Moderna doses del…       237709320  526160   28618920 2051060
##  5 Total Number of Janssen doses del…        29799400   83400    3598800  251100
##  6 Total Number of doses from Other …               0       0          0       0
##  7 Total Number of Janssen doses adm…        17863666   41559    2230377  129839
##  8 Total Number of Moderna doses adm…       198923979  405736   25742821 1607286
##  9 Total Number of Pfizer doses admi…       305145563  595640   40002071 2442144
## 10 Total Number of doses from Other …          549466     869      16177    3493
## 11 Total Count People w/Booster Prim…        37991785      NA         NA      NA
## 12 Total Count People w/Booster Prim…        29736587      NA         NA      NA
## 13 Total Count People w/Booster Prim…         4007501      NA         NA      NA
## 14 Total Count People w/Booster Prim…              NA      NA         NA      NA
## 15 Total Count People w/Booster Boos…        38875578      NA         NA      NA
## 16 Total Count People w/Booster Boos…        31851324      NA         NA      NA
## 17 Total Count People w/Booster Boos…         1069038      NA         NA      NA
## 18 Total Count People w/Booster Boos…           16852      NA         NA      NA

Part 2

2.1

Read in the GDP and Personal Income Data from http://jhudatascience.org/intro_to_r/data/gdp_personal_income.csv. You can use the url or download the data. Call it “gdp”.

gdp <- read_csv("http://jhudatascience.org/intro_to_r/data/gdp_personal_income.csv")

## Rows: 180 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): GeoName, Description
## dbl (1): 2020
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# If downloaded
# gdp <- read_csv("gdp_personal_income.csv")

2.2

Use pivot_wider to reshape “gdp”. Use “Description” for the names_from argument. Use “2020” for the values_from argument. Reassign this data to “gdp”.

You will need backticks (``) around 2020, because a column name that starts with a digit is not a valid R name on its own.

# General format
new_data <- old_data %>% pivot_wider(names_from = column1, values_from = column2)

gdp <- gdp %>%
  pivot_wider(
    names_from = Description,
    values_from = `2020`
  )

2.3

Join the data. Keep only data that is found in both “vacc” and “gdp”.

First, try joining without using the by argument - what happens?
Next, try joining using by = c("Entity" = "GeoName").
Call the output “merged”. How many observations (rows) are there?

# General format
new_data <- inner_join(x, y, by = c("colname1" = "colname2"))

# merged <- inner_join(vacc, gdp) does not work: with no `by`, dplyr looks for shared column names, and there are none!
merged <- inner_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(merged)

## [1] 51
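
To reproduce the "without by" failure offline, the sketch below uses two invented tibbles whose key columns have different names. It assumes a recent dplyr, where joining tables with no common column names raises an error; tryCatch() is used only so the script keeps running.

```r
library(dplyr)

# Invented tables: the key is called Entity in one and GeoName in the other
x <- tibble(Entity = c("Maryland", "Guam"), pct = c(81.6, 80))
y <- tibble(GeoName = c("Maryland", "Plains"), income = c(1, 2))

# No shared column names, so the join has nothing to match on
no_by <- tryCatch(inner_join(x, y), error = function(e) "error")

# Naming the key pair explicitly fixes it; the result keeps the left-hand name
joined <- inner_join(x, y, by = c("Entity" = "GeoName"))

no_by
colnames(joined)
```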

2.4

Change your code from Question 2.3 to do a full_join. Call the output “full”. How many observations (rows) are there?

# General format
new_data <- full_join(x, y, by = c("colname1" = "colname2"))

full <- full_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(full)

## [1] 73

Practice on Your Own!

P.3

Do a left join of “vacc” and “gdp”. Call the output “left”. How many observations are there?

left <- left_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(left)

## [1] 64

P.4

Copy your code from Question P.3 and change it to a right_join with the same order of the arguments. Call the output “right”. How many observations are there?

right <- right_join(vacc, gdp, by = c("Entity" = "GeoName"))
nrow(right)

## [1] 60
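
The four row counts are related. When the key values are unique in each table, full = left + right - inner, and the lab's numbers agree: 64 + 60 - 51 = 73. The sketch below checks the same identity on invented tables (the entity names and values are made up).

```r
library(dplyr)

# Invented tables with unique keys: 2 shared, 1 only in x, 2 only in y
x <- tibble(Entity = c("Guam", "Maryland", "Virginia"), pct = c(1, 2, 3))
y <- tibble(GeoName = c("Maryland", "Virginia", "Plains", "Southeast"),
            income = c(10, 20, 30, 40))
key <- c("Entity" = "GeoName")

n_inner <- nrow(inner_join(x, y, by = key))  # rows found in both
n_left  <- nrow(left_join(x, y, by = key))   # every row of x
n_right <- nrow(right_join(x, y, by = key))  # every row of y
n_full  <- nrow(full_join(x, y, by = key))   # every row of either

c(inner = n_inner, left = n_left, right = n_right, full = n_full)
n_full == n_left + n_right - n_inner
```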

P.5

Perform two anti_join operations on “vacc” and “gdp” to determine which Entities are missing from the GDP data and which are missing from the vaccine data.

# General format
anti_join(L, R, by = c("name_L" = "name_R")) %>% select(name_L)

in_vacc_only <- anti_join(vacc, gdp, by = c("Entity" = "GeoName")) %>% select(Entity)
in_gdp_only <- anti_join(gdp, vacc, by = c("GeoName" = "Entity")) %>% select(GeoName)
in_vacc_only

## # A tibble: 13 × 1
##    Entity                        
##    <chr>                         
##  1 American Samoa                
##  2 Bureau of Prisons             
##  3 Dept of Defense               
##  4 Federated States of Micronesia
##  5 Guam                          
##  6 Indian Health Svc             
##  7 Marshall Islands              
##  8 Northern Mariana Islands      
##  9 New York State                
## 10 Puerto Rico                   
## 11 Republic of Palau             
## 12 Veterans Health               
## 13 Virgin Islands

in_gdp_only

## # A tibble: 9 × 1
##   GeoName       
##   <chr>         
## 1 New York      
## 2 New England   
## 3 Mideast       
## 4 Great Lakes   
## 5 Plains        
## 6 Southeast     
## 7 Southwest     
## 8 Rocky Mountain
## 9 Far West
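
The two lists above hint at a labeling mismatch: “New York State” appears only in the vaccine data and “New York” only in the GDP data. Assuming these refer to the same state (an assumption, not something the lab states), one possible fix is to recode the label before joining. The sketch below shows the idea on invented tables; vacc_toy, gdp_toy, and their values are hypothetical.

```r
library(dplyr)

# Invented stand-ins for the two datasets
vacc_toy <- tibble(Entity = c("Maryland", "New York State", "Guam"),
                   pct = c(81.6, 80, 75))
gdp_toy <- tibble(GeoName = c("Maryland", "New York", "Plains"),
                  income = c(1, 2, 3))
key <- c("Entity" = "GeoName")

# Before recoding: both "New York State" and "Guam" are unmatched
before <- anti_join(vacc_toy, gdp_toy, by = key) %>% pull(Entity)

# Recode the one label (assumed to be the same state), then check again
vacc_fixed <- vacc_toy %>%
  mutate(Entity = if_else(Entity == "New York State", "New York", Entity))
after <- anti_join(vacc_fixed, gdp_toy, by = key) %>% pull(Entity)

before
after
```

Running anti_join() again after a fix like this is a quick way to confirm that only genuinely unmatched entities remain.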

Manipulating Data in R Lab - Key
