In this module, we will show you how to:
- Reshape data from wide (fat) to long (tall)
- Reshape data from long (tall) to wide (fat)
- Merge Data/Joins
In this module, we will show you how to:
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf
https://github.com/gadenbuie/tidyexplain/blob/main/images/tidyr-pivoting.gif
Data is stored differently in the tibble.
Wide: has many columns
# A tibble: 1 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4%
Long: column names become data
# A tibble: 3 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4%
Wide: multiple columns per individual, values spread across multiple columns
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4% 2 Alaska 47.5% 46.2% 41.7%
Long: multiple rows per observation, a single column contains the values
# A tibble: 6 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4% 4 Alaska June_vacc_rate 47.5% 5 Alaska May_vacc_rate 46.2% 6 Alaska April_vacc_rate 41.7%
Data is wide or long with respect to certain variables.
Wide: Easier for humans to read
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4% 2 Alaska 47.5% 46.2% 41.7%
Long: Easier for R to make plots & do analysis
# A tibble: 6 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4% 4 Alaska June_vacc_rate 47.5% 5 Alaska May_vacc_rate 46.2% 6 Alaska April_vacc_rate 41.7%
tidyr allows you to “tidy” your data. We will be talking about:
pivot_longer - make multiple columns into variables, (wide to long)pivot_wider - make a variable into multiple columns, (long to wide)separate - string into multiple columns (review)The reshape command exists. It is a confusing function. Don’t use it.
pivot_longer…tidyr::pivot_longer - puts column data into rows.
names_to = gives a new name to the pivoted columnsvalues_to = gives a new name to the values that used to be in those columns{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot},
names_to = {New column name: contains old column names},
values_to = {New column name: contains cell values})
wide_data
# A tibble: 1 x 3 June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> 1 37.2% 36.0% 32.4%
long_data <- wide_data %>% pivot_longer(cols = everything(),
names_to = "Month",
values_to = "Rate")
long_data
# A tibble: 3 x 2 Month Rate <chr> <chr> 1 June_vacc_rate 37.2% 2 May_vacc_rate 36.0% 3 April_vacc_rate 32.4%
http://jhudatascience.org/intro_to_R_class/data/Charm_City_Circulator_Ridership.csv
circ <- jhur::read_circulator() head(circ, 5)
# A tibble: 5 x 15 day date orangeBoardings orangeAlightings orangeAverage purpleBoardings <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Monday 01/11/… 877 1027 952 NA 2 Tuesday 01/12/… 777 815 796 NA 3 Wednes… 01/13/… 1203 1220 1212. NA 4 Thursd… 01/14/… 1194 1233 1214. NA 5 Friday 01/15/… 1645 1643 1644 NA # … with 9 more variables: purpleAlightings <dbl>, purpleAverage <dbl>, # greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>, # bannerBoardings <dbl>, bannerAlightings <dbl>, bannerAverage <dbl>, # daily <dbl>
long <- circ %>%
pivot_longer(starts_with(c("orange","purple","green","banner")),
names_to = "var",
values_to = "number")
long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # … with 13,742 more rows
There are many ways to select the columns we want. Use ?tidyr_tidy_select to look at more column selection options.
long <- circ %>%
pivot_longer( !c(day, date, daily),
names_to = "var",
values_to = "number")
long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # … with 13,742 more rows
long %>% count(var)
# A tibble: 12 x 2 var n <chr> <int> 1 bannerAlightings 1146 2 bannerAverage 1146 3 bannerBoardings 1146 4 greenAlightings 1146 5 greenAverage 1146 6 greenBoardings 1146 7 orangeAlightings 1146 8 orangeAverage 1146 9 orangeBoardings 1146 10 purpleAlightings 1146 11 purpleAverage 1146 12 purpleBoardings 1146
We will use str_replace from the stringr package to put _ in the names
long <- long %>% mutate( var = str_replace(var, "Board", "_Board"), var = str_replace(var, "Alight", "_Alight"), var = str_replace(var, "Average", "_Average") ) long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orange_Boardings 877 2 Monday 01/11/2010 952 orange_Alightings 1027 3 Monday 01/11/2010 952 orange_Average 952 4 Monday 01/11/2010 952 purple_Boardings NA 5 Monday 01/11/2010 952 purple_Alightings NA 6 Monday 01/11/2010 952 purple_Average NA 7 Monday 01/11/2010 952 green_Boardings NA 8 Monday 01/11/2010 952 green_Alightings NA 9 Monday 01/11/2010 952 green_Average NA 10 Monday 01/11/2010 952 banner_Boardings NA # … with 13,742 more rows
Now each var is Boardings, Averages, or Alightings. We use “into =” to name the new columns and “sep =” to show where the separation should happen.
long <- long %>%
separate(var, into = c("line", "type"), sep = "_")
long
# A tibble: 13,752 x 6 day date daily line type number <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Monday 01/11/2010 952 orange Boardings 877 2 Monday 01/11/2010 952 orange Alightings 1027 3 Monday 01/11/2010 952 orange Average 952 4 Monday 01/11/2010 952 purple Boardings NA 5 Monday 01/11/2010 952 purple Alightings NA 6 Monday 01/11/2010 952 purple Average NA 7 Monday 01/11/2010 952 green Boardings NA 8 Monday 01/11/2010 952 green Alightings NA 9 Monday 01/11/2010 952 green Average NA 10 Monday 01/11/2010 952 banner Boardings NA # … with 13,742 more rows
pivot_wider…tidyr::pivot_wider - spreads row data into columns.
names_from = the old column whose contents will be spread into multiple new column names.values_from = the old column whose contents will fill in the values of those new columns.{wide_data} <- {long_data} %>%
pivot_wider(names_from = {Old column name: contains new column names},
values_from = {Old column name: contains new cell values})
long_data
# A tibble: 3 x 2 Month Rate <chr> <chr> 1 June_vacc_rate 37.2% 2 May_vacc_rate 36.0% 3 April_vacc_rate 32.4%
wide_data <- long_data %>% pivot_wider(names_from = "Month",
values_from = "Rate")
wide_data
# A tibble: 1 x 3 June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> 1 37.2% 36.0% 32.4%
long
# A tibble: 13,752 x 6 day date daily line type number <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Monday 01/11/2010 952 orange Boardings 877 2 Monday 01/11/2010 952 orange Alightings 1027 3 Monday 01/11/2010 952 orange Average 952 4 Monday 01/11/2010 952 purple Boardings NA 5 Monday 01/11/2010 952 purple Alightings NA 6 Monday 01/11/2010 952 purple Average NA 7 Monday 01/11/2010 952 green Boardings NA 8 Monday 01/11/2010 952 green Alightings NA 9 Monday 01/11/2010 952 green Average NA 10 Monday 01/11/2010 952 banner Boardings NA # … with 13,742 more rows
wide <- long %>% pivot_wider(names_from = "type",
values_from = "number")
wide
# A tibble: 4,584 x 7 day date daily line Boardings Alightings Average <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 Monday 01/11/2010 952 orange 877 1027 952 2 Monday 01/11/2010 952 purple NA NA NA 3 Monday 01/11/2010 952 green NA NA NA 4 Monday 01/11/2010 952 banner NA NA NA 5 Tuesday 01/12/2010 796 orange 777 815 796 6 Tuesday 01/12/2010 796 purple NA NA NA 7 Tuesday 01/12/2010 796 green NA NA NA 8 Tuesday 01/12/2010 796 banner NA NA NA 9 Wednesday 01/13/2010 1212. orange 1203 1220 1212. 10 Wednesday 01/13/2010 1212. purple NA NA NA # … with 4,574 more rows
dplyr?join - see different types of joining for dplyrinner_join(x, y) - only rows that match for x and y are keptfull_join(x, y) - all rows of x and y are keptleft_join(x, y) - all rows of x are kept even if not merged with yright_join(x, y) - all rows of y are kept even if not merged with xanti_join(x, y) - all rows from x not in y keeping just columns from x.data_As
# A tibble: 2 x 3 State June_vacc_rate May_vacc_rate <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 2 Alaska 47.5% 46.2%
data_cold
# A tibble: 2 x 2 State April_vacc_rate <chr> <chr> 1 Maine 32.4% 2 Alaska 41.7%
https://github.com/gadenbuie/tidyexplain/blob/main/images/inner-join.gif
ij = inner_join(data_As, data_cold)
Joining, by = "State"
ij
# A tibble: 1 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alaska 47.5% 46.2% 41.7%
https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/left-join.gif
lj = left_join(data_As, data_cold)
Joining, by = "State"
lj
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% <NA> 2 Alaska 47.5% 46.2% 41.7%
tidylog package to log outputs# install.packages("tidylog")
library(tidylog)
left_join(data_As, data_cold)
Joining, by = "State"
left_join: added one column (April_vacc_rate)
> rows only in x 1
> rows only in y (1)
> matched rows 1
> ===
> rows total 2
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% <NA> 2 Alaska 47.5% 46.2% 41.7%
https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/right-join.gif
rj <- right_join(data_As, data_cold)
Joining, by = "State"
right_join: added one column (April_vacc_rate)
> rows only in x (1)
> rows only in y 1
> matched rows 1
> ===
> rows total 2
rj
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alaska 47.5% 46.2% 41.7% 2 Maine <NA> <NA> 32.4%
lj2 <- left_join(data_cold, data_As)
Joining, by = "State"
left_join: added 2 columns (June_vacc_rate, May_vacc_rate)
> rows only in x 1
> rows only in y (1)
> matched rows 1
> ===
> rows total 2
lj2
# A tibble: 2 x 4 State April_vacc_rate June_vacc_rate May_vacc_rate <chr> <chr> <chr> <chr> 1 Maine 32.4% <NA> <NA> 2 Alaska 41.7% 47.5% 46.2%
https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/full-join.gif
fj <- full_join(data_As, data_cold)
Joining, by = "State"
full_join: added one column (April_vacc_rate)
> rows only in x 1
> rows only in y 1
> matched rows 1
> ===
> rows total 3
fj
# A tibble: 3 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% <NA> 2 Alaska 47.5% 46.2% 41.7% 3 Maine <NA> <NA> 32.4%
includes duplicates”data_As
# A tibble: 2 x 2 State state_bird <chr> <chr> 1 Alabama wild turkey 2 Alaska williow ptarmigan
data_cold
# A tibble: 3 x 3 State vacc_rate month <chr> <chr> <chr> 1 Maine 32.4% April 2 Alaska 41.7% April 3 Alaska 46.2% May
includes duplicates”lj <- left_join(data_As, data_cold)
Joining, by = "State"
left_join: added 2 columns (vacc_rate, month)
> rows only in x 1
> rows only in y (1)
> matched rows 2 (includes duplicates)
> ===
> rows total 3
includes duplicates”Data including the joining column (“State”) has been duplicated.
lj
# A tibble: 3 x 4 State state_bird vacc_rate month <chr> <chr> <chr> <chr> 1 Alabama wild turkey <NA> <NA> 2 Alaska williow ptarmigan 41.7% April 3 Alaska williow ptarmigan 46.2% May
Note that “Alaska willow ptarmigan” appears twice.
includes duplicates”https://github.com/gadenbuie/tidyexplain/blob/main/images/left-join-extra.gif
tidylogunloadNamespace("tidylog")
duplicated function can give you indications if there are duplicates in a vector:duplicated(1:5)
[1] FALSE FALSE FALSE FALSE FALSE
duplicated(c(1:5, 1))
[1] FALSE FALSE FALSE FALSE FALSE TRUE
lj %>% mutate(dup_State = duplicated(State))
# A tibble: 3 x 5 State state_bird vacc_rate month dup_State <chr> <chr> <chr> <chr> <lgl> 1 Alabama wild turkey <NA> <NA> FALSE 2 Alaska williow ptarmigan 41.7% April FALSE 3 Alaska williow ptarmigan 46.2% May TRUE
by argumentBy default joins use the intersection of column names. If by is specified, it uses that.
full_join(data_As, data_cold, by = "State")
# A tibble: 4 x 4 State state_bird vacc_rate month <chr> <chr> <chr> <chr> 1 Alabama wild turkey <NA> <NA> 2 Alaska williow ptarmigan 41.7% April 3 Alaska williow ptarmigan 46.2% May 4 Maine <NA> 32.4% April
by argumentYou can join based on multiple columns by using something like by = c(col1, col2).
If the datasets have two different names for the same data, use:
full_join(data_As, data_cold, by = c("a" = "b"))
setdiff”We might want to determine what indexes ARE in the first dataset that AREN’T in the second:
data_As
# A tibble: 2 x 2 State state_bird <chr> <chr> 1 Alabama wild turkey 2 Alaska williow ptarmigan
data_cold
# A tibble: 3 x 3 State vacc_rate month <chr> <chr> <chr> 1 Maine 32.4% April 2 Alaska 41.7% April 3 Alaska 46.2% May
setdiff”Use setdiff to determine what indexes ARE in the first dataset that AREN’T in the second:
A_states <- data_As %>% pull(State) cold_states <- data_cold %>% pull(State)
setdiff(A_states, cold_states)
[1] "Alabama"
setdiff(cold_states, A_states)
[1] "Maine"