In this module, we will show you how to:
- Reshape data from wide (fat) to long (tall)
- Reshape data from long (tall) to wide (fat)
- Merge Data/Joins
In this module, we will show you how to:
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf
https://github.com/gadenbuie/tidyexplain/blob/main/images/tidyr-pivoting.gif
Data is stored differently in the tibble.
Wide: has many columns
# A tibble: 1 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4%
Long: column names become data
# A tibble: 3 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4%
Wide: multiple columns per individual, values spread across multiple columns
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4% 2 Alaska 47.5% 46.2% 41.7%
Long: multiple rows per observation, a single column contains the values
# A tibble: 6 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4% 4 Alaska June_vacc_rate 47.5% 5 Alaska May_vacc_rate 46.2% 6 Alaska April_vacc_rate 41.7%
Data is wide or long with respect to certain variables.
Wide: Easier for humans to read
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 32.4% 2 Alaska 47.5% 46.2% 41.7%
Long: Easier for R to make plots & do analysis
# A tibble: 6 x 3 State name value <chr> <chr> <chr> 1 Alabama June_vacc_rate 37.2% 2 Alabama May_vacc_rate 36.0% 3 Alabama April_vacc_rate 32.4% 4 Alaska June_vacc_rate 47.5% 5 Alaska May_vacc_rate 46.2% 6 Alaska April_vacc_rate 41.7%
tidyr
allows you to “tidy” your data. We will be talking about:
pivot_longer
- make multiple columns into variables, (wide to long)pivot_wider
- make a variable into multiple columns, (long to wide)separate
- string into multiple columns (review)The reshape
command exists. It is a confusing function. Don’t use it.
pivot_longer
…tidyr::pivot_longer
- puts column data into rows.
names_to =
gives a new name to the pivoted columnsvalues_to =
gives a new name to the values that used to be in those columns{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot}, names_to = {New column name: contains old column names}, values_to = {New column name: contains cell values})
wide_data
# A tibble: 1 x 3 June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> 1 37.2% 36.0% 32.4%
long_data <- wide_data %>% pivot_longer(cols = everything(), names_to = "Month", values_to = "Rate") long_data
# A tibble: 3 x 2 Month Rate <chr> <chr> 1 June_vacc_rate 37.2% 2 May_vacc_rate 36.0% 3 April_vacc_rate 32.4%
http://jhudatascience.org/intro_to_R_class/data/Charm_City_Circulator_Ridership.csv
circ <- jhur::read_circulator() head(circ, 5)
# A tibble: 5 x 15 day date orangeBoardings orangeAlightings orangeAverage purpleBoardings <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Monday 01/11/… 877 1027 952 NA 2 Tuesday 01/12/… 777 815 796 NA 3 Wednes… 01/13/… 1203 1220 1212. NA 4 Thursd… 01/14/… 1194 1233 1214. NA 5 Friday 01/15/… 1645 1643 1644 NA # … with 9 more variables: purpleAlightings <dbl>, purpleAverage <dbl>, # greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>, # bannerBoardings <dbl>, bannerAlightings <dbl>, bannerAverage <dbl>, # daily <dbl>
long <- circ %>% pivot_longer(starts_with(c("orange","purple","green","banner")), names_to = "var", values_to = "number") long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # … with 13,742 more rows
There are many ways to select the columns we want. Use ?tidyr_tidy_select
to look at more column selection options.
long <- circ %>% pivot_longer( !c(day, date, daily), names_to = "var", values_to = "number") long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orangeBoardings 877 2 Monday 01/11/2010 952 orangeAlightings 1027 3 Monday 01/11/2010 952 orangeAverage 952 4 Monday 01/11/2010 952 purpleBoardings NA 5 Monday 01/11/2010 952 purpleAlightings NA 6 Monday 01/11/2010 952 purpleAverage NA 7 Monday 01/11/2010 952 greenBoardings NA 8 Monday 01/11/2010 952 greenAlightings NA 9 Monday 01/11/2010 952 greenAverage NA 10 Monday 01/11/2010 952 bannerBoardings NA # … with 13,742 more rows
long %>% count(var)
# A tibble: 12 x 2 var n <chr> <int> 1 bannerAlightings 1146 2 bannerAverage 1146 3 bannerBoardings 1146 4 greenAlightings 1146 5 greenAverage 1146 6 greenBoardings 1146 7 orangeAlightings 1146 8 orangeAverage 1146 9 orangeBoardings 1146 10 purpleAlightings 1146 11 purpleAverage 1146 12 purpleBoardings 1146
We will use str_replace
from the stringr
package to put _
in the names
long <- long %>% mutate( var = str_replace(var, "Board", "_Board"), var = str_replace(var, "Alight", "_Alight"), var = str_replace(var, "Average", "_Average") ) long
# A tibble: 13,752 x 5 day date daily var number <chr> <chr> <dbl> <chr> <dbl> 1 Monday 01/11/2010 952 orange_Boardings 877 2 Monday 01/11/2010 952 orange_Alightings 1027 3 Monday 01/11/2010 952 orange_Average 952 4 Monday 01/11/2010 952 purple_Boardings NA 5 Monday 01/11/2010 952 purple_Alightings NA 6 Monday 01/11/2010 952 purple_Average NA 7 Monday 01/11/2010 952 green_Boardings NA 8 Monday 01/11/2010 952 green_Alightings NA 9 Monday 01/11/2010 952 green_Average NA 10 Monday 01/11/2010 952 banner_Boardings NA # … with 13,742 more rows
Now each var
is Boardings, Averages, or Alightings. We use “into =
” to name the new columns and “sep =
” to show where the separation should happen.
long <- long %>% separate(var, into = c("line", "type"), sep = "_") long
# A tibble: 13,752 x 6 day date daily line type number <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Monday 01/11/2010 952 orange Boardings 877 2 Monday 01/11/2010 952 orange Alightings 1027 3 Monday 01/11/2010 952 orange Average 952 4 Monday 01/11/2010 952 purple Boardings NA 5 Monday 01/11/2010 952 purple Alightings NA 6 Monday 01/11/2010 952 purple Average NA 7 Monday 01/11/2010 952 green Boardings NA 8 Monday 01/11/2010 952 green Alightings NA 9 Monday 01/11/2010 952 green Average NA 10 Monday 01/11/2010 952 banner Boardings NA # … with 13,742 more rows
pivot_wider
…tidyr::pivot_wider
- spreads row data into columns.
names_from =
the old column whose contents will be spread into multiple new column names.values_from =
the old column whose contents will fill in the values of those new columns.{wide_data} <- {long_data} %>% pivot_wider(names_from = {Old column name: contains new column names}, values_from = {Old column name: contains new cell values})
long_data
# A tibble: 3 x 2 Month Rate <chr> <chr> 1 June_vacc_rate 37.2% 2 May_vacc_rate 36.0% 3 April_vacc_rate 32.4%
wide_data <- long_data %>% pivot_wider(names_from = "Month", values_from = "Rate") wide_data
# A tibble: 1 x 3 June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> 1 37.2% 36.0% 32.4%
long
# A tibble: 13,752 x 6 day date daily line type number <chr> <chr> <dbl> <chr> <chr> <dbl> 1 Monday 01/11/2010 952 orange Boardings 877 2 Monday 01/11/2010 952 orange Alightings 1027 3 Monday 01/11/2010 952 orange Average 952 4 Monday 01/11/2010 952 purple Boardings NA 5 Monday 01/11/2010 952 purple Alightings NA 6 Monday 01/11/2010 952 purple Average NA 7 Monday 01/11/2010 952 green Boardings NA 8 Monday 01/11/2010 952 green Alightings NA 9 Monday 01/11/2010 952 green Average NA 10 Monday 01/11/2010 952 banner Boardings NA # … with 13,742 more rows
wide <- long %>% pivot_wider(names_from = "type", values_from = "number") wide
# A tibble: 4,584 x 7 day date daily line Boardings Alightings Average <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> 1 Monday 01/11/2010 952 orange 877 1027 952 2 Monday 01/11/2010 952 purple NA NA NA 3 Monday 01/11/2010 952 green NA NA NA 4 Monday 01/11/2010 952 banner NA NA NA 5 Tuesday 01/12/2010 796 orange 777 815 796 6 Tuesday 01/12/2010 796 purple NA NA NA 7 Tuesday 01/12/2010 796 green NA NA NA 8 Tuesday 01/12/2010 796 banner NA NA NA 9 Wednesday 01/13/2010 1212. orange 1203 1220 1212. 10 Wednesday 01/13/2010 1212. purple NA NA NA # … with 4,574 more rows
dplyr
?join
- see different types of joining for dplyr
inner_join(x, y)
- only rows that match for x
and y
are keptfull_join(x, y)
- all rows of x
and y
are keptleft_join(x, y)
- all rows of x
are kept even if not merged with y
right_join(x, y)
- all rows of y
are kept even if not merged with x
anti_join(x, y)
- all rows from x
not in y
keeping just columns from x
.data_As
# A tibble: 2 x 3 State June_vacc_rate May_vacc_rate <chr> <chr> <chr> 1 Alabama 37.2% 36.0% 2 Alaska 47.5% 46.2%
data_cold
# A tibble: 2 x 2 State April_vacc_rate <chr> <chr> 1 Maine 32.4% 2 Alaska 41.7%
https://github.com/gadenbuie/tidyexplain/blob/main/images/inner-join.gif
ij = inner_join(data_As, data_cold)
Joining, by = "State"
ij
# A tibble: 1 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alaska 47.5% 46.2% 41.7%
https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/left-join.gif
lj = left_join(data_As, data_cold)
Joining, by = "State"
lj
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% <NA> 2 Alaska 47.5% 46.2% 41.7%
tidylog
package to log outputs# install.packages("tidylog") library(tidylog) left_join(data_As, data_cold)
Joining, by = "State"
left_join: added one column (April_vacc_rate)
> rows only in x 1
> rows only in y (1)
> matched rows 1
> ===
> rows total 2
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% <NA> 2 Alaska 47.5% 46.2% 41.7%
https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/right-join.gif
rj <- right_join(data_As, data_cold)
Joining, by = "State"
right_join: added one column (April_vacc_rate)
> rows only in x (1)
> rows only in y 1
> matched rows 1
> ===
> rows total 2
rj
# A tibble: 2 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alaska 47.5% 46.2% 41.7% 2 Maine <NA> <NA> 32.4%
lj2 <- left_join(data_cold, data_As)
Joining, by = "State"
left_join: added 2 columns (June_vacc_rate, May_vacc_rate)
> rows only in x 1
> rows only in y (1)
> matched rows 1
> ===
> rows total 2
lj2
# A tibble: 2 x 4 State April_vacc_rate June_vacc_rate May_vacc_rate <chr> <chr> <chr> <chr> 1 Maine 32.4% <NA> <NA> 2 Alaska 41.7% 47.5% 46.2%
https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/full-join.gif
fj <- full_join(data_As, data_cold)
Joining, by = "State"
full_join: added one column (April_vacc_rate)
> rows only in x 1
> rows only in y 1
> matched rows 1
> ===
> rows total 3
fj
# A tibble: 3 x 4 State June_vacc_rate May_vacc_rate April_vacc_rate <chr> <chr> <chr> <chr> 1 Alabama 37.2% 36.0% <NA> 2 Alaska 47.5% 46.2% 41.7% 3 Maine <NA> <NA> 32.4%
includes duplicates
”data_As
# A tibble: 2 x 2 State state_bird <chr> <chr> 1 Alabama wild turkey 2 Alaska williow ptarmigan
data_cold
# A tibble: 3 x 3 State vacc_rate month <chr> <chr> <chr> 1 Maine 32.4% April 2 Alaska 41.7% April 3 Alaska 46.2% May
includes duplicates
”lj <- left_join(data_As, data_cold)
Joining, by = "State"
left_join: added 2 columns (vacc_rate, month)
> rows only in x 1
> rows only in y (1)
> matched rows 2 (includes duplicates)
> ===
> rows total 3
includes duplicates
”Data including the joining column (“State”) has been duplicated.
lj
# A tibble: 3 x 4 State state_bird vacc_rate month <chr> <chr> <chr> <chr> 1 Alabama wild turkey <NA> <NA> 2 Alaska williow ptarmigan 41.7% April 3 Alaska williow ptarmigan 46.2% May
Note that “Alaska willow ptarmigan” appears twice.
includes duplicates
”https://github.com/gadenbuie/tidyexplain/blob/main/images/left-join-extra.gif
tidylog
unloadNamespace("tidylog")
duplicated
function can give you indications if there are duplicates in a vector:duplicated(1:5)
[1] FALSE FALSE FALSE FALSE FALSE
duplicated(c(1:5, 1))
[1] FALSE FALSE FALSE FALSE FALSE TRUE
lj %>% mutate(dup_State = duplicated(State))
# A tibble: 3 x 5 State state_bird vacc_rate month dup_State <chr> <chr> <chr> <chr> <lgl> 1 Alabama wild turkey <NA> <NA> FALSE 2 Alaska williow ptarmigan 41.7% April FALSE 3 Alaska williow ptarmigan 46.2% May TRUE
by
argumentBy default joins use the intersection of column names. If by
is specified, it uses that.
full_join(data_As, data_cold, by = "State")
# A tibble: 4 x 4 State state_bird vacc_rate month <chr> <chr> <chr> <chr> 1 Alabama wild turkey <NA> <NA> 2 Alaska williow ptarmigan 41.7% April 3 Alaska williow ptarmigan 46.2% May 4 Maine <NA> 32.4% April
by
argumentYou can join based on multiple columns by using something like by = c(col1, col2)
.
If the datasets have two different names for the same data, use:
full_join(data_As, data_cold, by = c("a" = "b"))
setdiff
”We might want to determine what indexes ARE in the first dataset that AREN’T in the second:
data_As
# A tibble: 2 x 2 State state_bird <chr> <chr> 1 Alabama wild turkey 2 Alaska williow ptarmigan
data_cold
# A tibble: 3 x 3 State vacc_rate month <chr> <chr> <chr> 1 Maine 32.4% April 2 Alaska 41.7% April 3 Alaska 46.2% May
setdiff
”Use setdiff
to determine what indexes ARE in the first dataset that AREN’T in the second:
A_states <- data_As %>% pull(State) cold_states <- data_cold %>% pull(State)
setdiff(A_states, cold_states)
[1] "Alabama"
setdiff(cold_states, A_states)
[1] "Maine"