When can you use a function directly on a column of a tibble and when do you need a vector in R?

There are some inconsistencies across different functions. If it is a base R function, you often need to pull the data out into a vector first. Sometimes you just need to try them out. Always check that your work did what you expect!
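One quick way to check what you are working with is to ask R directly. The sketch below uses the built-in iris data; the names petal_df and petal_vec are only illustrative.

```r
library(dplyr)

# select() keeps a (smaller) data frame; pull() extracts the column as a vector
petal_df  <- iris %>% select(Petal.Length)
petal_vec <- iris %>% pull(Petal.Length)

is.data.frame(petal_df)   # TRUE
is.data.frame(petal_vec)  # FALSE
is.numeric(petal_df)      # FALSE: a data frame is not a numeric vector
is.numeric(petal_vec)     # TRUE
length(petal_df)          # 1 (the number of columns)
length(petal_vec)         # 150 (the number of values)
```

If a function complains or returns something odd, checks like these usually show whether it was handed a data frame when it wanted a vector.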

In general

These types of functions generally need vectors:

Type                               Example
base R and tidyverse function      setdiff()
special tidyverse functions        stringr functions like str_replace()
base R functions                   mean()
stat functions that come with R    cor.test()


Each type is covered in more detail below.


setdiff() function

This function was originally a base R function and was later adopted by tidyverse packages (dplyr and lubridate), so it can behave a bit differently depending on which version you call. This is confusing and part of the struggle of an open development environment.

When using setdiff() from base R, vectors are required, meaning we need to use pull() first. It shows the elements in the first vector that are missing from the second.

Using :: we can specify which package we want to use the function from.

library(tidyverse) # loads tibble, dplyr (pull(), select(), filter()), stringr, and the %>% pipe

data_As <- tibble(State = c("Alabama", "Alaska"),
                  state_bird = c("wild turkey", "willow ptarmigan"))
data_cold <- tibble(State = c("Maine", "Alaska", "Alaska"),
                    vacc_rate = c(0.795, 0.623, 0.626),
                    month = c("April", "April", "May"))

A_states_vector <- data_As %>% pull(State)
cold_states_vector <- data_cold %>% pull(State)

A_states_vector
## [1] "Alabama" "Alaska"
cold_states_vector
## [1] "Maine"  "Alaska" "Alaska"
base::setdiff(A_states_vector, cold_states_vector)
## [1] "Alabama"

“Alabama” is in the first vector, but not in the second vector.

If we do this with the dplyr version, it works the same way:

dplyr::setdiff(A_states_vector, cold_states_vector)
## [1] "Alabama"

We can also select the State variable (keeping it as a tibble), and the dplyr version still works as above, even though A_states_tibble and cold_states_tibble are tibbles and not vectors. (Remember select() creates a smaller tibble and pull() creates a vector of the values.) In this case we see the rows in the first tibble that are not in the second.

A_states_tibble <- data_As %>% select(State)
A_states_tibble
## # A tibble: 2 × 1
##   State  
##   <chr>  
## 1 Alabama
## 2 Alaska
cold_states_tibble <- data_cold %>% select(State)
cold_states_tibble
## # A tibble: 3 × 1
##   State 
##   <chr> 
## 1 Maine 
## 2 Alaska
## 3 Alaska
dplyr::setdiff(A_states_tibble, cold_states_tibble)
## # A tibble: 1 × 1
##   State  
##   <chr>  
## 1 Alabama

However, when using setdiff() from base R with the tibble versions of the data, it does not work properly: it compares the tibbles as lists of whole columns rather than row by row, so it just gives us back the contents of the first tibble.

base::setdiff(A_states_tibble, cold_states_tibble) 
## $State
## [1] "Alabama" "Alaska"
base::setdiff(cold_states_tibble, A_states_tibble) 
## $State
## [1] "Maine"  "Alaska" "Alaska"
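If you want the base R version but your data are in tibbles, one option is to extract the column as a vector inside the call. This is a minimal sketch that rebuilds the two one-column tibbles from above so it runs on its own; diff_pull and diff_brackets are illustrative names.

```r
library(dplyr)

A_states_tibble    <- tibble(State = c("Alabama", "Alaska"))
cold_states_tibble <- tibble(State = c("Maine", "Alaska", "Alaska"))

# pull() turns each State column into a character vector first
diff_pull <- base::setdiff(pull(A_states_tibble, State),
                           pull(cold_states_tibble, State))
diff_pull
## [1] "Alabama"

# [[ ]] also extracts a single column as a vector
diff_brackets <- base::setdiff(A_states_tibble[["State"]],
                               cold_states_tibble[["State"]])
diff_brackets
## [1] "Alabama"
```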

We could also use setdiff() from dplyr to tell us which rows were removed when filtering a data frame or tibble:

mt_cars_high_mpg <- mtcars %>% filter(mpg > 20)
dplyr::setdiff(mtcars, mt_cars_high_mpg)
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Here we see all the rows with mpg of 20 or below, which are the rows that filter() removed.
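In the spirit of checking that your work did what you expect, the row counts can be compared. This sketch assumes the data have no duplicated rows (true for mtcars); dplyr's set operations drop duplicate rows, so the counts could disagree otherwise. The name removed_rows is illustrative.

```r
library(dplyr)

mt_cars_high_mpg <- mtcars %>% filter(mpg > 20)
removed_rows <- dplyr::setdiff(mtcars, mt_cars_high_mpg)

# Rows dropped by filter() should match the rows returned by setdiff()
nrow(mtcars) - nrow(mt_cars_high_mpg)
## [1] 18
nrow(removed_rows)
## [1] 18

# Every removed row should fail the filter condition
all(removed_rows$mpg <= 20)
## [1] TRUE
```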

stringr functions

These functions are often applied within filter() or mutate() for a data frame. When they are used outside of those functions, they need to be given a vector.

iris %>% filter(str_detect(string = Species, pattern = "set")) %>% head() # this will work
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris %>% pull(Species) %>% str_detect(pattern = "set") %>% head() # so will this
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

This, however, would not work well, because select() hands str_detect() a one-column data frame instead of a vector:

iris %>% select(Species) %>% str_detect(pattern = "set")
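If you want to keep the data frame rather than pull out a vector, one alternative is to call the stringr function inside mutate(), where Species refers to the column itself. This is a sketch; the column name is_setosa and the object names are made up for illustration.

```r
library(dplyr)
library(stringr)

# Inside mutate(), Species is the column vector, so str_detect() works per row
iris_flagged <- iris %>%
  mutate(is_setosa = str_detect(Species, pattern = "set"))

# Count how many rows matched the pattern
n_setosa <- iris_flagged %>% pull(is_setosa) %>% sum()
n_setosa
## [1] 50
```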

base R math functions

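Functions like mean() need vectors. (sum() happens to accept an all-numeric data frame, but passing a vector is the safer habit.) We can use these functions directly on a column inside summarize(), but otherwise we need to pull the column out as a vector first.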

iris %>% summarize(mean_Petal_Length = mean(Petal.Length))
##   mean_Petal_Length
## 1             3.758
iris %>% pull(Petal.Length) %>% mean()
## [1] 3.758

This does not work; mean() returns NA with a warning because it was given a data frame rather than a numeric vector:

iris %>% select(Petal.Length) %>% mean()
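Besides pull(), base R offers other ways to get the column out as a vector. The sketch below shows a few equivalent options; the names m_pull, m_dollar, and m_brackets are illustrative.

```r
library(dplyr)

# Three ways to hand mean() a numeric vector instead of a data frame
m_pull     <- iris %>% pull(Petal.Length) %>% mean()
m_dollar   <- mean(iris$Petal.Length)
m_brackets <- mean(iris[["Petal.Length"]])

m_pull
## [1] 3.758

# To summarize several columns at once while staying in a data frame
iris %>% summarize(across(c(Petal.Length, Petal.Width), mean))
```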

stats functions

Functions like cor.test() from the stats package (which comes automatically with R) need vectors too. Note that some of the other stats functions tolerate tibbles.

x <- iris %>% pull(Petal.Length)
y <- iris %>% pull(Petal.Width)

cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 43.387, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9490525 0.9729853
## sample estimates:
##       cor 
## 0.9628654

This does not work, because x and y are now one-column data frames instead of numeric vectors:

x <- iris %>% select(Petal.Length)
y <- iris %>% select(Petal.Width)

cor.test(x, y)
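If you would rather not create separate vectors, cor.test() also has a formula interface that takes the data frame through its data argument. This is a sketch of that alternative, along with a with() version; result_formula and result_with are illustrative names.

```r
# Formula interface: name both columns on the right-hand side of ~
result_formula <- cor.test(~ Petal.Length + Petal.Width, data = iris)

# with() evaluates the call using the columns of iris as vectors
result_with <- with(iris, cor.test(Petal.Length, Petal.Width))

result_formula$estimate
##       cor 
## 0.9628654
```

Both versions give the same correlation estimate as the pull() approach above.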