When can you use a function directly on a column of a tibble and when do you need a vector in R?
There are some inconsistencies across functions. If it is a base R function, you typically need to pull the data out into a vector first. Sometimes you just need to try a function both ways. Always check that your work did what you expect!
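One way to check what kind of object you are about to hand to a function is sketched below. It assumes the dplyr package is installed and uses the built-in iris data; the object names col_as_df and col_as_vector are made up for illustration.

```r
library(dplyr)

# select() keeps the column inside a data frame
col_as_df <- iris %>% select(Petal.Length)

# pull() extracts the column's values as a plain vector
col_as_vector <- iris %>% pull(Petal.Length)

is.data.frame(col_as_df)      # TRUE
is.data.frame(col_as_vector)  # FALSE
is.numeric(col_as_vector)     # TRUE
```

If is.data.frame() returns TRUE and the function you want expects a vector, that is a hint to use pull() first.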
In general these functions need to use vectors:
Type | Example |
---|---|
base R and tidyverse function | setdiff() |
special tidyverse functions | stringr functions like str_replace() |
base R functions | mean() |
stat functions that come with R | cor.test() |
Each type is covered in more detail below.
setdiff() function
This function was originally a base R function and was later adopted by tidyverse packages (dplyr and lubridate), so it can behave a bit differently depending on which version is used. This is confusing and part of the struggle of an open development environment.
When using setdiff() from base R, vectors are required, meaning we need to use pull() first. It shows the elements in the first vector that are missing from the second. Using :: we can specify which package we want to use the function from.
library(tidyverse) # loads dplyr, tibble, and stringr, which the examples below use
data_As <- tibble(State = c("Alabama", "Alaska"),
                  state_bird = c("wild turkey", "willow ptarmigan"))
data_cold <- tibble(State = c("Maine", "Alaska", "Alaska"),
                    vacc_rate = c(0.795, 0.623, 0.626),
                    month = c("April", "April", "May"))
A_states_vector <- data_As %>% pull(State)
cold_states_vector <- data_cold %>% pull(State)
A_states_vector
## [1] "Alabama" "Alaska"
cold_states_vector
## [1] "Maine" "Alaska" "Alaska"
base::setdiff(A_states_vector, cold_states_vector)
## [1] "Alabama"
“Alabama” is in the first vector, but not in the second vector.
If we do this with the dplyr version, it works the same way:
dplyr::setdiff(A_states_vector, cold_states_vector)
## [1] "Alabama"
We can instead select() the State variable to compare (keeping it as a tibble), and the dplyr version still works as above, even though A_states_tibble and cold_states_tibble are tibbles and not vectors. (Remember that select() creates a smaller tibble and pull() creates a vector of the values.) In this case we see the rows in the first tibble that are not in the second.
A_states_tibble <- data_As %>% select(State)
A_states_tibble
## # A tibble: 2 × 1
## State
## <chr>
## 1 Alabama
## 2 Alaska
cold_states_tibble <- data_cold %>% select(State)
cold_states_tibble
## # A tibble: 3 × 1
## State
## <chr>
## 1 Maine
## 2 Alaska
## 3 Alaska
dplyr::setdiff(A_states_tibble, cold_states_tibble)
## # A tibble: 1 × 1
## State
## <chr>
## 1 Alabama
However, when using setdiff() from base R with the tibble versions of the data, it does not work properly: it does not compare rows and just gives us the contents of the first tibble as a list.
base::setdiff(A_states_tibble, cold_states_tibble)
## $State
## [1] "Alabama" "Alaska"
base::setdiff(cold_states_tibble, A_states_tibble)
## $State
## [1] "Maine" "Alaska" "Alaska"
We could also use setdiff() from dplyr to tell us which rows were removed when filtering a data frame or tibble:
mt_cars_high_mpg <- mtcars %>% filter(mpg > 20)
dplyr::setdiff(mtcars, mt_cars_high_mpg)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Here we see all the rows with mpg of 20 or below, which are the rows that filter() removed.
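In the spirit of checking that your work did what you expect, the result can be sanity-checked with a short sketch like the one below. It assumes dplyr is installed; the name removed_rows is made up for illustration, and the row-count check relies on mtcars having no duplicated rows, since set operations may drop duplicates.

```r
library(dplyr)

mt_cars_high_mpg <- mtcars %>% filter(mpg > 20)
removed_rows <- dplyr::setdiff(mtcars, mt_cars_high_mpg)

# Kept rows plus removed rows should account for every original row
nrow(mt_cars_high_mpg) + nrow(removed_rows) == nrow(mtcars)

# None of the removed rows should satisfy the original filter condition
any(removed_rows$mpg > 20)
```

The first check should come back TRUE and the second FALSE; anything else suggests the filter or the comparison did not do what was intended.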
stringr functions
These functions are often applied within filter() or mutate() for a data frame. When they are not used inside these functions, they need to be used on a vector.
iris %>% filter(str_detect(string = Species, pattern = "set")) %>% head() # this will work
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris %>% pull(Species) %>% str_detect(pattern = "set") %>% head() # so will this
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
This, however, would not work well, because str_detect() is given a one-column data frame instead of a vector:
iris %>% select(Species) %>% str_detect(pattern = "set")
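The same pattern applies to other stringr functions such as str_replace(). The sketch below assumes dplyr and stringr are installed; the column name Species_short and the replacement text are made up for illustration.

```r
library(dplyr)
library(stringr)

# Inside mutate(), Species refers to the column's values, so no pull() is needed
iris_short <- iris %>%
  mutate(Species_short = str_replace(string = Species,
                                     pattern = "setosa",
                                     replacement = "set"))
head(iris_short)

# Outside mutate(), pull the column out into a vector first
species_vector <- iris %>%
  pull(Species) %>%
  str_replace(pattern = "setosa", replacement = "set")
head(species_vector)
```

In both cases str_replace() ends up working on a vector of values; mutate() simply hands it the column for you.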
base R math functions
Functions like sum() or mean() need vectors. They work well within summarize(), but otherwise we need to pull the column out into a vector first.
iris %>% summarize(mean_Petal_Length = mean(Petal.Length))
## mean_Petal_Length
## 1 3.758
iris %>% pull(Petal.Length) %>% mean()
## [1] 3.758
This does not work; mean() returns NA with a warning because it is given a data frame rather than a numeric vector:
iris %>% select(Petal.Length) %>% mean()
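pull() is not the only way to get a vector out of a data frame. As a small sketch using only base R, the $ and [[ ]] operators also return the column's values as a vector; the variable names here are made up for illustration.

```r
# Both of these extract the column as a numeric vector
mean_dollar  <- mean(iris$Petal.Length)
mean_bracket <- mean(iris[["Petal.Length"]])

mean_dollar
mean_bracket
```

Both give the same value as the pull() version above. Note that single brackets, as in iris["Petal.Length"], keep the data frame structure (much like select()), so they run into the same problem with mean().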
stats functions
Functions like cor.test() from the stats package (which comes automatically with R) need vectors too. Note that some of the other stats functions are tolerant of tibbles.
x <- iris %>% pull(Petal.Length)
y <- iris %>% pull(Petal.Width)
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 43.387, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9490525 0.9729853
## sample estimates:
## cor
## 0.9628654
This does not work:
x <- iris %>% select(Petal.Length)
y <- iris %>% select(Petal.Width)
cor.test(x, y)
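If you would rather not create separate vectors, cor.test() also has a formula interface that looks the columns up in a data frame. The sketch below uses only base R and the stats package; the names result_formula and result_with are made up for illustration.

```r
# Formula interface: the columns are found in `data`, so no pull() is needed
result_formula <- cor.test(~ Petal.Length + Petal.Width, data = iris)
result_formula$estimate

# with() evaluates the call using the data frame's columns as vectors
result_with <- with(iris, cor.test(Petal.Length, Petal.Width))
result_with$estimate
```

Both approaches should report the same correlation as the pull() version above, because cor.test() still receives numeric vectors in the end.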