1. Load all the libraries we will use in this lab.
library(dplyr)
library(ggplot2)
  1. Load the Youth Tobacco Survey data (using jhur::read_yts()). Select the “Sample_Size”, “Education”, and “LocationAbbr” columns. Name this data “yts”.
yts <- jhur::read_yts() %>% select(Sample_Size, Education, LocationAbbr)
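# Alt (read_csv() comes from the readr package, which is not loaded above):
# yts <- readr::read_csv("http://jhudatascience.org/intro_to_R_class/data/Youth_Tobacco_Survey_YTS_Data.csv") %>%
#   select(Sample_Size, Education, LocationAbbr)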
  1. Create a boxplot showing the difference in “Sample_Size” between Middle School and High School “Education”. Hint: Use aes(x = Education, y = Sample_Size) and geom_boxplot().
yts %>%
  ggplot(aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite values (stat_boxplot).

  1. Use group_by and tally to count the number of rows in each “Education” group.
yts %>% group_by(Education) %>% tally()
## # A tibble: 2 x 2
##   Education         n
##   <chr>         <int>
## 1 High School    4588
## 2 Middle School  5206
  1. Make “Education” a factor using the mutate and factor functions. Use the levels argument inside factor to reorder “Education”. Reorder this variable so that “Middle School” comes before “High School”. Assign the output the name “yts_fct”.
yts_fct <-
  yts %>% mutate(Education = factor(Education, levels = c("Middle School", "High School")))
  1. Repeat the boxplot and the tally from the earlier questions using the “yts_fct” data. You should see “Middle School” listed before “High School” in both the plot and the tally table.
yts_fct %>%
  ggplot(aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite values (stat_boxplot).

yts_fct %>% group_by(Education) %>% tally()
## # A tibble: 2 x 2
##   Education         n
##   <fct>         <int>
## 1 Middle School  5206
## 2 High School    4588
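The change in ordering comes entirely from the factor levels. As a minimal sketch on a made-up character vector (the vector `edu` is invented for illustration and is not the YTS data): without a levels argument, factor() sorts the levels alphabetically; with one, it keeps the order supplied.

```r
# Toy vector, invented for illustration (not the YTS data)
edu <- c("High School", "Middle School", "Middle School", "High School")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(edu))
default_levels   # "High School" first, then "Middle School"

# Explicit levels: the order supplied is the order used
edu_fct <- factor(edu, levels = c("Middle School", "High School"))
levels(edu_fct)  # "Middle School" first, then "High School"

# Tabulating follows the level order, as do discrete ggplot2 axes
table(edu_fct)
```

This is why the character version of “Education” puts “High School” first, while “yts_fct” puts “Middle School” first.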

BONUS

  1. Convert “LocationAbbr” (state) in “yts_fct” into a factor using the mutate and factor functions. Do not add a levels = argument.
yts_fct <- yts_fct %>% mutate(LocationAbbr = factor(LocationAbbr))
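  1. We want to create a new column that contains the group-level median sample size. Group “yts_fct” by “LocationAbbr”, then use mutate to add a column named “med_sample_size” holding the median of “Sample_Size” within each state.
yts_fct <- yts_fct %>%
  group_by(LocationAbbr) %>%
  mutate(med_sample_size = median(Sample_Size, na.rm = TRUE))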
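A grouped mutate keeps every row and repeats the group's median on each one, which is what lets the median sit alongside the raw values. The sketch below shows this on an invented tibble; the `toy` data, its column names, and its values are hypothetical, and the final ungroup() is an optional extra that the answer above does not use.

```r
library(dplyr)

# Invented data for illustration (not the YTS data)
toy <- tibble(
  state = c("AL", "AL", "AL", "MS", "MS"),
  size  = c(10, 20, NA, 5, 7)
)

toy_med <- toy %>%
  group_by(state) %>%
  # na.rm = TRUE drops the missing value before taking the median
  mutate(med_size = median(size, na.rm = TRUE)) %>%
  ungroup()

nrow(toy_med)     # still 5 rows: mutate() does not collapse groups
toy_med$med_size  # 15 for each AL row, 6 for each MS row
```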
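  1. We want to plot the “LocationAbbr” (state) by the “med_sample_size” column we created above. Using the forcats package, create a boxplot with “LocationAbbr” on the x-axis, ordered by “med_sample_size” using fct_reorder(), and “Sample_Size” on the y-axis.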
library(forcats)

yts_fct %>%
  ggplot(mapping = aes(x = fct_reorder(LocationAbbr, med_sample_size),
                       y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite values (stat_boxplot).
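fct_reorder() sorts the levels of a factor by a summary of a second variable (the median by default), in ascending order. A minimal sketch on invented values, where `state` and `med` are hypothetical stand-ins for “LocationAbbr” and “med_sample_size”:

```r
library(forcats)

# Invented values for illustration (not the YTS data)
state <- factor(c("AL", "AL", "MS", "MS", "WV", "WV"))
med   <- c(30, 30, 10, 10, 20, 20)

levels(state)  # alphabetical: AL, MS, WV

# Reorder levels by the median of `med` within each level
reordered <- fct_reorder(state, med)
levels(reordered)  # ascending by median: MS, WV, AL

# .desc = TRUE reverses the order
levels(fct_reorder(state, med, .desc = TRUE))  # AL, WV, MS
```

In the plot above, this ordering is what arranges the states along the x-axis from the smallest to the largest median sample size.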