Load all the libraries we will use in this lab.

library(tidyverse)

1.0

Load the Youth Tobacco Survey data from http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv. Select the “Sample_Size”, “Education”, and “LocationAbbr” columns. Name this data “yts”.

yts <- read_csv("http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv")
## Rows: 9794 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
## dbl  (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
yts <- yts %>% select(Sample_Size, Education, LocationAbbr)

1.1

Create a boxplot showing the difference in “Sample_Size” between Middle School and High School “Education”. Hint: Use aes(x = Education, y = Sample_Size) and geom_boxplot().

yts %>%
  ggplot(aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

1.2

Use count to count the number of observations in each “Education” group.

yts %>%
  count(Education)
## # A tibble: 2 × 2
##   Education         n
##   <chr>         <int>
## 1 High School    4588
## 2 Middle School  5206
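
As a rough sketch of what count is doing, the toy example below (made-up values, not the YTS data) compares it with an explicit group_by and summarize. The object names `toy`, `via_count`, and `via_summarize` are hypothetical and only used for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, not the YTS data)
toy <- tibble(Education = c("High School", "Middle School", "High School"))

# count() tallies the rows in each group
via_count <- toy %>%
  count(Education)

# The same tally written out with group_by() and summarize()
via_summarize <- toy %>%
  group_by(Education) %>%
  summarize(n = n())

via_count
via_summarize
```

Both approaches should give one row per “Education” value with the number of rows in column n.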

1.3

Make “Education” a factor using the mutate and factor functions. Use the levels argument inside factor to reorder “Education”. Reorder this variable so that “Middle School” comes before “High School”. Assign the output the name “yts_fct”.

yts_fct <-
  yts %>% mutate(Education = factor(Education,
    levels = c("Middle School", "High School")
  ))
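
To see what the levels argument changes, here is a minimal sketch on a toy vector (made-up values, not the YTS data). The names `edu`, `default_fct`, and `ordered_fct` are hypothetical.

```r
# Toy vector (hypothetical values, not the YTS data)
edu <- c("High School", "Middle School", "High School")

# Without `levels`, factor() sorts the unique values alphabetically
default_fct <- factor(edu)
levels(default_fct)

# With `levels`, the order is exactly the one we supply
ordered_fct <- factor(edu, levels = c("Middle School", "High School"))
levels(ordered_fct)

# The values themselves are unchanged; only the level order differs
as.character(ordered_fct)
```

Plots and count tables follow the level order, which is why the reordered factor is expected to put “Middle School” first.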

1.4

Repeat questions 1.1 and 1.2 using the “yts_fct” data. You should see different ordering in the plot and count table.

yts_fct %>%
  ggplot(aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

yts_fct %>%
  count(Education)
## # A tibble: 2 × 2
##   Education         n
##   <fct>         <int>
## 1 Middle School  5206
## 2 High School    4588

Practice on Your Own!

P.1

Convert “LocationAbbr” (state) in “yts_fct” into a factor using the mutate and factor functions. Do not add a levels = argument.

yts_fct <- yts_fct %>% mutate(LocationAbbr = factor(LocationAbbr))

P.2

We want to create a new column that contains the group-level median sample size.

  • Using the “yts_fct” data, group_by “LocationAbbr”.
  • Then, use mutate to create a new column “med_sample_size” that is the median “Sample_Size”.
  • Hint: Since you have already done group_by, the median “Sample_Size” will automatically be computed separately for each unique level in “LocationAbbr”. Use the median function with na.rm = TRUE.

yts_fct <- yts_fct %>%
  group_by(LocationAbbr) %>%
  mutate(med_sample_size = median(Sample_Size, na.rm = TRUE))
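
The difference between mutate and summarize after group_by can be illustrated on a small toy data set (made-up values, not the YTS data). The names `toy`, `state`, `size`, `toy_mut`, and `toy_sum` are hypothetical, and the ungroup step is an optional addition that is not part of the lab solution.

```r
library(dplyr)

# Toy data (hypothetical values, not the YTS data)
toy <- tibble(
  state = c("AL", "AL", "AL", "AZ", "AZ"),
  size  = c(10, 20, NA, 5, 7)
)

# mutate() after group_by() keeps every row and repeats the group median
toy_mut <- toy %>%
  group_by(state) %>%
  mutate(med_size = median(size, na.rm = TRUE)) %>%
  ungroup() # optional: remove the grouping so later steps are not grouped

# summarize() instead collapses the data to one row per group
toy_sum <- toy %>%
  group_by(state) %>%
  summarize(med_size = median(size, na.rm = TRUE))

toy_mut
toy_sum
```

In this sketch `toy_mut` keeps all five rows, while `toy_sum` has one row per state. The lab uses mutate because the plot in P.3 needs the median alongside every original observation.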

P.3

We want to plot the “LocationAbbr” (state) by the “med_sample_size” column we created above. Using the forcats package, create a plot that:

  • Has “LocationAbbr” on the x-axis
  • Uses the mapping argument and the fct_reorder function to order the x-axis by “med_sample_size”
  • Has “Sample_Size” on the y-axis
  • Is a boxplot (geom_boxplot)
  • Has the x-axis label “State” (don’t worry if you get a warning about not being able to plot NA values)

Save your plot using ggsave() with a width of 10 and height of 3.

Which state has the largest median sample size?

library(forcats)

yts_fct_plot <- yts_fct %>%
  drop_na() %>% # drop rows with a missing value in any column
  ggplot(aes(
    x = fct_reorder(
      LocationAbbr, med_sample_size
    ),
    y = Sample_Size
  )) +
  geom_boxplot() +
  labs(x = "State")

ggsave(
  filename = "yts_fct.png", # will save in working directory
  plot = yts_fct_plot,
  width = 10, height = 3
)
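
One way to check how fct_reorder orders the x-axis, and to read off the group with the largest median, is to look at the reordered levels directly. The sketch below uses toy vectors (made-up values, not the YTS data); the names `state`, `med`, and `state_by_med` are hypothetical.

```r
library(forcats)

# Toy data (hypothetical values, not the YTS data)
state <- factor(c("AL", "AL", "AZ", "AZ", "CA", "CA"))
med   <- c(30, 30, 10, 10, 20, 20)

# Default level order is alphabetical
levels(state)

# fct_reorder() sorts the levels by a summary of the second vector
# (the median by default), in ascending order
state_by_med <- fct_reorder(state, med)
levels(state_by_med)

# The last level is the group with the largest summary value
tail(levels(state_by_med), 1)
```

Applied to the lab data, the same idea suggests that the state drawn furthest to the right in the plot is the one with the largest median sample size.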