Load all the libraries we will use in this lab.
library(tidyverse)
Load the Youth Tobacco Survey data from http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv. Select the columns “Sample_Size”, “Education”, and “LocationAbbr”. Name this data “yts”.
yts <- read_csv("http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv")
## Rows: 9794 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): LocationAbbr, LocationDesc, TopicType, TopicDesc, MeasureDesc, Dat...
## dbl (7): YEAR, Data_Value, Data_Value_Std_Err, Low_Confidence_Limit, High_C...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
yts <- yts %>% select(Sample_Size, Education, LocationAbbr)
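The two steps above (read, then select) can also be combined. The sketch below assumes readr 2.0 or later, where read_csv() accepts a col_select argument, and uses a small made-up CSV string in place of the real file so that it runs without a network connection. The object names csv_text and yts_small are hypothetical.

```r
library(readr)

# Hypothetical three-row stand-in for the survey file (values invented for illustration)
csv_text <- I(
  "LocationAbbr,Education,Sample_Size,YEAR
AZ,Middle School,100,2015
AZ,High School,NA,2015
CT,High School,250,2015
"
)

# Read only the three columns we need and silence the column specification message
yts_small <- read_csv(
  csv_text,
  col_select = c(Sample_Size, Education, LocationAbbr),
  show_col_types = FALSE
)

names(yts_small)
```

With the real data, the same arguments could be passed alongside the URL used above.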
Create a boxplot showing the difference in “Sample_Size” between Middle School and High School “Education”. Hint: Use aes(x = Education, y = Sample_Size) and geom_boxplot().
yts %>%
  ggplot(aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
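The warning says that some rows could not be plotted. A likely cause, though the lab does not confirm it, is missing values in “Sample_Size”. The sketch below shows how to count missing values in a toy vector (the values are invented). Running the same check on yts$Sample_Size would show whether its count matches the number in the warning.

```r
# Toy vector standing in for the Sample_Size column (values invented for illustration)
sample_size <- c(120, NA, 85, NA, 300)

# is.na() flags each missing value; sum() counts the TRUEs
n_missing <- sum(is.na(sample_size))
n_missing

# Keep only the observed values, which is what the plot ends up using
observed <- sample_size[!is.na(sample_size)]
observed
```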
Use count to count the number of observations in each “Education” group.
yts %>%
  count(Education)
## # A tibble: 2 × 2
## Education n
## <chr> <int>
## 1 High School 4588
## 2 Middle School 5206
Make “Education” a factor using the mutate and factor functions. Use the levels argument inside factor to reorder “Education” so that “Middle School” comes before “High School”. Assign the output the name “yts_fct”.
yts_fct <- yts %>%
  mutate(Education = factor(Education,
    levels = c("Middle School", "High School")
  ))
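To see what the levels argument changes, here is a minimal base R sketch on a toy vector. The names education and education_fct are hypothetical and not part of the lab data.

```r
# Toy vector standing in for the Education column
education <- c("High School", "Middle School", "High School")

# Without levels, factor() orders the unique values alphabetically
default_levels <- levels(factor(education))
default_levels

# With levels, the order is exactly what we specify
education_fct <- factor(education, levels = c("Middle School", "High School"))
levels(education_fct)

# The underlying integer codes follow the level order
as.integer(education_fct)
```

Plots and count tables follow the level order, which is why the reordered factor changes how they are arranged.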
Repeat questions 1.1 and 1.2 using the “yts_fct” data. You should see a different ordering in the plot and in the count table.
yts_fct %>%
  ggplot(aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
yts_fct %>%
  count(Education)
## # A tibble: 2 × 2
## Education n
## <fct> <int>
## 1 Middle School 5206
## 2 High School 4588
Convert “LocationAbbr” (state) in “yts_fct” into a factor using the mutate and factor functions. Do not add a levels = argument.
yts_fct <- yts_fct %>% mutate(LocationAbbr = factor(LocationAbbr))
We want to create a new column that contains the group-level median sample size.
First, group_by “LocationAbbr”.
Then use mutate to create a new column “med_sample_size” that is the median “Sample_Size”.
Hint: because the data are grouped with group_by, a median “Sample_Size” will automatically be created for each unique level in “LocationAbbr”. Use the median function with na.rm = TRUE.
yts_fct <- yts_fct %>%
  group_by(LocationAbbr) %>%
  mutate(med_sample_size = median(Sample_Size, na.rm = TRUE))
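The sketch below shows the grouped mutate on a small made-up table, so the per-group result is easy to check by eye. The toy and toy_med objects are hypothetical. The final ungroup() is an optional addition that the lab solution does not use; it simply removes the grouping so that later steps act on the whole table.

```r
library(dplyr)

# Hypothetical stand-in for yts_fct (values invented for illustration)
toy <- tibble(
  LocationAbbr = c("AZ", "AZ", "AZ", "CT", "CT"),
  Sample_Size  = c(100, 300, NA, 50, 70)
)

toy_med <- toy %>%
  group_by(LocationAbbr) %>%
  # median() is evaluated once per state; na.rm = TRUE ignores the missing value
  mutate(med_sample_size = median(Sample_Size, na.rm = TRUE)) %>%
  ungroup()

toy_med
```

Every row keeps its own “Sample_Size” and gains the median of its state, so the number of rows is unchanged.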
We want to plot the “LocationAbbr” (state) by the “med_sample_size” column we created above. Using the forcats package, create a plot that has “LocationAbbr” on the x-axis, uses the mapping argument and the fct_reorder function to order the x-axis by “med_sample_size”, has “Sample_Size” on the y-axis, is a boxplot (geom_boxplot), and has the x-axis label State.
(Don’t worry if you get a warning about not being able to plot NA values.)
Save your plot using ggsave() with a width of 10 and height of 3.
Which state has the largest median sample size?
library(forcats)
yts_fct_plot <- yts_fct %>%
  drop_na() %>%
  ggplot(aes(
    x = fct_reorder(
      LocationAbbr, med_sample_size
    ),
    y = Sample_Size
  )) +
  geom_boxplot() +
  labs(x = "State")
ggsave(
  filename = "yts_fct.png", # will save in working directory
  plot = yts_fct_plot,
  width = 10, height = 3
)
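Because fct_reorder sorts the states from smallest to largest “med_sample_size” by default, the state with the largest median should be the right-most box in the plot. One way to confirm this without reading the plot is sketched below on a made-up table. The toy and top_state objects are hypothetical, and the values are invented.

```r
library(dplyr)

# Hypothetical stand-in for yts_fct with the per-state median already attached
toy <- tibble(
  LocationAbbr    = c("AZ", "AZ", "CT", "CT", "MS", "MS"),
  med_sample_size = c(200, 200, 60, 60, 450, 450)
) %>%
  group_by(LocationAbbr) # mimic the grouping left over from the earlier step

top_state <- toy %>%
  ungroup() %>% # otherwise slice_max() would work within each state
  distinct(LocationAbbr, med_sample_size) %>% # one row per state
  slice_max(med_sample_size, n = 1) # keep the largest median

top_state
```

Running the same pipeline on yts_fct would return the state that answers the question above.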