Load all the libraries we will use in this lab.
library(dplyr)
library(ggplot2)
Load the Youth Tobacco Survey data (using the jhur library function read_yts()). Use select to keep “Sample_Size”, “Education”, and “LocationAbbr”. Name this data “yts”.
library(jhur)
yts <- read_yts() %>% select(Sample_Size, Education, LocationAbbr)
# Alt, reading the CSV directly (needs library(readr)); apply the same select() afterwards:
# yts <- read_csv("http://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv")
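To see what select() does without downloading the survey, here is a minimal sketch on a made-up table. The data frame toy and its values are invented for illustration; only the column names mirror the YTS data.

```r
library(dplyr)

# Hypothetical stand-in for the YTS data (all values are made up)
toy <- data.frame(
  YEAR = c(2015, 2015, 2016),
  LocationAbbr = c("AL", "MD", "NJ"),
  Education = c("Middle School", "High School", "Middle School"),
  Sample_Size = c(120, NA, 95),
  stringsAsFactors = FALSE
)

# Keep only the three columns used in this lab, in the order listed
toy_small <- toy %>% select(Sample_Size, Education, LocationAbbr)
names(toy_small)
```

Columns come back in the order they are named inside select(), and every other column (here YEAR) is dropped.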
Create a boxplot showing the difference in “Sample_Size” between Middle School and High School “Education”. Hint: Use aes(x = Education, y = Sample_Size) and geom_boxplot().
yts %>%
  ggplot(mapping = aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
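A warning like this usually means that some rows have a missing or otherwise non-finite “Sample_Size” and were left out of the boxplot. One way to check is to count those rows yourself. The sketch below uses a made-up data frame (toy is hypothetical), so its count will not match the 425 reported above.

```r
library(dplyr)

# Hypothetical data with two missing sample sizes
toy <- data.frame(
  Education = c("Middle School", "High School", "High School", "Middle School"),
  Sample_Size = c(120, NA, 95, NA),
  stringsAsFactors = FALSE
)

# Count rows whose Sample_Size is not a finite number (NA, NaN, Inf, -Inf)
n_dropped <- toy %>%
  filter(!is.finite(Sample_Size)) %>%
  nrow()
n_dropped
```

Running the same filter on “yts” should show how many rows the plot skipped.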
Use count to count up the number of observations of data for each “Education” group.
yts %>%
  count(Education)
## # A tibble: 2 × 2
## Education n
## <chr> <int>
## 1 High School 4588
## 2 Middle School 5206
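For intuition, count() gives the same tallies as grouping and then summarizing with n(). The following is a small sketch on invented data (toy is hypothetical), not the lab's own solution.

```r
library(dplyr)

# Hypothetical data: two Middle School rows, one High School row
toy <- data.frame(
  Education = c("Middle School", "High School", "Middle School"),
  stringsAsFactors = FALSE
)

# Shortcut: one row per group with its size in column n
by_count <- toy %>% count(Education)

# Longer form that produces the same counts
by_summary <- toy %>%
  group_by(Education) %>%
  summarize(n = n())

by_count
by_summary
```

Because “Education” is still a character column here, both tables list the groups alphabetically, which is why “High School” appears first in the output above.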
Make “Education” a factor using the mutate and factor functions. Use the levels argument inside factor to reorder “Education”. Reorder this variable so that “Middle School” comes before “High School”. Assign the output the name “yts_fct”.
yts_fct <-
  yts %>% mutate(Education = factor(Education,
    levels = c("Middle School", "High School")
  ))
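The effect of the levels argument is easiest to see on a short vector. This sketch uses an invented character vector (education and edu_fct are illustrative names) and base R only.

```r
# Hypothetical values, similar to the Education column
education <- c("High School", "Middle School", "High School")

# Without levels, factor() orders the levels alphabetically
levels(factor(education))

# With levels, the order is exactly the one supplied
edu_fct <- factor(education, levels = c("Middle School", "High School"))
levels(edu_fct)

# Underlying integer codes follow the level order (Middle = 1, High = 2)
as.integer(edu_fct)

# Caution: a value that is not listed in levels becomes NA
is.na(factor("Elementary School", levels = c("Middle School", "High School")))
```

The last line is worth remembering: a typo in the levels vector would silently turn that whole group into NA.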
Repeat questions 1.1 and 1.2 (the boxplot and the count table) using the “yts_fct” data. You should see a different ordering in the plot and the count table.
yts_fct %>%
  ggplot(mapping = aes(x = Education, y = Sample_Size)) +
  geom_boxplot()
## Warning: Removed 425 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
yts_fct %>%
  count(Education)
## # A tibble: 2 × 2
## Education n
## <fct> <int>
## 1 Middle School 5206
## 2 High School 4588
Convert “LocationAbbr” (state) in “yts_fct” into a factor using the mutate and factor functions. Do not add a levels = argument.
yts_fct <- yts_fct %>% mutate(LocationAbbr = factor(LocationAbbr))
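When no levels = argument is given, factor() takes the sorted unique values as the levels. A quick sketch on a made-up vector of state abbreviations (states and state_fct are illustrative names):

```r
# Hypothetical state abbreviations, deliberately out of order with a repeat
states <- c("NJ", "AL", "MD", "AL")

state_fct <- factor(states)

# Levels are the unique values in sorted order
levels(state_fct)

# Number of distinct levels
nlevels(state_fct)
```

On the real data, levels(yts_fct$LocationAbbr) should list the state abbreviations in the same alphabetical fashion.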
We want to create a new column that contains the group-level median sample size.
First, group_by “LocationAbbr”.
Then use mutate to create a new column “med_sample_size” that is the median “Sample_Size”.
Hint: because the data are grouped with group_by, a median “Sample_Size” will automatically be created for each unique level in “LocationAbbr”. Use the median function with na.rm = TRUE.
yts_fct <- yts_fct %>%
  group_by(LocationAbbr) %>%
  mutate(med_sample_size = median(Sample_Size, na.rm = TRUE))
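The difference between a grouped mutate and a grouped summarize is worth seeing side by side. This is a minimal sketch on invented numbers (toy, with_median, and one_per_state are illustrative names); the ungroup() at the end is an optional addition, not part of the lab solution.

```r
library(dplyr)

# Hypothetical data: three rows for AL (one missing), two for MD
toy <- data.frame(
  LocationAbbr = c("AL", "AL", "AL", "MD", "MD"),
  Sample_Size = c(10, 30, NA, 100, 200),
  stringsAsFactors = FALSE
)

# Grouped mutate keeps every row and repeats the group median on each one
with_median <- toy %>%
  group_by(LocationAbbr) %>%
  mutate(med_sample_size = median(Sample_Size, na.rm = TRUE)) %>%
  ungroup()

# Grouped summarize collapses to one row per group instead
one_per_state <- toy %>%
  group_by(LocationAbbr) %>%
  summarize(med_sample_size = median(Sample_Size, na.rm = TRUE))

with_median
one_per_state
```

Without na.rm = TRUE, the AL median would be NA because one of its sample sizes is missing.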
We want to plot the “LocationAbbr” (state) by the “med_sample_size” column we created above. Using the forcats package, create a plot that:
uses the mapping argument and the fct_reorder function to order the x-axis by “med_sample_size”,
is a boxplot (geom_boxplot), and
has the x-axis label State.
(Don’t worry if you get a warning about not being able to plot NA values.)
Save your plot using ggsave() with a width of 10 and height of 3.
Which state has the largest median sample size?
library(forcats)
library(tidyr) # provides drop_na()
yts_fct_plot <- yts_fct %>%
  drop_na() %>%
  ggplot(mapping = aes(
    x = fct_reorder(LocationAbbr, med_sample_size),
    y = Sample_Size
  )) +
  geom_boxplot() +
  labs(x = "State")
ggsave(
  filename = "yts_fct.png", # will save in working directory
  plot = yts_fct_plot,
  width = 10, height = 3
)
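The final question can be read off the plot (the rightmost box), but it can also be checked in code. The sketch below shows one possible approach on invented data; toy, top_state, and ordered_states are hypothetical names, and the values are not from the survey.

```r
library(dplyr)
library(forcats)

# Hypothetical data with a precomputed group median per row
toy <- data.frame(
  LocationAbbr = c("AL", "AL", "MD", "MD", "NJ"),
  med_sample_size = c(20, 20, 150, 150, 80),
  stringsAsFactors = FALSE
)

# Option 1: one row per state, sorted from largest median down
top_state <- toy %>%
  distinct(LocationAbbr, med_sample_size) %>%
  arrange(desc(med_sample_size)) %>%
  slice(1)
top_state

# Option 2: fct_reorder sorts levels from smallest to largest,
# so the last level is the state with the largest median
ordered_states <- fct_reorder(factor(toy$LocationAbbr), toy$med_sample_size)
levels(ordered_states)
tail(levels(ordered_states), 1)
```

Applying either option to “yts_fct” should name the state that sits furthest right in the saved plot.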