--- title: "Data Summarization" output: ioslides_presentation: css: ../styles.css widescreen: yes --- ```{r, echo = FALSE, message=FALSE, error = FALSE} library(knitr) opts_chunk$set(comment = "", message = FALSE) suppressWarnings({library(dplyr)}) library(readr) library(tidyverse) library(jhur) ``` ## Data Summarization * Basic statistical summarization * `mean(x)`: takes the mean of x * `sd(x)`: takes the standard deviation of x * `median(x)`: takes the median of x * `quantile(x)`: displays sample quantiles of x. Default is min, IQR, max * `range(x)`: displays the range. Same as `c(min(x), max(x))` * `sum(x)`: sum of x * `max(x)`: maximum value in x * `min(x)`: minimum value in x * **all have a **`na.rm` for missing data * Transformations * `log` - log (base `e`) transformation * `log10` - log base 10 transform * `sqrt` - square root ## Statistical summarization The vector getting summarized goes inside the parentheses: ```{r} x <- c(1, 5, 7, 4, 2, 8) mean(x) range(x) sum(x) ``` ## Statistical summarization Note that many of these functions have additional inputs regarding missing data, typically requiring the `na.rm` argument ("remove NAs"). ```{r error = TRUE} x <- c(1, 5, 7, 4, 2, 8, NA) mean(x) mean(x, na.rm = TRUE) quantile(x) quantile(x, na.rm = TRUE) ``` ## Statistical summarization{.codesmall} We will talk more about data types later, but you can only do summarization on numeric or logical types. Not characters or factors. ```{r error = TRUE} x <- c(1, 5, 7, 4, 2, 8) sum(x) y <- c(TRUE, FALSE, FALSE, TRUE) # FALSE == 0 and TRUE == 1 sum(y) z <- c("TRUE", "FALSE", "FALSE", "TRUE") sum(z) mean(z) ``` ## Some examples We can use the `jhu_cars` to explore different ways of summarizing data. The `head` command displays the first rows of an object: ```{r} library(jhur) head(jhu_cars) ``` ## Statistical summarization Note - the `$` references/selects columns from a `data.frame`/`tibble`: ```{r} mean(jhu_cars$hp) quantile(jhu_cars$hp) ``` ## Statistical summarization The "tidy" way: ```{r} jhu_cars %>% pull(hp) %>% mean() # alt: pull(jhu_cars, hp) %>% mean() jhu_cars %>% pull(hp) %>% quantile() ``` ## Statistical summarization ```{r} jhu_cars %>% pull(wt) %>% median() jhu_cars %>% pull(wt) %>% quantile(probs = 0.6) ``` ## Data Summarization on data frames * Basic statistical summarization * `rowMeans(x)`: takes the means of each row of x * `colMeans(x)`: takes the means of each column of x * `rowSums(x)`: takes the sum of each row of x * `colSums(x)`: takes the sum of each column of x * `summary(x)`: for data frames, displays the quantile information ## TB Incidence Let's read in a `tibble` of values from TB incidence. If you have the `jhur` package installed successfully: ```{r} tb <- jhur::read_tb() ```
If not, download the `xlsx` file from this link and read it in using `read_csv()`: http://jhudatascience.org/intro_to_R_class/data/tb_incidence.xlsx ## TB Incidence Check out the data: ```{r} head(tb) colnames(tb) ``` ## Indicator of TB Before we go further, let's rename the first column to be the country measured using the `rename` function in `dplyr`. In this case, we have to use the backticks (\`) because there are spaces and funky characters in the name: ```{r} library(dplyr) tb <- tb %>% rename(country = `TB incidence, all forms (per 100 000 population per year)`) ```
`colnames` will show us the column names and show that country is renamed: ```{r} colnames(tb) ``` ## Summarize the data: `dplyr` `summarize` function `dplyr::summarize` will allow you to summarize data. Format is `new = SUMMARY`. ```{r, eval = FALSE} # General format - Not the code! {data object to update} <- {data to use} %>% summarize({summary column name} = {operator(source column)}) ``` ```{r} tb %>% summarize(mean_2006 = mean(`2006`, na.rm = TRUE)) ``` ## Summarize the data: `dplyr` `summarize` function `summarize` can do multiple operations at once. Just separate by a comma. ```{r} tb %>% summarize(mean_2006 = mean(`2006`, na.rm = TRUE), median_2007 = median(`2007`, na.rm = TRUE), median(`2004`, na.rm = TRUE)) ```
Notice how when we forget to provide a new name, output is still provided, but the column name is messy. ## Iterative summaries: `dplyr` `summarize` and `across` functions Use the [`across`](https://dplyr.tidyverse.org/reference/across.html) function with `summarize` to summarize across multiple columns of your data. ```{r} tb %>% summarize(across( c(`1990`, `1991`, `1992`, `1993`), ~ sum(.x, na.rm = TRUE))) tb %>% summarize(across( starts_with("2"), ~ range(.x, na.rm = TRUE))) ``` ## Row means `colMeans` and `rowMeans` require **all numeric data**. Let's see what the mean is across each row (country): ```{r} tb_2 <- column_to_rownames(tb, "country") # opposite of rownames_to_column() ! head(tb_2, 2) rowMeans(tb_2, na.rm = TRUE) ``` ## Row means `colMeans` gives you very similar output to functions we've seen previously in this lecture (`summarize` and `across`). ```{r} colMeans(tb_2, na.rm = TRUE) tb_2 %>% summarize(across( colnames(tb_2), ~ mean(.x, na.rm = TRUE))) ``` ## `summary` Function Using `summary` can give you rough snapshots of each column, but you would likely use `mean`, `min`, `max`, and `quantile` when necessary (and number of NAs): ```{r} summary(tb) ``` ## Lab Part 1 [Website](http://jhudatascience.org/intro_to_R_class/index.html) ## Youth Tobacco Survey Here we will be using the Youth Tobacco Survey data: http://jhudatascience.org/intro_to_R_class/data/Youth_Tobacco_Survey_YTS_Data.csv ```{r} yts <- jhur::read_yts() head(yts) ``` ## Length and unique `unique(x)` will return the unique elements of `x` ```{r, message = FALSE} locations <- yts %>% pull(LocationDesc) unique(locations) %>% head() ``` `length` will tell you the length of a vector. Combined with `unique`, tells you the number of unique elements: ```{r} length(unique(locations)) ``` ## `table` and `dplyr`: `count` `table(x)` will return a frequency table of unique elements of `x` ```{r, message = FALSE} table(locations) ``` ## `table` and `dplyr`: `count` Use `count` directly on a data.frame and column without needing to use `pull`. ```{r, message = FALSE} yts %>% count(LocationDesc) ``` ## `table` and `dplyr`: `count` Multiple columns listed further subdivides the count. ```{r, message = FALSE} yts %>% count(LocationDesc, TopicDesc) ``` # Grouping ## Perform Operations By Groups: dplyr `group_by` allows you group the data set by grouping variables: ```{r} # yts ``` ## Perform Operations By Groups: dplyr `group_by` allows you group the data set by grouping variables: ```{r} yts <- yts %>% group_by(Response) yts ``` ## Summarize the grouped data It's grouped! Grouping doesn't change the data in any way, but how **functions operate on it**. Now we can summarize `Data_Value` (percent of respondents) by group: ```{r} yts %>% summarize(avg_percent = mean(Data_Value, na.rm = TRUE)) ``` ## Using the `pipe` to connect these Pipe `yts` into `group_by`, then pipe that into `summarize`: ```{r} yts %>% group_by(Response) %>% summarize(avg_percent = mean(Data_Value, na.rm = TRUE), max_percent = max(Data_Value, na.rm = TRUE)) ``` ## Ungroup the data The `ungroup` function will allow you to clear the groups from the data. You can also overwrite the first `group_by` with a new one. ```{r} yts = ungroup(yts) yts ``` ## `group_by` with `mutate` - just add data We can also use `mutate` to calculate the mean value for each year and add it as a column: ```{r} yts %>% group_by(YEAR) %>% mutate(year_avg = mean(Data_Value, na.rm = TRUE)) %>% select(LocationDesc, Data_Value, year_avg) ``` ## Counting There are other functions, such as `n()` count the number of observations. ```{r} yts %>% group_by(YEAR) %>% summarize(n = n(), mean = mean(Data_Value, na.rm = TRUE)) ``` ## Lab Part 2 [Website](http://jhudatascience.org/intro_to_R_class/index.html) # Preview: plotting ## Basic Plots Plotting is an important component of exploratory data analysis. These are some rough one-line plots that you can use in realtime while exploring your data. We will go over formatting and making plots look nicer in additional lectures. * Basic summarization plots: * `plot(x,y)`: scatterplot of x and y * `boxplot(y~x)`: boxplot of y against levels of x * `hist(x)`: histogram of x * `plot(density(x))`: kernel density plot of x ## Scatterplot ```{r} plot( pull(jhu_cars,hp), pull(jhu_cars,mpg) ) # alt: plot(jhu_cars$hp, jhu_cars$mpg) ``` ## Boxplot ```{r} boxplot( pull(jhu_cars,hp) ~ pull(jhu_cars,cyl) ) ``` ## Histogram ```{r} hist(pull(jhu_cars,mpg)) ``` ## Histogram Use the `breaks =` argument to tweak the resolution: ```{r} hist(pull(jhu_cars,mpg), breaks = 10) ``` ## Density ```{r} plot(density(pull(jhu_cars,mpg))) ``` ## Lab Part 3 [Website](http://jhudatascience.org/intro_to_R_class/index.html)