Helpful tips before we start

TROUBLESHOOTING: Common new user mistakes we have seen

  • Check the file path – is the file there?
  • Typos (R is case sensitive, x and X are different)
  • Unclosed quotes, parentheses, and brackets
  • Deleting part of the code chunk
  • For any function, you can write ?FUNCTION_NAME, or help("FUNCTION_NAME") to look at the help file
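
As a quick illustration of the case-sensitivity tip above, here is a small sketch. The object name my_value is made up for this example, and tryCatch() is used only so that the deliberate typo does not stop the code.

```r
# Create an object with an all-lowercase name
my_value <- 5

exists("my_value")   # TRUE: this exact name exists
exists("My_Value")   # FALSE: different capitalization is a different name

# Typing the wrong case produces an "object not found" style error.
# tryCatch() captures the error message instead of halting.
result <- tryCatch(My_Value, error = function(e) conditionMessage(e))
result
```

If you see an error like this, check spelling and capitalization before anything else.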

R Projects

R Projects can help you keep files organized and avoid issues with working directories. Check out our resource here: https://jhudatascience.org/intro_to_r/resources/R_Projects.html

Lab

In this lab you can use the interactive console to explore, or Knit the document. Remember that anything you type in an R code chunk can be “sent” to the console with Cmd-Enter (macOS) or Ctrl-Enter (Windows/Linux).

# Load the necessary package
library(readr)

1.1

Use the manual import method (File > Import Dataset > From Text (readr)) to read in SARS-CoV-2 vaccination data from this URL:

https://jhudatascience.org/intro_to_r/data/vaccinations.csv

You can learn more about how the data was collected here: https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

1.2

What is the dataset object called? You can find this information in the Console or the Environment. Enter your answer as a comment using #.

# vaccinations

1.3

Preview the data by clicking the table button in the Environment. How many observations and variables are there? Enter your answer as a comment using #.

# 37272 obs. of 103 variables
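
The same counts can also be checked in code rather than by clicking. This is a minimal sketch that uses R's built-in mtcars data frame as a stand-in (an assumption for illustration only); substitute the name of the object you imported.

```r
# mtcars ships with R and stands in for your imported dataset
example_data <- mtcars

dim(example_data)    # number of rows, then number of columns
nrow(example_data)   # observations (rows)
ncol(example_data)   # variables (columns)
```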

1.4

Read in SARS-CoV-2 vaccination data from URL https://jhudatascience.org/intro_to_r/data/vaccinations.csv and assign it to an object named vacc. Use the code structure below.

# General format
library(readr)
# OBJECT <- read_csv(FILE)

library(readr)
vacc <- read_csv(file = "https://jhudatascience.org/intro_to_r/data/vaccinations.csv")
## Rows: 37272 Columns: 103
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (2): Date, Location
## dbl (101): MMWR_week, Distributed, Distributed_Janssen, Distributed_Moderna,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
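
The two hints at the end of that message can be tried out as follows. This sketch uses a tiny, made-up CSV written to a temporary file (the file contents and the name quiet_data are assumptions for illustration), so it runs without an internet connection.

```r
library(readr)

# Write a small, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Date,Location,Distributed",
             "01/01/2022,MD,100",
             "01/02/2022,MD,150"), tmp)

# show_col_types = FALSE suppresses the column specification message
quiet_data <- read_csv(file = tmp, show_col_types = FALSE)

# spec() retrieves the full column specification that readr guessed
spec(quiet_data)
```

Silencing the message is convenient in knitted documents, but it is worth reading at least once to confirm that each column was read as the type you expect.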

1.5

Take a look at the data. Do these data objects (vaccinations and vacc) appear to be the same? Why or why not?

# Yes, when we look in the RStudio environment, the two objects have the same dimensions. If we use the View() or str() functions, we can also see in more detail that the data is the same. 
# If we wanted to get really in the weeds, we could run a formal comparison such as all.equal(vacc, vaccinations), which returns TRUE when the two objects match.
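
A comparison like this could be sketched as below. The temporary file and the names first_read and second_read are made up for illustration; in the lab you would compare vacc and vaccinations instead.

```r
library(readr)

# Write a small, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Location,Distributed", "MD,100", "VA,200"), tmp)

# Read the same file twice, mimicking the manual and the coded import
first_read  <- read_csv(tmp, show_col_types = FALSE)
second_read <- read_csv(tmp, show_col_types = FALSE)

# Quick, coarse check: do the dimensions match?
identical(dim(first_read), dim(second_read))

# all.equal() returns TRUE or a description of the differences,
# so isTRUE() turns the result into a single TRUE/FALSE
same <- isTRUE(all.equal(first_read, second_read))
same
```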

1.6

Learn your working directory by running getwd(). This is where R will look for files unless you tell it otherwise.

getwd()
## [1] "/__w/intro_to_r/intro_to_r/modules/Data_Input/lab"
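
Once you know the working directory, you can check whether a file is actually there before trying to read it. This is a minimal sketch using base R only; the file name "asthma.xlsx" is just an example.

```r
# Where R is currently looking for files
wd <- getwd()
wd

# Build the full path to a file inside the working directory
# ("asthma.xlsx" is only an example file name)
target <- file.path(wd, "asthma.xlsx")

# TRUE if the file is there, FALSE if it is not
file.exists(target)

# See what is actually in the working directory
list.files(wd)
```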

Practice on Your Own!

P.1

Load the readxl package with the library() command.

If it is not installed, install it via: RStudio --> Tools --> Install Packages. You can also try install.packages("readxl").

library(readxl)

P.2

Download the dataset of asthma prevalence in the USA from https://jhudatascience.org/intro_to_r/data/asthma.xlsx and save it as asthma.xlsx by running the following code chunk. This only downloads the file; it does NOT bring the file into R.

download.file(
  url = "https://jhudatascience.org/intro_to_r/data/asthma.xlsx",
  destfile = "asthma.xlsx",
  mode = "wb"
)

Note: the “wb” option downloads the file in binary mode, which keeps it from being corrupted on Windows. It is safe to use on macOS and Linux as well.

P.3

Use the read_excel() function in the readxl package to read the asthma.xlsx file and call the output asthma.

asthma <- read_excel(path = "asthma.xlsx")

P.4

Run the following code - is there a problem? How do you know?

yts <- read_delim("https://jhudatascience.org/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv", delim = "\t")
## Rows: 9794 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): YEAR,LocationAbbr,LocationDesc,TopicType,TopicDesc,MeasureDesc,Data...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
yts
## # A tibble: 9,794 × 1
##    YEAR,LocationAbbr,LocationDesc,TopicType,TopicDesc,MeasureDesc,DataSource,R…¹
##    <chr>                                                                        
##  1 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cessation (Youth),Percent of Curr…
##  2 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cessation (Youth),Percent of Curr…
##  3 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cessation (Youth),Percent of Curr…
##  4 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cessation (Youth),Quit Attempt in…
##  5 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cessation (Youth),Quit Attempt in…
##  6 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cessation (Youth),Quit Attempt in…
##  7 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cigarette Use (Youth),Smoking Sta…
##  8 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cigarette Use (Youth),Smoking Sta…
##  9 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cigarette Use (Youth),Smoking Sta…
## 10 "2015,AZ,Arizona,Tobacco Use – Survey Data,Cigarette Use (Youth),Smoking Sta…
## # ℹ 9,784 more rows
## # ℹ abbreviated name:
## #   ¹​`YEAR,LocationAbbr,LocationDesc,TopicType,TopicDesc,MeasureDesc,DataSource,Response,Data_Value_Unit,Data_Value_Type,Data_Value,Data_Value_Footnote_Symbol,Data_Value_Footnote,Data_Value_Std_Err,Low_Confidence_Limit,High_Confidence_Limit,Sample_Size,Gender,Race,Age,Education,GeoLocation,TopicTypeId,TopicId,MeasureId,StratificationID1,StratificationID2,StratificationID3,StratificationID4,SubMeasureID,DisplayOrder`
# It should be a red flag to see that there is only one column that looks like: "YEAR,LocationAbbr,LocationDesc,TopicType,TopicDesc,MeasureDesc..."
# This file is comma delimited, not tab delimited!
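
One way to see the difference the delimiter makes is to read the same file both ways. This sketch uses a tiny stand-in file built from the first three column names shown above (the temporary file and the object names wrong, right, and right_csv are assumptions for illustration).

```r
library(readr)

# A small comma-delimited stand-in file
tmp <- tempfile(fileext = ".csv")
writeLines(c("YEAR,LocationAbbr,LocationDesc",
             "2015,AZ,Arizona",
             "2015,AZ,Arizona"), tmp)

# Wrong delimiter: every line lands in a single column
wrong <- read_delim(tmp, delim = "\t", show_col_types = FALSE)
ncol(wrong)   # 1

# Correct delimiter: one column per field
right <- read_delim(tmp, delim = ",", show_col_types = FALSE)
ncol(right)   # 3

# read_csv() is the shortcut for comma-delimited files
right_csv <- read_csv(tmp, show_col_types = FALSE)
names(right_csv)
```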

P.5

By default, read_excel() reads the first sheet of an Excel file. Copy your code from question P.3 and add the following argument: sheet = 2. Inspect the data using head().

asthma <- read_excel(path = "asthma.xlsx", sheet = 2)
head(asthma)
## # A tibble: 6 × 3
##   Characteristic      `Weighted Number With Current Asthma` `Percent (SE)`
##   <chr>                                               <dbl> <chr>         
## 1 0–4                                                394206 2.0 (0.43)    
## 2 5–11                                              1641279 5.9 (0.58)    
## 3 5–14                                              2699214 6.6 (0.55)    
## 4 5-17 (School Age)                                 3832453 7.2 (0.49)    
## 5 12-14 (Young Teens)                               1057935 8.1 (1.10)    
## 6 12–17                                             2191174 8.6 (0.77)
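
If you are not sure which sheets a workbook contains, readxl can list them. The sketch below uses the datasets.xlsx example workbook that ships with the readxl package instead of asthma.xlsx, so it runs without downloading anything; the object names are made up for illustration.

```r
library(readxl)

# readxl_example() returns the path to an example workbook bundled with readxl
example_path <- readxl_example("datasets.xlsx")

# List the sheet names so you know what sheet = 2 refers to
sheets <- excel_sheets(example_path)
sheets

# A sheet can be selected by position or by name
by_position <- read_excel(example_path, sheet = 2)
by_name     <- read_excel(example_path, sheet = sheets[2])

# Both calls read the same sheet, so the dimensions match
identical(dim(by_position), dim(by_name))
```

Selecting a sheet by name is often safer than by position, since it still works if someone reorders the sheets in the workbook.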

P.6

Install and load the haven package. Look at the help page for the read_dta() function, and scroll to the very bottom of the page. Try running some of the examples provided.

install.packages("haven")
library(haven)
?read_dta

path <- system.file("examples", "iris.dta", package = "haven")
read_dta(path)