* Character: strings or individual characters, quoted
* Numeric: any real number(s)
* Integer: any integer(s)/whole numbers (1, 2, 3)
* Double: any number with fractional values (1.2, 4.01, 1.00004)
* Factor: categorical/qualitative variables
* Logical: variables composed of TRUE or FALSE
* Date/POSIXct: represents calendar dates and times
We have already covered character and numeric types.
class(c("tree", "cloud", "stars_&_sky"))
## [1] "character"
class(c(1, 4, 7))
## [1] "numeric"
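This can also be a bit tricky. A vector can hold only one class, so mixing numbers and strings turns everything into character, and numbers written in quotes are character too.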
class(c(1, 2, "tree"))
## [1] "character"
class(c("1", "4", "7"))
## [1] "character"
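The result above follows from how R combines mixed values: every element is converted to the most flexible class present. The sketch below illustrates the usual ordering (logical, then integer, then double, then character). The values are made up for illustration, and the L suffix (which marks a number as an integer) is an addition not used elsewhere in this lesson.

```r
# Each call mixes two classes; the more flexible one wins.
class(c(TRUE, 2L))       # logical + integer  -> "integer"
class(c(2L, 3.5))        # integer + double   -> "numeric"
class(c(3.5, "tree"))    # double + character -> "character"

# With all three mixed, every element becomes a string.
mixed <- c(TRUE, 3.5, "tree")
mixed                    # "TRUE" "3.5" "tree"
```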
logical is a type that only has two possible elements: TRUE and FALSE
x <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
class(x)
## [1] "logical"
Note that logical elements are NOT in quotes.
z <- c("TRUE", "FALSE", "TRUE", "FALSE")
class(z)
## [1] "character"
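One practical reason the quotes matter: in arithmetic, R treats TRUE as 1 and FALSE as 0, so a logical vector can be counted or averaged directly, while the quoted version must be coerced first. A minimal sketch reusing the two vectors from above:

```r
x <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
n_true <- sum(x)       # number of TRUE values: 3
prop_true <- mean(x)   # proportion of TRUE values: 0.6

z <- c("TRUE", "FALSE", "TRUE", "FALSE")
n_true_z <- sum(as.logical(z))   # coerce the strings first: 2
```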
The class of the data tells R how to process the data. For example, it determines whether you can make summary statistics (numbers) or if you can sort alphabetically (characters).
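To see how class changes processing, the sketch below sorts the same values first as character and then as numeric. The vector name ages and its values are hypothetical, chosen only for illustration.

```r
ages <- c("10", "9", "2")             # numbers stored as character

sorted_chr <- sort(ages)              # alphabetical: "10" "2" "9"
sorted_num <- sort(as.numeric(ages))  # numerical: 2 9 10
avg <- mean(as.numeric(ages))         # 7 (mean() needs numeric input)
```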
There is one useful function associated with practically all R classes:
as.CLASS_NAME(x) coerces between classes. It turns x into a certain class.
Examples:
* as.numeric()
* as.character()
* as.logical()
* as.double()
* as.integer()
* as.Date()
* as.factor() (More on this one later!)

Sometimes coercing works great!
as.character(4)
## [1] "4"
as.numeric(c("1", "4", "7"))
## [1] 1 4 7
as.logical(c("TRUE", "FALSE", "FALSE"))
## [1] TRUE FALSE FALSE
as.logical(0)
## [1] FALSE
When interpretation is ambiguous, R will return NA (an R constant representing “Not Available” i.e. missing value)
as.numeric(c("1", "4", "7a"))
## Warning: NAs introduced by coercion
## [1] 1 4 NA
as.logical(c("TRUE", "FALSE", "UNKNOWN"))
## [1] TRUE FALSE NA
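After coercing, it is often worth checking how many values turned into NA and which ones they were. The sketch below is one way to do that, assuming the input had no missing values to begin with; the vector raw and its contents are hypothetical.

```r
raw <- c("1", "4", "7a", "12", "n/a")        # hypothetical messy input

# suppressWarnings() hides the "NAs introduced by coercion" warning
nums <- suppressWarnings(as.numeric(raw))

n_failed <- sum(is.na(nums))   # how many values failed to convert: 2
failed <- raw[is.na(nums)]     # which ones: "7a" "n/a"
```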
What is one reason we might want to convert data to numeric?
A. So we can take the mean
B. So the data looks better
C. So our data is correct
There are two major number subclasses or types:
Double (1.003)
Integer (1)
Double is equivalent to numeric. It is a number that can contain fractional values, with any number of digits after the decimal point (up to the limits of double precision).
Double stands for double-precision.
y <- c(1.1, 2.0, 3.21, 4.5, 5.62)
y
## [1] 1.10 2.00 3.21 4.50 5.62
class(y)
## [1] "numeric"
typeof(y)
## [1] "double"
The num function of the tibble package can be used to change format. See here for more: https://tibble.tidyverse.org/articles/numbers.html
Integer is a special number that contains only whole numbers.
y
## [1] 1.10 2.00 3.21 4.50 5.62
y_int <- as.integer(y)
y_int
## [1] 1 2 3 4 5
class(y_int)
## [1] "integer"
typeof(y_int)
## [1] "integer"
You can use the as.integer() function to create integers. Numbers are stored as double by default, unless they are read in as integers or created as such with seq() and sample().
x <- c(1, 2, 3, 4, 5) # whole numbers, but stored as double
class(x)
## [1] "numeric"
typeof(x)
## [1] "double"
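The sketch below compares a few ways of creating whole numbers and the class each one produces. The L suffix is an addition not covered elsewhere in this lesson; it is base R syntax for writing an integer directly.

```r
class(5)                 # "numeric" (double by default)
class(5L)                # "integer" (L suffix)
class(1:5)               # "integer"
class(seq(1, 5))         # "integer"
class(sample(1:10, 3))   # "integer"
```

Note that this applies to seq() called with only a start and an end; other forms of seq() may return doubles.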
A tibble will show the difference (as does glimpse()).
my_data <- tibble(double_var = y, int_var = y_int)
my_data
## # A tibble: 5 × 2
##   double_var int_var
##        <dbl>   <int>
## 1       1.1        1
## 2       2          2
## 3       3.21       3
## 4       4.5        4
## 5       5.62       5
glimpse(my_data)
## Rows: 5
## Columns: 2
## $ double_var <dbl> 1.10, 2.00, 3.21, 4.50, 5.62
## $ int_var    <int> 1, 2, 3, 4, 5
A factor is a special character vector where the elements have pre-defined groups or ‘levels’. You can think of these as qualitative or categorical variables. Order is often important.
Examples:
Use the factor() function to create factors.
x <- c("small", "medium", "large", "medium", "large")
class(x)
## [1] "character"
x_fact <- factor(x)
class(x_fact)
## [1] "factor"
x_fact
## [1] small  medium large  medium large
## Levels: large medium small
Note that levels are, by default, in alphanumerical order!
Q: Why not use as.factor() ?
A: You can coerce with as.factor(). But you can’t specify levels! More on this soon.
You can find the unique levels of a factor vector with levels().
levels(x_fact)
## [1] "large" "medium" "small"
More on how to change the levels ordering in a lecture coming up!
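As a brief preview, and only a sketch, the levels argument of factor() sets the order explicitly. This is the part that as.factor() cannot do. The name x_ordered is hypothetical.

```r
x <- c("small", "medium", "large", "medium", "large")

# Specify the levels in a meaningful order instead of alphabetical order
x_ordered <- factor(x, levels = c("small", "medium", "large"))

levels(x_ordered)       # "small" "medium" "large"
as.numeric(x_ordered)   # 1 2 3 2 3
```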
Factors can be converted to numeric or character very easily.
x_fact
## [1] small  medium large  medium large
## Levels: large medium small
as.character(x_fact)
## [1] "small" "medium" "large" "medium" "large"
as.numeric(x_fact)
## [1] 3 2 1 2 1
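Note that as.numeric() on a factor returns the position of each level, not the original value. That matters when the levels themselves look like numbers. The sketch below uses a made-up factor f to show the difference, and converts through character to recover the values.

```r
f <- factor(c("10", "20", "10", "5"))   # hypothetical numeric-looking factor

levels(f)                               # "10" "20" "5" (sorted as text)
codes <- as.numeric(f)                  # level positions: 1 2 1 3
values <- as.numeric(as.character(f))   # original values: 10 20 10 5
```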
circ <- read_csv("https://jhudatascience.org/intro_to_r/data/Charm_City_Circulator_Ridership.csv")
## Rows: 1146 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(circ)
## # A tibble: 6 × 15
##   day       date  orangeBoardings orangeAlightings orangeAverage purpleBoardings
##   <chr>     <chr>           <dbl>            <dbl>         <dbl>           <dbl>
## 1 Monday    01/1…             877             1027          952               NA
## 2 Tuesday   01/1…             777              815          796               NA
## 3 Wednesday 01/1…            1203             1220         1212.              NA
## 4 Thursday  01/1…            1194             1233         1214.              NA
## 5 Friday    01/1…            1645             1643         1644               NA
## 6 Saturday  01/1…            1457             1524         1490.              NA
## # ℹ 9 more variables: purpleAlightings <dbl>, purpleAverage <dbl>,
## #   greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>,
## #   bannerBoardings <dbl>, bannerAlightings <dbl>, bannerAverage <dbl>,
## #   daily <dbl>
Say we want to change daily to be an integer. We would need to use mutate(). Let’s create a new column ‘daily_int’ so it is easier to see what is happening.
circ %>%
  mutate(daily_int = as.integer(daily)) %>%
  select(daily, daily_int)
## # A tibble: 1,146 × 2
##    daily daily_int
##    <dbl>     <int>
##  1  952        952
##  2  796        796
##  3 1212.      1211
##  4 1214.      1213
##  5 1644       1644
##  6 1490.      1490
##  7  888.       888
##  8 1000.       999
##  9 1035       1035
## 10 1396.      1395
## # ℹ 1,136 more rows
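Notice in the output above that as.integer() drops the fractional part rather than rounding (a value displayed as 1212. becomes 1211). If rounding is what you want, one option is to call round() first. A small sketch with made-up values:

```r
values <- c(1211.5, 2.7, -2.7)

truncated <- as.integer(values)         # drops the fraction: 1211 2 -2
rounded <- as.integer(round(values))    # rounds first:       1212 3 -3
```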
| Example | Class | Type | Notes |
|---|---|---|---|
| 1.1 | numeric | double | Default for numbers |
| 1 | integer | integer | Need to coerce to integer with as.integer() or use sample() or seq() with whole numbers |
| "FALSE", "Ball" | character | character | Need quotes |
| FALSE, TRUE | logical | logical | No quotes |
| "Small", "Large" | factor | integer | Need to coerce to factor with factor(); stored internally as integer codes |
* logical: TRUE or FALSE (without quotes)
* class() can be used to test the class of an object x
* as.CLASS_NAME(x) can be used to change the class of an object x

The two most popular R classes used when working with dates and times are:

* Date class representing a calendar date
* POSIXct class representing a calendar date with hours, minutes, seconds

We convert data from character to Date/POSIXct so that we can use functions that manipulate dates and date-times.
lubridate is a powerful, widely used R package from the tidyverse family for working with Date / POSIXct class objects.
Date class object
class("2021-06-15")
## [1] "character"
library(lubridate)
ymd("2021-06-15") # lubridate package Year Month Day
## [1] "2021-06-15"
class(ymd("2021-06-15")) # lubridate package
## [1] "Date"
class(date("2021-06-15")) # lubridate package
## [1] "Date"
Note for function ymd: year month day
mdy("06/15/2021")
## [1] "2021-06-15"
dmy("15-June-2021")
## [1] "2021-06-15"
ymd("2021-06-15")
## [1] "2021-06-15"
Must match the data format!
ymd("06/15/2021") # This doesn't work - gives NA
## Warning: All formats failed to parse. No formats found.
## [1] NA
mdy("06/15/2021") # This works
## [1] "2021-06-15"
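When a whole column is parsed, a few entries may not match the expected format. The sketch below shows one way to find them, assuming the quiet argument of mdy() is used to silence the parsing warning; the vector raw_dates is hypothetical.

```r
library(lubridate)

raw_dates <- c("06/15/2021", "unknown", "07/04/2021")   # hypothetical input

# quiet = TRUE suppresses the "failed to parse" warning
parsed <- mdy(raw_dates, quiet = TRUE)

parsed                              # "2021-06-15" NA "2021-07-04"
unparsed <- raw_dates[is.na(parsed)]
unparsed                            # "unknown"
```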
a <- ymd("2021-06-15")
b <- ymd("2021-06-18")
a - b
## Time difference of -3 days
class("2013-01-24 19:39:07")
## [1] "character"
ymd_hms("2013-01-24 19:39:07") # lubridate package
## [1] "2013-01-24 19:39:07 UTC"
ymd_hms("2013-01-24 19:39:07") %>% class()
## [1] "POSIXct" "POSIXt"
UTC represents time zone, by default: Coordinated Universal Time
Note for function ymd_hms: year month day hour minute second.
Note dates are always displayed year month day, even if made with mdy!
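If the times were not recorded in UTC, the tz argument of ymd_hms() sets the time zone, and with_tz() shows the same moment in another zone. The sketch below assumes, purely for illustration, that the example time was recorded in New York.

```r
library(lubridate)

# Assumption for illustration: the time was recorded in New York
ny <- ymd_hms("2013-01-24 19:39:07", tz = "America/New_York")

# Same moment, displayed in UTC (New York is 5 hours behind UTC in January)
utc <- with_tz(ny, "UTC")
utc   # "2013-01-25 00:39:07 UTC"
```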
circ_dates <- circ %>% select(date)
circ_dates <- circ_dates %>% mutate(date_formatted = mdy(date))
glimpse(circ_dates)
## Rows: 1,146
## Columns: 2
## $ date           <chr> "01/11/2010", "01/12/2010", "01/13/2010", "01/14/2010",…
## $ date_formatted <date> 2010-01-11, 2010-01-12, 2010-01-13, 2010-01-14, 2010-0…
circ_dates %>%
  mutate(year = year(date_formatted)) %>%
  mutate(month = month(date_formatted)) %>%
  glimpse()
## Rows: 1,146
## Columns: 4
## $ date           <chr> "01/11/2010", "01/12/2010", "01/13/2010", "01/14/2010",…
## $ date_formatted <date> 2010-01-11, 2010-01-12, 2010-01-13, 2010-01-14, 2010-0…
## $ year           <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
## $ month          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Two-dimensional classes are those we would often use to store data read from a file
* a data frame (data.frame or tibble class)
* a matrix (matrix class)

Unlike a data.frame or tibble, the entire matrix is composed of one R class: for example, all entries are numeric, or all entries are character.

Lists: can be created using list()
mylist <- list(c("A", "b", "c"), c(1, 2, 3))
mylist
## [[1]]
## [1] "A" "b" "c"
##
## [[2]]
## [1] 1 2 3
class(mylist)
## [1] "list"
* Coerce between classes using as.numeric() or as.character()
* Create Date class objects using the ymd(), mdy() functions from the lubridate package
See the extra slides for more advanced topics.
as.matrix() creates a matrix from a data frame or tibble (where all values are the same class).
circ_mat <- select(circ, contains("orange")) %>%
  head(n = 3)
circ_mat
## # A tibble: 3 × 3
##   orangeBoardings orangeAlightings orangeAverage
##             <dbl>            <dbl>         <dbl>
## 1             877             1027          952
## 2             777              815          796
## 3            1203             1220         1212.
as.matrix(circ_mat)
##      orangeBoardings orangeAlightings orangeAverage
## [1,]             877             1027         952.0
## [2,]             777              815         796.0
## [3,]            1203             1220        1211.5
matrix() creates a matrix from scratch.
matrix(1:6, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
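Because a matrix holds a single class, mixing numbers and strings in matrix() coerces everything to character, just as it does in a vector. A minimal sketch with made-up values:

```r
# Numbers and letters together: every entry becomes character
m <- matrix(c(1, "a", 2, "b"), ncol = 2)
m

typeof(m)        # "character"
class(m[1, 1])   # "character": the 1 is now the string "1"
```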
List elements can be named
mylist_named <- list(
  letters = c("A", "b", "c"),
  numbers = c(1, 2, 3),
  one_matrix = matrix(1:4, ncol = 2)
)
mylist_named
## $letters
## [1] "A" "b" "c"
##
## $numbers
## [1] 1 2 3
##
## $one_matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
Using lubridate to manipulate Date objects
x <- ymd(c("2021-06-15", "2021-07-15"))
x
## [1] "2021-06-15" "2021-07-15"
day(x) # see also: month(x) , year(x)
## [1] 15 15
x + days(10)
## [1] "2021-06-25" "2021-07-25"
x + months(1) + days(10)
## [1] "2021-07-25" "2021-08-25"
wday(x, label = TRUE)
## [1] Tue Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
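One thing to watch for when adding months: if the resulting day does not exist (for example, February 31), the result is NA. The lubridate operator %m+% rolls back to the last valid day of the month instead. A short sketch; the name jan31 is hypothetical.

```r
library(lubridate)

jan31 <- ymd("2021-01-31")

plain <- jan31 + months(1)       # NA, because February 31 does not exist
rolled <- jan31 %m+% months(1)   # "2021-02-28"
```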
Using lubridate to manipulate POSIXct objects
x <- ymd_hms("2013-01-24 19:39:07")
x
## [1] "2013-01-24 19:39:07 UTC"
date(x)
## [1] "2013-01-24"
x + hours(3)
## [1] "2013-01-24 22:39:07 UTC"
floor_date(x, "1 hour") # see also: ceiling_date()
## [1] "2013-01-24 19:00:00 UTC"
x1 <- ymd(c("2021-06-15"))
x2 <- ymd(c("2021-07-15"))
difftime(x2, x1, units = "weeks")
## Time difference of 4.285714 weeks
as.numeric(difftime(x2, x1, units = "weeks"))
## [1] 4.285714
The same can be done with times (e.g. difference in hours).
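For instance, a sketch of the difference in hours between two date-times, using made-up values t1 and t2:

```r
library(lubridate)

t1 <- ymd_hms("2013-01-24 19:39:07")
t2 <- ymd_hms("2013-01-25 07:09:07")

difftime(t2, t1, units = "hours")   # Time difference of 11.5 hours
hours_apart <- as.numeric(difftime(t2, t1, units = "hours"))
hours_apart                         # 11.5
```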
n <- 1:9
n
## [1] 1 2 3 4 5 6 7 8 9
mat <- matrix(n, nrow = 3)
mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
To get element(s) of a vector (one-dimensional object), use square brackets [ ]:
x <- c("a", "b", "c", "d", "e", "f", "g", "h")
x
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
x[2]
## [1] "b"
x[c(1, 2, 100)]
## [1] "a" "b" NA
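Square brackets accept more than single positions. The sketch below shows three other common forms (a range, a negative index, and a logical condition) on the same vector; these forms are additions for illustration.

```r
x <- c("a", "b", "c", "d", "e", "f", "g", "h")

middle <- x[2:4]                 # a range of positions: "b" "c" "d"
no_first <- x[-1]                # negative index drops the first element
ends <- x[x %in% c("a", "h")]    # logical condition keeps matches: "a" "h"
```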
Note you cannot use dplyr functions (like select) on matrices. To subset matrix rows and/or columns, use matrix[row_index, column_index].
mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
mat[1, 1] # individual entry: row 1, column 1
## [1] 1
mat[1, 2] # individual entry: row 1, column 2
## [1] 4
mat[1, ] # first row
## [1] 1 4 7
mat[, 1] # first column
## [1] 1 2 3
mat[c(1, 2), c(2, 3)] # subset of original matrix: two rows and two columns
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
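As the examples above show, selecting a single row or column returns a plain vector rather than a matrix. If you need to keep the matrix shape, base R's drop = FALSE argument does that. A sketch on the same matrix:

```r
mat <- matrix(1:9, nrow = 3)

row_vec <- mat[1, ]                 # plain vector: 1 4 7
dim(row_vec)                        # NULL (no dimensions)

row_mat <- mat[1, , drop = FALSE]   # still a matrix with 1 row, 3 columns
dim(row_mat)                        # 1 3
```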
You can reference data from a list using $ (if elements are named) or using [[ ]].
mylist_named[[1]]
## [1] "A" "b" "c"
mylist_named[["letters"]] # works only for a list with named elements
## [1] "A" "b" "c"
mylist_named$letters # works only for a list with named elements
## [1] "A" "b" "c"
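Single brackets also work on lists, but they return a smaller list rather than the element itself. The sketch below contrasts the two on a shortened version of the named list; the names single and double_br are hypothetical.

```r
mylist_named <- list(letters = c("A", "b", "c"), numbers = c(1, 2, 3))

single <- mylist_named["letters"]        # still a list, with one element
double_br <- mylist_named[["letters"]]   # the character vector itself

class(single)         # "list"
class(double_br)      # "character"
names(mylist_named)   # "letters" "numbers"
```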