* Character: strings or individual characters, quoted
* Numeric: any real number(s)
* Integer: any integer(s)/whole numbers (1, 2, 3)
* Double: any number with fractional values (1.2, 4.01, 1.00004)
* Factor: categorical/qualitative variables
* Logical: variables composed of TRUE or FALSE
* Date/POSIXct: represents calendar dates and times
We have already covered the character and numeric types.
class(c("tree", "cloud", "stars_&_sky"))
## [1] "character"
class(c(1, 4, 7))
## [1] "numeric"
This can also be a bit tricky: a vector holds only one class, so mixing numbers with quoted values makes every element character.
class(c(1, 2, "tree"))
## [1] "character"
class(c("1", "4", "7"))
## [1] "character"
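To make the rule behind these results concrete, here is a small sketch (the vector names are made up for illustration). When classes are mixed inside c(), R converts every element to the most flexible class present: logical gives way to numeric, and numeric gives way to character.

```r
# Hypothetical vectors, named only for this illustration
lgl_num <- c(TRUE, 2)        # logical + numeric -> numeric (TRUE becomes 1)
num_chr <- c(1, "tree")      # numeric + character -> character
lgl_chr <- c(TRUE, "tree")   # logical + character -> character

class(lgl_num)
class(num_chr)
class(lgl_chr)
```

Checking class() after building a vector is a quick way to catch an accidental quoted value.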
logical is a type that only has two possible elements: TRUE and FALSE.
x <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
class(x)
## [1] "logical"
Note that logical elements are NOT in quotes.
z <- c("TRUE", "FALSE", "TRUE", "FALSE")
class(z)
## [1] "character"
There is one useful function associated with practically all R classes: as.CLASS_NAME(x) coerces between classes. It turns x into a certain class.
Examples:
as.numeric()
as.character()
as.logical()
as.double()
as.integer()
as.Date()
as.factor() (More on this one later!)
Sometimes coercing works great!
as.character(4)
## [1] "4"
as.numeric(c("1", "4", "7"))
## [1] 1 4 7
as.logical(c("TRUE", "FALSE", "FALSE"))
## [1] TRUE FALSE FALSE
as.logical(0)
## [1] FALSE
When a value cannot be interpreted as the target class, R will return NA (an R constant representing “Not Available”, i.e. a missing value).
as.numeric(c("1", "4", "7a"))
## Warning: NAs introduced by coercion
## [1] 1 4 NA
as.logical(c("TRUE", "FALSE", "UNKNOWN"))
## [1] TRUE FALSE NA
as.Date(c("2021-06-15", "2021-06-32"))
## [1] "2021-06-15" NA
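Since a failed coercion only produces a warning, it can help to find out which values failed. The following is a minimal sketch with made-up input; the names raw, converted and failed are chosen only for this example.

```r
raw <- c("1", "4", "7a")                        # made-up input values
converted <- suppressWarnings(as.numeric(raw))  # silence the coercion warning

# Positions that were not missing before but are NA after coercion
failed <- is.na(converted) & !is.na(raw)
raw[failed]
```

Comparing against the original values separates NAs that were already in the data from NAs created by the conversion.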
There are two major number subclasses or types.
Double is equivalent to numeric. It is a number that can contain fractional values, with any number of places after the decimal (within the limits of double precision, about 15 significant digits).
Double stands for double-precision.
y <- c(1.1, 2.0, 3.21, 4.5, 5.62)
y
## [1] 1.10 2.00 3.21 4.50 5.62
class(y)
## [1] "numeric"
typeof(y)
## [1] "double"
The num() function of the tibble package can be used to change how numbers are formatted. See here for more: https://tibble.tidyverse.org/articles/numbers.html
Integer is a special number that contains only whole numbers.
y
## [1] 1.10 2.00 3.21 4.50 5.62
y_int <- as.integer(y)
y_int
## [1] 1 2 3 4 5
class(y_int)
## [1] "integer"
typeof(y_int)
## [1] "integer"
You can use the as.integer() function to create integers (unless they are read in as integers or created as such with seq() and sample()). Otherwise, numbers will be double by default.
x <- c(1, 2, 3, 4, 5) # whole numbers, but stored as double
class(x)
## [1] "numeric"
typeof(x)
## [1] "double"
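As a short sketch of the ways whole numbers can end up stored as integer rather than double (variable names are made up for this example), compare typeof() for each of the following. The L suffix is base R syntax for an integer literal.

```r
dbl_vals  <- c(1, 2, 3)      # plain numbers are stored as double
int_vals  <- c(1L, 2L, 3L)   # the L suffix asks for integer storage
colon_seq <- 1:3             # the colon operator returns integers
seq_vals  <- seq(1, 3)       # seq() with whole-number endpoints and no `by`

typeof(dbl_vals)
typeof(int_vals)
typeof(colon_seq)
typeof(seq_vals)
```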
A tibble will show the difference (as does glimpse()).
my_data <- tibble(double_var = y, int_var = y_int)
my_data
## # A tibble: 5 × 2
##   double_var int_var
##        <dbl>   <int>
## 1       1.1        1
## 2       2          2
## 3       3.21       3
## 4       4.5        4
## 5       5.62       5
glimpse(my_data)
## Rows: 5
## Columns: 2
## $ double_var <dbl> 1.10, 2.00, 3.21, 4.50, 5.62
## $ int_var    <int> 1, 2, 3, 4, 5
A factor is a special character vector where the elements have pre-defined groups or ‘levels’. You can think of these as qualitative or categorical variables. Order is often important.
Examples:
Use the factor() function to create factors.
x <- c("small", "medium", "large", "medium", "large")
class(x)
## [1] "character"
x_fact <- factor(x)
class(x_fact)
## [1] "factor"
x_fact
## [1] small  medium large  medium large
## Levels: large medium small
Note that levels are, by default, in alphanumerical order!
Q: Why not use as.factor()?
A: You can coerce with as.factor(). But you can’t specify levels! More on this soon.
You can list the unique levels of a factor vector with levels():
levels(x_fact)
## [1] "large" "medium" "small"
More on how to change the levels ordering in a lecture coming up!
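As a brief preview only, and not necessarily the approach the upcoming lecture takes, here is a sketch of how the levels argument of factor() sets the order explicitly. The sizes vector repeats the values used above; sizes_fact is a name made up for this example.

```r
sizes <- c("small", "medium", "large", "medium", "large")

# Default: levels are sorted alphabetically
levels(factor(sizes))

# Levels given explicitly, in the order we want
sizes_fact <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_fact)
```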
Factors can be converted to numeric or character very easily. Note that as.numeric() returns the position of each element's level, not the original text.
x_fact
## [1] small  medium large  medium large
## Levels: large medium small
as.character(x_fact)
## [1] "small" "medium" "large" "medium" "large"
as.numeric(x_fact)
## [1] 3 2 1 2 1
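Because as.numeric() on a factor gives level positions, factors whose labels look like numbers deserve extra care. A minimal sketch with made-up values (the years vector is hypothetical):

```r
years <- factor(c("2010", "2015", "2010", "2020"))  # made-up values

codes  <- as.numeric(years)                # level positions, not the years
values <- as.numeric(as.character(years))  # the numbers shown in the labels

codes
values
```

Going through as.character() first is one common way to recover the numbers written in the labels.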
library(jhur)
circ <- read_circulator()
## Rows: 1146 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): day, date
## dbl (13): orangeBoardings, orangeAlightings, orangeAverage, purpleBoardings,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(circ)
## # A tibble: 6 × 15
##   day       date  orangeBoardings orangeAlightings orangeAverage purpleBoardings
##   <chr>     <chr>           <dbl>            <dbl>         <dbl>           <dbl>
## 1 Monday    01/1…             877             1027          952               NA
## 2 Tuesday   01/1…             777              815          796               NA
## 3 Wednesday 01/1…            1203             1220         1212.              NA
## 4 Thursday  01/1…            1194             1233         1214.              NA
## 5 Friday    01/1…            1645             1643         1644               NA
## 6 Saturday  01/1…            1457             1524         1490.              NA
## # ℹ 9 more variables: purpleAlightings <dbl>, purpleAverage <dbl>,
## #   greenBoardings <dbl>, greenAlightings <dbl>, greenAverage <dbl>,
## #   bannerBoardings <dbl>, bannerAlightings <dbl>, bannerAverage <dbl>,
## #   daily <dbl>
Say we want to change daily to be an integer. We would need to use mutate(). Let’s create a new column ‘daily_int’ so it is easier to see what is happening.
circ %>%
  mutate(daily_int = as.integer(daily)) %>%
  select(daily, daily_int)
## # A tibble: 1,146 × 2
##    daily daily_int
##    <dbl>     <int>
##  1  952        952
##  2  796        796
##  3 1212.      1211
##  4 1214.      1213
##  5 1644       1644
##  6 1490.      1490
##  7  888.       888
##  8 1000.       999
##  9 1035       1035
## 10 1396.      1395
## # ℹ 1,136 more rows
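In this output, values such as 1212. become 1211 because as.integer() drops the fractional part rather than rounding. If rounding is what you want, one option is to call round() first. A minimal sketch with made-up numbers:

```r
vals <- c(2.7, 999.5, 1211.5)   # made-up values

truncated <- as.integer(vals)          # drops everything after the decimal point
rounded   <- as.integer(round(vals))   # rounds first, then stores as integer

truncated
rounded
```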
Example | Class | Type | Notes |
---|---|---|---|
1.1 | numeric | double | Default for numbers |
1 | integer | integer | Need to coerce to integer with as.integer() or use sample() or seq() with whole numbers |
“FALSE”, “Ball” | character | character | Need quotes |
FALSE, TRUE | logical | logical | No quotes |
“Small”, “Large” | factor | integer | Need to coerce to factor with factor() |
* Logical values are TRUE or FALSE (without quotes)
* class() can be used to test the class of an object x
* as.CLASS_NAME(x) can be used to change the class of an object x
Two-dimensional classes are those we would often use to store data read from a file:
* a data frame (data.frame or tibble class)
* a matrix (matrix class)
Unlike a data.frame or tibble, the entire matrix is composed of one R class: for example, all entries are numeric, or all entries are character.
Another class is lists. They can be created using list().
mylist <- list(c("A", "b", "c"), c(1, 2, 3))
mylist
## [[1]]
## [1] "A" "b" "c"
##
## [[2]]
## [1] 1 2 3
class(mylist)
## [1] "list"
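Unlike a vector or matrix, a list can hold elements of different classes and different lengths. A small sketch that inspects each element (the mixed list is made up for this example):

```r
# Hypothetical list with elements of different classes and lengths
mixed <- list(c("A", "b", "c"), c(1, 2), TRUE)

element_classes <- sapply(mixed, class)   # class of each element
element_lengths <- lengths(mixed)         # length of each element

element_classes
element_lengths
```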
The two most popular R classes used when working with dates and times are:
* Date class representing a calendar date
* POSIXct class representing a calendar date with hours, minutes, seconds
We convert data from character to Date/POSIXct to use functions that manipulate dates and date-times.
lubridate is a powerful, widely used R package from the “tidyverse” family for working with Date/POSIXct class objects.
Date class object
class("2021-06-15")
## [1] "character"
library(lubridate)
ymd("2021-06-15") # lubridate package: Year Month Day
## [1] "2021-06-15"
class(ymd("2021-06-15")) # lubridate package
## [1] "Date"
class(date("2021-06-15")) # lubridate package
## [1] "Date"
Note for function ymd: year month day
a <- ymd("2021-06-15")
b <- ymd("2021-06-18")
a - b
## Time difference of -3 days
Date class object
date() is picky…
date("06/15/2021") # This doesn't work, needs to be year month day
## Error in as.POSIXlt.character(x, tz = tz(x)): character string is not in a standard unambiguous format
mdy
mdy("06/15/2021") # This works
## [1] "2021-06-15"
mdy("06/15/21") # This works
## [1] "2021-06-15"
Note for function mdy: month day year
Must match the data format!
ymd("06/15/2021") # This doesn't work - gives NA
## Warning: All formats failed to parse. No formats found.
## [1] NA
mdy("06/15/2021") # This works
## [1] "2021-06-15"
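The same idea, that the parser must match the data format, also applies outside lubridate. As a sketch using base R's as.Date() with an explicit format string (an alternative to the lubridate functions shown here; the input values are made up):

```r
us_style <- c("06/15/2021", "12/01/2021")   # month/day/year text

parsed <- as.Date(us_style, format = "%m/%d/%Y")
parsed
class(parsed)

# A format that does not match the text gives NA
mismatched <- as.Date(us_style, format = "%Y-%m-%d")
mismatched
```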
POSIXct class object
class("2013-01-24 19:39:07")
## [1] "character"
ymd_hms("2013-01-24 19:39:07") # lubridate package
## [1] "2013-01-24 19:39:07 UTC"
class(ymd_hms("2013-01-24 19:39:07")) # lubridate package
## [1] "POSIXct" "POSIXt"
UTC is the time zone. By default this is Coordinated Universal Time.
Note for function ymd_hms: year month day hour minute second.
There are functions in case your data have only date, hour and minute (ymd_hm()) or only date and hour (ymd_h()).
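The time zone attached to a POSIXct value matters: the same clock reading in two zones is two different instants. The sketch below uses base R's as.POSIXct() rather than lubridate, and "America/New_York" is only an example zone chosen for illustration.

```r
x <- as.POSIXct("2013-01-24 19:39:07", tz = "UTC")

# Same instant, shown on a New York clock
ny_clock <- format(x, format = "%H:%M", tz = "America/New_York")
ny_clock

# Same clock reading, but interpreted as New York time: a different instant
x_ny <- as.POSIXct("2013-01-24 19:39:07", tz = "America/New_York")
gap_hours <- as.numeric(difftime(x_ny, x, units = "hours"))
gap_hours
```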
Note dates are always displayed year month day, even if made with mdy!
circ_dates <- circ %>% select(date)
circ_dates <- circ_dates %>% mutate(date_formatted = mdy(date))
glimpse(circ_dates)
## Rows: 1,146
## Columns: 2
## $ date           <chr> "01/11/2010", "01/12/2010", "01/13/2010", "01/14/2010",…
## $ date_formatted <date> 2010-01-11, 2010-01-12, 2010-01-13, 2010-01-14, 2010-0…
circ_dates %>%
  mutate(year = year(date_formatted)) %>%
  mutate(month = month(date_formatted)) %>%
  glimpse()
## Rows: 1,146
## Columns: 4
## $ date           <chr> "01/11/2010", "01/12/2010", "01/13/2010", "01/14/2010",…
## $ date_formatted <date> 2010-01-11, 2010-01-12, 2010-01-13, 2010-01-14, 2010-0…
## $ year           <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
## $ month          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
* Date class: create using the ymd(), mdy() functions from the lubridate package
* POSIXct class representing a calendar date with hours, minutes, seconds. Can use the ymd_hms() or ymd_hm() or ymd_h() functions from the lubridate package
* Can do arithmetic with Date or POSIXct class variables or pull out aspects like year
💻 Lab
as.matrix() creates a matrix from a data frame or tibble (where all values are the same class).
circ_mat <- select(circ, contains("orange")) %>% head(n = 3)
circ_mat
## # A tibble: 3 × 3
##   orangeBoardings orangeAlightings orangeAverage
##             <dbl>            <dbl>         <dbl>
## 1             877             1027          952
## 2             777              815          796
## 3            1203             1220         1212.
as.matrix(circ_mat)
##      orangeBoardings orangeAlightings orangeAverage
## [1,]             877             1027         952.0
## [2,]             777              815         796.0
## [3,]            1203             1220        1211.5
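Because a matrix holds a single class, converting a data frame with mixed column classes forces everything to character. A minimal sketch with a small made-up data frame (not the circulator data):

```r
df <- data.frame(day = c("Monday", "Tuesday"), riders = c(877, 777))  # made-up data

m <- as.matrix(df)
m
typeof(m)   # the numeric column was converted to text to fit one class
```

Selecting only the numeric columns first, as done above with contains("orange"), keeps the matrix numeric.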
matrix() creates a matrix from scratch.
matrix(1:6, ncol = 2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
List elements can be named
mylist_named <- list(
  letters = c("A", "b", "c"),
  numbers = c(1, 2, 3),
  one_matrix = matrix(1:4, ncol = 2)
)
mylist_named
## $letters
## [1] "A" "b" "c"
##
## $numbers
## [1] 1 2 3
##
## $one_matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
lubridate to manipulate Date objects
x <- ymd(c("2021-06-15", "2021-07-15"))
x
## [1] "2021-06-15" "2021-07-15"
day(x) # see also: month(x) , year(x)
## [1] 15 15
x + days(10)
## [1] "2021-06-25" "2021-07-25"
x + months(1) + days(10)
## [1] "2021-07-25" "2021-08-25"
wday(x, label = TRUE)
## [1] Tue Thu
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
lubridate to manipulate POSIXct objects
x <- ymd_hms("2013-01-24 19:39:07")
x
## [1] "2013-01-24 19:39:07 UTC"
date(x)
## [1] "2013-01-24"
x + hours(3)
## [1] "2013-01-24 22:39:07 UTC"
floor_date(x, "1 hour") # see also: ceiling_date()
## [1] "2013-01-24 19:00:00 UTC"
x1 <- ymd(c("2021-06-15"))
x2 <- ymd(c("2021-07-15"))
difftime(x2, x1, units = "weeks")
## Time difference of 4.285714 weeks
as.numeric(difftime(x2, x1, units = "weeks"))
## [1] 4.285714
The same can be done with times (e.g. a difference in hours).
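As a sketch of the time-based version, difftime() also accepts POSIXct values and units = "hours". The two timestamps below are made up, and base R's as.POSIXct() is used here in place of ymd_hms().

```r
start <- as.POSIXct("2013-01-24 19:39:07", tz = "UTC")
end   <- as.POSIXct("2013-01-25 01:09:07", tz = "UTC")

difftime(end, start, units = "hours")
elapsed_hours <- as.numeric(difftime(end, start, units = "hours"))
elapsed_hours
```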
n <- 1:9
n
## [1] 1 2 3 4 5 6 7 8 9
mat <- matrix(n, nrow = 3)
mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
To get element(s) of a vector (one-dimensional object), use [ ]:
x <- c("a", "b", "c", "d", "e", "f", "g", "h")
x
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
x[2]
## [1] "b"
x[c(1, 2, 100)]
## [1] "a" "b" NA
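Besides single positions, [ ] also accepts ranges, negative positions (to drop elements) and logical conditions. A short sketch using the same letters; the result names are made up for this example.

```r
x <- c("a", "b", "c", "d", "e", "f", "g", "h")

first_three   <- x[1:3]                 # a range of positions
without_first <- x[-1]                  # everything except the first element
matches       <- x[x %in% c("b", "z")]  # only elements found in the given set

first_three
without_first
matches
```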
Note you cannot use dplyr functions (like select) on matrices. To subset matrix rows and/or columns, use matrix[row_index, column_index].
mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
mat[1, 1] # individual entry: row 1, column 1
## [1] 1
mat[1, 2] # individual entry: row 1, column 2
## [1] 4
mat[1, ] # first row
## [1] 1 4 7
mat[, 1] # first column
## [1] 1 2 3
mat[c(1, 2), c(2, 3)] # subset of original matrix: two rows and two columns
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
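Notice in the examples above that selecting a single row or column returns a plain vector. If you need the result to stay a matrix, base R's drop = FALSE argument keeps the dimensions. A minimal sketch (row_vec and row_mat are made-up names):

```r
mat <- matrix(1:9, nrow = 3)

row_vec <- mat[1, ]                 # a plain vector, dimensions dropped
row_mat <- mat[1, , drop = FALSE]   # still a matrix with 1 row and 3 columns

dim(row_vec)
dim(row_mat)
```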
You can reference data from a list using $ (if elements are named) or using [[ ]].
mylist_named[[1]]
## [1] "A" "b" "c"
mylist_named[["letters"]] # works only if the list elements are named
## [1] "A" "b" "c"
mylist_named$letters # works only if the list elements are named
## [1] "A" "b" "c"
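Single brackets also work on lists, but they return a smaller list rather than the element itself. A short sketch of the difference, rebuilding a reduced version of mylist_named so the example is self-contained (inner and outer are made-up names):

```r
mylist_named <- list(letters = c("A", "b", "c"), numbers = c(1, 2, 3))

inner <- mylist_named[["letters"]]  # the element itself
outer <- mylist_named["letters"]    # a list of length one holding that element

class(inner)
class(outer)
```

Use [[ ]] or $ when you want to work with the contents, and [ ] when you want to keep a list.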