Load the libraries needed in this lab. Then, read in the following child mortality data using the read_mortality() function from the jhur package. Assign it to the “mort” variable. Change its first column name from ...1 to country. You can also find the data here: https://jhudatascience.org/intro_to_r/data/mortality.csv. Note that the data has lots of NA values - don’t worry if you see that.
library(dplyr)
library(jhur)
library(broom)
mort <- read_mortality()
## New names:
## • `` -> `...1`
## Rows: 197 Columns: 255
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): ...1
## dbl (254): 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770,...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mort <- mort %>%
rename(country = `...1`)
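If the jhur package is not available, a minimal sketch of an equivalent approach using readr and the URL above (read_csv() names the blank first column ...1 in the same way):
# Alternative: read the CSV directly from the course website
library(readr)
mort <- read_csv("https://jhudatascience.org/intro_to_r/data/mortality.csv")
mort <- rename(mort, country = `...1`)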
Compute the correlation (with cor) between the 2006 and 2007 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the pull() function to first extract these columns. To use a column name in pull() that starts with a number, surround it with backticks. Then, use the cor function.
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
cor(x, y)
## [1] 0.9995124
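The same result can be computed in one pipeline without intermediate objects; a sketch, where cor_2006_2007 is just an illustrative name:
# Equivalent: compute the correlation inside summarize()
mort %>%
  summarize(cor_2006_2007 = cor(`2006`, `2007`))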
Compute the correlation (with cor) between the 1980, 1990, 2000, and 2010 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the select() function to first subset the data frame to keep only these four columns. To use a column name in select() that starts with a number, surround it with backticks. How do the results change when we use the use = "complete.obs" argument?
mort_sub <-
mort %>%
select(`1980`, `1990`, `2000`, `2010`)
cor(mort_sub)
## 1980 1990 2000 2010
## 1980 1.0000000 0.9601540 0.8888411 NA
## 1990 0.9601540 1.0000000 0.9613842 NA
## 2000 0.8888411 0.9613842 1.0000000 NA
## 2010 NA NA NA 1
cor(mort_sub, use = "complete.obs")
## 1980 1990 2000 2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000
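With use = "complete.obs", any country with an NA in any of the four years is dropped before computing every correlation, which is why the NAs disappear. Another built-in option is "pairwise.complete.obs", which keeps, for each pair of years, all countries complete on that pair:
# Each correlation uses all rows that are complete for that pair of years
cor(mort_sub, use = "pairwise.complete.obs")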
Perform a t-test to determine if there is evidence of a difference between child mortality in 1987 versus 2007. Use the pull() function to extract these columns. Print the results using the tidy function from the broom package.
x <- pull(mort, `1987`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 4.8283, df = 348.16, p-value = 2.065e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1991780 0.4729767
## sample estimates:
## mean of x mean of y
## 0.7700633 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.336 0.770 0.434 4.83 0.00000207 348. 0.199 0.473
## # ℹ 2 more variables: method <chr>, alternative <chr>
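Since the 1987 and 2007 values describe the same countries, a paired test is also worth knowing about (not part of the exercise; a sketch only):
# Paired version: tests the mean within-country change instead
tidy(t.test(x, y, paired = TRUE))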
Perform a t-test to determine if there is evidence of a difference between child mortality in 2006 versus 2007. Use the pull() function to extract these columns. Print the results using the tidy function. How do these results compare to those in question 1.4?
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 0.30527, df = 391.32, p-value = 0.7603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09485069 0.12972006
## sample estimates:
## mean of x mean of y
## 0.4514207 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0174 0.451 0.434 0.305 0.760 391. -0.0949 0.130
## # ℹ 2 more variables: method <chr>, alternative <chr>
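Here the p-value is large (about 0.76), so there is no evidence of a difference between 2006 and 2007, in contrast to the clear difference between 1987 and 2007 in question 1.4.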
Read in the Kaggle cars auction dataset using read_kaggle() from the jhur package. Assign it to the “cars” variable. You can also find the data here: http://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv.
cars <- read_kaggle()
Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and whether it’s an online sale (“IsOnlineSale”) as the predictor. Save the model fit in an object called “lmfit_cars” and display the summary table with summary().
# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_cars <- glm(VehBCost ~ IsOnlineSale, data = cars)
summary(lmfit_cars)
##
## Call:
## glm(formula = VehBCost ~ IsOnlineSale, data = cars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6721.220 6.624 1014.622 <2e-16 ***
## IsOnlineSale 384.260 41.664 9.223 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3121685)
##
## Null deviance: 2.2809e+11 on 72982 degrees of freedom
## Residual deviance: 2.2782e+11 on 72981 degrees of freedom
## AIC: 1298500
##
## Number of Fisher Scoring iterations: 2
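A glm() call with the default gaussian family fits the same model as lm(). For the coefficient table as a data frame, broom’s tidy() (loaded above) works on the fit directly; a sketch:
# Coefficient table in tidy form
tidy(lmfit_cars)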
Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and vehicle age (“VehicleAge”) and whether it’s an online sale (“IsOnlineSale”) as predictors. Save the model fit in an object called “lmfit_cars_2” and display the summary table.
lmfit_cars_2 <- glm(VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
summary(lmfit_cars_2)
##
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8067.47 16.44 490.705 < 2e-16 ***
## VehicleAge -321.80 3.63 -88.639 < 2e-16 ***
## IsOnlineSale 297.31 39.60 7.508 6.07e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2818312)
##
## Null deviance: 2.2809e+11 on 72982 degrees of freedom
## Residual deviance: 2.0568e+11 on 72980 degrees of freedom
## AIC: 1291040
##
## Number of Fisher Scoring iterations: 2
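If you prefer the classic linear-model output, an equivalent fit with lm() is possible; lmfit_cars_2_lm is just an illustrative name:
# Same model via lm(); summary() then also reports R-squared
lmfit_cars_2_lm <- lm(VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
summary(lmfit_cars_2_lm)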
Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome with predictors: (1) vehicle age (“VehicleAge”), (2) whether it’s an online sale (“IsOnlineSale”), and (3) the interaction between “VehicleAge” and “IsOnlineSale”. Use VehicleAge * IsOnlineSale in the formula. Save the model fit in an object called “lmfit_cars_3” and display the summary table with summary().
lmfit_cars_3 <- glm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge * IsOnlineSale, data = cars)
summary(lmfit_cars_3)
##
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge *
## IsOnlineSale, data = cars)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8062.702 16.587 486.085 < 2e-16 ***
## VehicleAge -320.662 3.668 -87.413 < 2e-16 ***
## IsOnlineSale 514.308 107.711 4.775 1.8e-06 ***
## VehicleAge:IsOnlineSale -55.373 25.561 -2.166 0.0303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2818169)
##
## Null deviance: 2.2809e+11 on 72982 degrees of freedom
## Residual deviance: 2.0567e+11 on 72979 degrees of freedom
## AIC: 1291037
##
## Number of Fisher Scoring iterations: 2
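Because * already expands to both main effects plus their interaction, the formula above can be written more compactly; a sketch (lmfit_cars_3b is just an illustrative name):
# Equivalent, shorter formula: `*` expands to main effects + interaction
lmfit_cars_3b <- glm(VehBCost ~ VehicleAge * IsOnlineSale, data = cars)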
Fit a logistic regression model where the outcome is “bad buy” (“IsBadBuy”) status and the predictors are the cost (“VehBCost”) and vehicle age (“VehicleAge”). Save the model fit in an object called “logfit_cars” and display the summary table with summary().
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
logfit_cars <- glm(IsBadBuy ~ VehBCost + VehicleAge, data = cars, family = binomial(link = "logit"))
summary(logfit_cars)
##
## Call:
## glm(formula = IsBadBuy ~ VehBCost + VehicleAge, family = binomial(link = "logit"),
## data = cars)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.525e+00 6.379e-02 -39.58 <2e-16 ***
## VehBCost -9.091e-05 6.877e-06 -13.22 <2e-16 ***
## VehicleAge 2.569e-01 6.867e-03 37.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54421 on 72982 degrees of freedom
## Residual deviance: 52264 on 72980 degrees of freedom
## AIC: 52270
##
## Number of Fisher Scoring iterations: 5
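The coefficients above are on the log-odds scale. A common follow-up is to exponentiate them into odds ratios, which tidy() supports directly; a sketch:
# Odds ratios with 95% confidence intervals
tidy(logfit_cars, exponentiate = TRUE, conf.int = TRUE)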