Part 1
- Load the libraries needed in this lab. Then, read in the following child mortality data using the `read_mortality()` function from the `jhur` package. Assign it to the “mort” variable. Change its first column name from `...1` to `country`. You can also find the data here: https://jhudatascience.org/intro_to_R_class/data/mortality.csv
# load the packages used in this lab
library(dplyr)
library(jhur)
library(broom)

# read in the mortality data and rename the first column
mort <- read_mortality()
mort <- mort %>%
  rename(country = `...1`)
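As an aside, the same renaming can be done in base R by assigning to `colnames()`; a minimal sketch, equivalent to the `rename()` call above:
# base R alternative: overwrite the first column name directly
colnames(mort)[1] <- "country"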
- Compute the correlation between the `2006` and `2007` mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the `pull()` function to first extract these columns. To use a column name in `pull()` that starts with a number, surround it with backticks (`). Then, use the `cor()` function.
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
cor(x, y)
## [1] 0.9995124
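`cor()` computes the Pearson correlation by default; if a rank-based measure is of interest, it also accepts a `method` argument. An optional extra, not required by the exercise:
# Spearman (rank-based) correlation between the same two years
cor(x, y, method = "spearman")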
- Compute the correlation between the `1980`, `1990`, `2000`, and `2010` mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the `select()` function to first subset the data frame to keep only these four columns. To use a column name in `select()` that starts with a number, surround it with backticks (`).
mort_sub <- mort %>%
  select(`1980`, `1990`, `2000`, `2010`)
cor(mort_sub)
## 1980 1990 2000 2010
## 1980 1.0000000 0.9601540 0.8888411 NA
## 1990 0.9601540 1.0000000 0.9613842 NA
## 2000 0.8888411 0.9613842 1.0000000 NA
## 2010 NA NA NA 1
cor(mort_sub, use = "complete.obs")
## 1980 1990 2000 2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000
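The `NA` entries in the first matrix appear because the `2010` column contains missing values, and `cor()` propagates them by default; `use = "complete.obs"` drops the incomplete rows before computing anything. A quick way to check where the missing values are, as an optional sanity check:
# count missing values per column of the subset
colSums(is.na(mort_sub))
# cor() also accepts use = "pairwise.complete.obs" to drop rows pair by pair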
Part 2
- Perform a t-test to determine if there is evidence of a difference between child mortality in `1987` versus `2007`. Use the `pull()` function to extract these columns. Print the results using the `tidy()` function.
x <- pull(mort, `1987`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 4.8283, df = 348.16, p-value = 2.065e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1991780 0.4729767
## sample estimates:
## mean of x mean of y
## 0.7700633 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.336 0.770 0.434 4.83 0.00000207 348. 0.199 0.473
## # … with 2 more variables: method <chr>, alternative <chr>
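Because each row is the same country measured in both years, a paired test is arguably more appropriate than the two-sample Welch test; `t.test()` supports this through its `paired` argument. A sketch, offered as an extension rather than part of the exercise:
# paired t-test: compares within-country differences between 1987 and 2007
tidy(t.test(x, y, paired = TRUE))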
- Perform a t-test to determine if there is evidence of a difference between child mortality in `2006` versus `2007`. Use the `pull()` function to extract these columns. Print the results using the `tidy()` function. How do these results compare to those in the previous question?
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 0.30527, df = 391.32, p-value = 0.7603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09485069 0.12972006
## sample estimates:
## mean of x mean of y
## 0.4514207 0.4339860
tidy( t.test(x, y) )
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0174 0.451 0.434 0.305 0.760 391. -0.0949 0.130
## # … with 2 more variables: method <chr>, alternative <chr>
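Here the p-value is large (0.760), so unlike the 1987 versus 2007 comparison, there is no evidence of a difference in mean child mortality between the adjacent years 2006 and 2007, even though the two columns are almost perfectly correlated. Stacking the two `tidy()` results makes the comparison easy to read; a small convenience, not required by the exercise:
# place both t-test summaries in one tibble for side-by-side comparison
bind_rows(
  "1987 vs 2007" = tidy(t.test(pull(mort, `1987`), pull(mort, `2007`))),
  "2006 vs 2007" = tidy(t.test(pull(mort, `2006`), pull(mort, `2007`))),
  .id = "comparison"
)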
Part 3
- Read in the Kaggle cars auction dataset using the `read_kaggle()` function from the `jhur` package. Assign it to the “cars” variable. You can also find the data here: http://jhudatascience.org/intro_to_R_class/data/kaggleCarAuction.csv
cars <- read_kaggle()
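Before fitting models it can help to peek at the columns involved; `glimpse()` from dplyr prints each column’s name and type. An optional check, not part of the exercise:
# compact overview of the columns used in the models below
glimpse(cars)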
- Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and whether it’s an online sale (“IsOnlineSale”) as the predictor. Save the model fit in an object called “lmfit_cars” and display the summary table.
lmfit_cars <- lm(VehBCost ~ IsOnlineSale, data = cars)
summary(lmfit_cars)
##
## Call:
## lm(formula = VehBCost ~ IsOnlineSale, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6720 -1296 -21 1179 38748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6721.220 6.624 1014.622 <2e-16 ***
## IsOnlineSale 384.260 41.664 9.223 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1767 on 72981 degrees of freedom
## Multiple R-squared: 0.001164, Adjusted R-squared: 0.00115
## F-statistic: 85.06 on 1 and 72981 DF, p-value: < 2.2e-16
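The `tidy()` function from broom used in Part 2 also works on `lm` fits, returning the coefficient table as a tibble rather than printed text, which is handy for further manipulation. An optional extra:
# coefficient table as a data frame instead of printed summary output
tidy(lmfit_cars)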
- Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and vehicle age (“VehicleAge”) and whether it’s an online sale (“IsOnlineSale”) as predictors. Save the model fit in an object called “lmfit_cars_2” and display the summary table.
lmfit_cars_2 <- lm(VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
summary(lmfit_cars_2)
##
## Call:
## lm(formula = VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6877 -1242 -139 998 38045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8067.47 16.44 490.705 < 2e-16 ***
## VehicleAge -321.80 3.63 -88.639 < 2e-16 ***
## IsOnlineSale 297.31 39.60 7.508 6.07e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1679 on 72980 degrees of freedom
## Multiple R-squared: 0.09825, Adjusted R-squared: 0.09822
## F-statistic: 3976 on 2 and 72980 DF, p-value: < 2.2e-16
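Here each additional year of vehicle age is associated with roughly a $322 decrease in cost, holding online-sale status fixed. Confidence intervals for the coefficients come from `confint()`; an optional follow-up:
# 95% confidence intervals for the adjusted model's coefficients
confint(lmfit_cars_2)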
- Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome with predictors: (1) vehicle age (“VehicleAge”), (2) whether it’s an online sale (“IsOnlineSale”), and (3) the interaction between “VehicleAge” and “IsOnlineSale”.
- To include the interaction, use `VehicleAge * IsOnlineSale` in the formula.
- Save the model fit in an object called “lmfit_cars_3” and display the summary table.
lmfit_cars_3 <- lm(VehBCost ~ VehicleAge * IsOnlineSale, data = cars)
summary(lmfit_cars_3)
##
## Call:
## lm(formula = VehBCost ~ VehicleAge * IsOnlineSale, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6876 -1241 -140 999 38048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8062.702 16.587 486.085 < 2e-16 ***
## VehicleAge -320.662 3.668 -87.413 < 2e-16 ***
## IsOnlineSale 514.308 107.711 4.775 1.8e-06 ***
## VehicleAge:IsOnlineSale -55.373 25.561 -2.166 0.0303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1679 on 72979 degrees of freedom
## Multiple R-squared: 0.0983, Adjusted R-squared: 0.09827
## F-statistic: 2652 on 3 and 72979 DF, p-value: < 2.2e-16
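The interaction term means the age slope differs by sale type: for in-person sales the estimated slope is about -321 per year, while for online sales it is the sum of the main effect and the interaction. A quick computation from the fitted object, as a sketch:
# age slope for online sales = main effect + interaction (roughly -376)
coef(lmfit_cars_3)["VehicleAge"] + coef(lmfit_cars_3)["VehicleAge:IsOnlineSale"]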
- Fit a logistic regression model where the outcome is “bad buy” (“IsBadBuy”) status and predictors are the cost (“VehBCost”) and vehicle age (“VehicleAge”).
- Save the model fit in an object called “logfit_cars” and display the summary table.
logfit_cars <- glm(IsBadBuy ~ VehBCost + VehicleAge, data = cars, family = binomial())
summary(logfit_cars)
##
## Call:
## glm(formula = IsBadBuy ~ VehBCost + VehicleAge, family = binomial(),
## data = cars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0368 -0.5387 -0.4523 -0.3705 3.5062
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.525e+00 6.379e-02 -39.58 <2e-16 ***
## VehBCost -9.091e-05 6.877e-06 -13.22 <2e-16 ***
## VehicleAge 2.569e-01 6.867e-03 37.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54421 on 72982 degrees of freedom
## Residual deviance: 52264 on 72980 degrees of freedom
## AIC: 52270
##
## Number of Fisher Scoring iterations: 5
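Logistic regression coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. `tidy()` can do this directly; an optional extension beyond the exercise:
# odds ratios with 95% confidence intervals
tidy(logfit_cars, exponentiate = TRUE, conf.int = TRUE)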