Part 1
- Load the libraries needed in this lab. Then, read in the following child mortality data using the `read_mortality()` function from the `jhur` package. Assign it to the “mort” variable. Change its first column name from `...1` to `country`. You can also find the data here: https://jhudatascience.org/intro_to_R_class/data/mortality.csv
# load the packages used in this lab
library(dplyr)
library(jhur)
library(broom)

# read in the mortality data and rename the first column
mort <- read_mortality()
mort <- mort %>%
  rename(country = `...1`)
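As an aside, the same renaming can be done in base R by assigning to `colnames()`; a minimal sketch, equivalent to the `rename()` call above:
# base R alternative: overwrite the first column name directly
colnames(mort)[1] <- "country"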
- Compute the correlation between the `2006` and `2007` mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the `pull()` function to first extract these columns. To use a column name in `pull()` that starts with a number, surround it with backticks (`). Then, use the `cor()` function.
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
cor(x, y)
## [1] 0.9995124
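`cor()` computes the Pearson correlation by default; if a rank-based measure is of interest, it also accepts a `method` argument. An optional extra, not required by the exercise:
# Spearman (rank-based) correlation between the same two years
cor(x, y, method = "spearman")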
- Compute the correlation between the `1980`, `1990`, `2000`, and `2010` mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the `select()` function to first subset the data frame to keep only these four columns. To use a column name in `select()` that starts with a number, surround it with backticks (`).
mort_sub <- mort %>%
  select(`1980`, `1990`, `2000`, `2010`)
cor(mort_sub)
## 1980 1990 2000 2010
## 1980 1.0000000 0.9601540 0.8888411 NA
## 1990 0.9601540 1.0000000 0.9613842 NA
## 2000 0.8888411 0.9613842 1.0000000 NA
## 2010 NA NA NA 1
cor(mort_sub, use = "complete.obs")
## 1980 1990 2000 2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000
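The `NA` entries in the first matrix appear because the `2010` column contains missing values, and `cor()` propagates them by default; `use = "complete.obs"` drops the incomplete rows before computing anything. A quick way to check where the missing values are, as an optional sanity check:
# count missing values per column of the subset
colSums(is.na(mort_sub))
# cor() also accepts use = "pairwise.complete.obs" to drop rows pair by pair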
Part 2
- Perform a t-test to determine if there is evidence of a difference between child mortality in `1987` versus `2007`. Use the `pull()` function to extract these columns. Print the results using the `tidy()` function.
x <- pull(mort, `1987`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 4.8283, df = 348.16, p-value = 2.065e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1991780 0.4729767
## sample estimates:
## mean of x mean of y
## 0.7700633 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.336 0.770 0.434 4.83 0.00000207 348. 0.199 0.473
## # … with 2 more variables: method <chr>, alternative <chr>
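Because each row is the same country measured in both years, a paired test is arguably more appropriate than the two-sample Welch test; `t.test()` supports this through its `paired` argument. A sketch, offered as an extension rather than part of the exercise:
# paired t-test: compares within-country differences between 1987 and 2007
tidy(t.test(x, y, paired = TRUE))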
- Perform a t-test to determine if there is evidence of a difference between child mortality in `2006` versus `2007`. Use the `pull()` function to extract these columns. Print the results using the `tidy()` function. How do these results compare to those in the previous question?
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 0.30527, df = 391.32, p-value = 0.7603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09485069 0.12972006
## sample estimates:
## mean of x mean of y
## 0.4514207 0.4339860
tidy( t.test(x, y) )
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0174 0.451 0.434 0.305 0.760 391. -0.0949 0.130
## # … with 2 more variables: method <chr>, alternative <chr>
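Here the p-value is large (0.760), so unlike the 1987 versus 2007 comparison, there is no evidence of a difference in mean child mortality between the adjacent years 2006 and 2007, even though the two columns are almost perfectly correlated. Stacking the two `tidy()` results makes the comparison easy to read; a small convenience, not required by the exercise:
# place both t-test summaries in one tibble for side-by-side comparison
bind_rows(
  "1987 vs 2007" = tidy(t.test(pull(mort, `1987`), pull(mort, `2007`))),
  "2006 vs 2007" = tidy(t.test(pull(mort, `2006`), pull(mort, `2007`))),
  .id = "comparison"
)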
Part 3
- Read in the Kaggle cars auction dataset using the `read_kaggle()` function from the `jhur` package. Assign it to the “cars” variable. You can also find the data here: http://jhudatascience.org/intro_to_R_class/data/kaggleCarAuction.csv
cars <- read_kaggle()
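Before fitting models it can help to peek at the columns involved; `glimpse()` from dplyr prints each column’s name and type. An optional check, not part of the exercise:
# compact overview of the columns used in the models below
glimpse(cars)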
- Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and whether it’s an online sale (“IsOnlineSale”) as the predictor. Save the model fit in an object called “lmfit_cars” and display the summary table.
lmfit_cars <- lm(VehBCost ~ IsOnlineSale, data = cars)
summary(lmfit_cars)
##
## Call:
## lm(formula = VehBCost ~ IsOnlineSale, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6720 -1296 -21 1179 38748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6721.220 6.624 1014.622 <2e-16 ***
## IsOnlineSale 384.260 41.664 9.223 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1767 on 72981 degrees of freedom
## Multiple R-squared: 0.001164, Adjusted R-squared: 0.00115
## F-statistic: 85.06 on 1 and 72981 DF, p-value: < 2.2e-16
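The `tidy()` function from broom used in Part 2 also works on `lm` fits, returning the coefficient table as a tibble rather than printed text, which is handy for further manipulation. An optional extra:
# coefficient table as a data frame instead of printed summary output
tidy(lmfit_cars)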
- Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and vehicle age (“VehicleAge”) and whether it’s an online sale (“IsOnlineSale”) as predictors. Save the model fit in an object called “lmfit_cars_2” and display the summary table.
lmfit_cars_2 <- lm(VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
summary(lmfit_cars_2)
##
## Call:
## lm(formula = VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6877 -1242 -139 998 38045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8067.47 16.44 490.705 < 2e-16 ***
## VehicleAge -321.80 3.63 -88.639 < 2e-16 ***
## IsOnlineSale 297.31 39.60 7.508 6.07e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1679 on 72980 degrees of freedom
## Multiple R-squared: 0.09825, Adjusted R-squared: 0.09822
## F-statistic: 3976 on 2 and 72980 DF, p-value: < 2.2e-16
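Here each additional year of vehicle age is associated with roughly a $322 decrease in cost, holding online-sale status fixed. Confidence intervals for the coefficients come from `confint()`; an optional follow-up:
# 95% confidence intervals for the adjusted model's coefficients
confint(lmfit_cars_2)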
- Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome with predictors: (1) vehicle age (“VehicleAge”), (2) whether it’s an online sale (“IsOnlineSale”), and (3) the interaction between “VehicleAge” and “IsOnlineSale”.
- To include the interaction, use `VehicleAge * IsOnlineSale` in the formula.
- Save the model fit in an object called “lmfit_cars_3” and display the summary table.
lmfit_cars_3 <- lm(VehBCost ~ VehicleAge * IsOnlineSale, data = cars)
summary(lmfit_cars_3)
##
## Call:
## lm(formula = VehBCost ~ VehicleAge * IsOnlineSale, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6876 -1241 -140 999 38048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8062.702 16.587 486.085 < 2e-16 ***
## VehicleAge -320.662 3.668 -87.413 < 2e-16 ***
## IsOnlineSale 514.308 107.711 4.775 1.8e-06 ***
## VehicleAge:IsOnlineSale -55.373 25.561 -2.166 0.0303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1679 on 72979 degrees of freedom
## Multiple R-squared: 0.0983, Adjusted R-squared: 0.09827
## F-statistic: 2652 on 3 and 72979 DF, p-value: < 2.2e-16
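The interaction term means the age slope differs by sale type: for in-person sales the estimated slope is about -321 per year, while for online sales it is the sum of the main effect and the interaction. A quick computation from the fitted object, as a sketch:
# age slope for online sales = main effect + interaction (roughly -376)
coef(lmfit_cars_3)["VehicleAge"] + coef(lmfit_cars_3)["VehicleAge:IsOnlineSale"]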
- Fit a logistic regression model where the outcome is “bad buy” (“IsBadBuy”) status and predictors are the cost (“VehBCost”) and vehicle age (“VehicleAge”).
- Save the model fit in an object called “logfit_cars” and display the summary table.
logfit_cars <- glm(IsBadBuy ~ VehBCost + VehicleAge, data = cars, family = binomial())
summary(logfit_cars)
##
## Call:
## glm(formula = IsBadBuy ~ VehBCost + VehicleAge, family = binomial(),
## data = cars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0368 -0.5387 -0.4523 -0.3705 3.5062
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.525e+00 6.379e-02 -39.58 <2e-16 ***
## VehBCost -9.091e-05 6.877e-06 -13.22 <2e-16 ***
## VehicleAge 2.569e-01 6.867e-03 37.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54421 on 72982 degrees of freedom
## Residual deviance: 52264 on 72980 degrees of freedom
## AIC: 52270
##
## Number of Fisher Scoring iterations: 5
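Logistic regression coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. `tidy()` can do this directly; an optional extension beyond the exercise:
# odds ratios with 95% confidence intervals
tidy(logfit_cars, exponentiate = TRUE, conf.int = TRUE)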