Part 1

1.1

Load the packages needed in this lab. Then, read in the following child mortality data by country. Assign it to the “mort” variable. Change its first column name from ...1 into country. You can find the data here: https://jhudatascience.org/intro_to_r/data/mortality.csv.

Note that the data has lots of NA values - don’t worry if you see that.

library(tidyverse)
library(broom)
mort <- read_csv("https://jhudatascience.org/intro_to_r/data/mortality.csv")
## New names:
## Rows: 197 Columns: 255
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (254): 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768,
## 1769, 1770,...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
mort <- mort %>%
  rename(country = `...1`)

1.2

Compute the correlation (with cor) between the 2006 and 2007 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the pull() function to first extract these columns. To use a column name in pull() that starts with a number, surround it with backticks. Then, use the cor function.

x <- pull(mort, `2006`)
y <- pull(mort, `2007`)

cor(x, y)
## [1] 0.9995124

1.3

Compute the correlation (with cor) between the 1980, 1990, 2000, and 2010 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use select() function to first subset the data frame to keep the four columns only. To use a column name in select() that starts with a number, surround it with backticks. How does this change when we use the use = "complete.obs" argument?

mort_sub <-
  mort %>%
  select(`1980`, `1990`, `2000`, `2010`)

cor(mort_sub)
##           1980      1990      2000 2010
## 1980 1.0000000 0.9601540 0.8888411   NA
## 1990 0.9601540 1.0000000 0.9613842   NA
## 2000 0.8888411 0.9613842 1.0000000   NA
## 2010        NA        NA        NA    1
cor(mort_sub, use = "complete.obs")
##           1980      1990      2000      2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000

1.4

Perform a t-test to determine if there is evidence of a difference between child mortality in 1987 versus 2007. Use the pull() function to extract these columns. Print the results using the tidy function from the broom package.

x <- pull(mort, `1987`)
y <- pull(mort, `2007`)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 4.8283, df = 348.16, p-value = 2.065e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1991780 0.4729767
## sample estimates:
## mean of x mean of y 
## 0.7700633 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1    0.336     0.770     0.434      4.83 0.00000207      348.    0.199     0.473
## # ℹ 2 more variables: method <chr>, alternative <chr>

Practice on Your Own!

P.1

Perform a t-test to determine if there is evidence of a difference between child mortality in 2006 versus 2007. Use the pull() function to extract these columns. Print the results using the tidy function. How do these results compare to those in question 1.4?

x <- pull(mort, `2006`)
y <- pull(mort, `2007`)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0.30527, df = 391.32, p-value = 0.7603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09485069  0.12972006
## sample estimates:
## mean of x mean of y 
## 0.4514207 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   0.0174     0.451     0.434     0.305   0.760      391.  -0.0949     0.130
## # ℹ 2 more variables: method <chr>, alternative <chr>

Part 2

2.1

Read in the Kaggle used car auction dataset (https://www.kaggle.com/datasets/tunguz/used-car-auction-prices). Assign it to the “cars” variable. You can find the data here: http://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv.

cars <- read_csv("http://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv")
## Rows: 72983 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): PurchDate, Auction, Make, Model, Trim, SubModel, Color, Transmissi...
## dbl (10): RefId, IsBadBuy, VehYear, VehicleAge, VehOdo, BYRNO, VNZIP1, VehBC...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2

Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and whether it’s an online sale (“IsOnlineSale”) as the predictor. Save the model fit in an object called “lmfit_cars” and display the summary table with summary().

# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_cars <- glm(VehBCost ~ IsOnlineSale, data = cars)
summary(lmfit_cars)
## 
## Call:
## glm(formula = VehBCost ~ IsOnlineSale, data = cars)
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6721.220      6.624 1014.622   <2e-16 ***
## IsOnlineSale  384.260     41.664    9.223   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3121685)
## 
##     Null deviance: 2.2809e+11  on 72982  degrees of freedom
## Residual deviance: 2.2782e+11  on 72981  degrees of freedom
## AIC: 1298500
## 
## Number of Fisher Scoring iterations: 2

2.3

Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and vehicle age (“VehicleAge”) and whether it’s an online sale (“IsOnlineSale”) as predictors. Save the model fit in an object called “lmfit_cars_2” and display the summary table.

lmfit_cars_2 <- glm(VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
summary(lmfit_cars_2)
## 
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8067.47      16.44 490.705  < 2e-16 ***
## VehicleAge    -321.80       3.63 -88.639  < 2e-16 ***
## IsOnlineSale   297.31      39.60   7.508 6.07e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2818312)
## 
##     Null deviance: 2.2809e+11  on 72982  degrees of freedom
## Residual deviance: 2.0568e+11  on 72980  degrees of freedom
## AIC: 1291040
## 
## Number of Fisher Scoring iterations: 2

Practice on Your Own!

P.2

Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome with predictors: (1) vehicle age (“VehicleAge”), (2) whether it’s an online sale (“IsOnlineSale”), and interaction between “VehicleAge” and “IsOnlineSale”.

  • To include the interaction, use VehicleAge * IsOnlineSale in the formula.
  • Save the model fit in an object called “lmfit_cars_3” and display the summary table with summary().
lmfit_cars_3 <- glm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge * IsOnlineSale, data = cars)
summary(lmfit_cars_3)
## 
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge * 
##     IsOnlineSale, data = cars)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8062.702     16.587 486.085  < 2e-16 ***
## VehicleAge              -320.662      3.668 -87.413  < 2e-16 ***
## IsOnlineSale             514.308    107.711   4.775  1.8e-06 ***
## VehicleAge:IsOnlineSale  -55.373     25.561  -2.166   0.0303 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2818169)
## 
##     Null deviance: 2.2809e+11  on 72982  degrees of freedom
## Residual deviance: 2.0567e+11  on 72979  degrees of freedom
## AIC: 1291037
## 
## Number of Fisher Scoring iterations: 2

P.3

Fit a logistic regression model where the outcome is “bad buy” (“IsBadBuy”) status and predictors are the cost (“VehBCost”) and vehicle age (“VehicleAge”).

  • Save the model fit in an object called “logfit_cars” and display the summary table with summary().
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
logfit_cars <- glm(IsBadBuy ~ VehBCost + VehicleAge, data = cars, family = binomial(link = "logit"))
summary(logfit_cars)
## 
## Call:
## glm(formula = IsBadBuy ~ VehBCost + VehicleAge, family = binomial(link = "logit"), 
##     data = cars)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.525e+00  6.379e-02  -39.58   <2e-16 ***
## VehBCost    -9.091e-05  6.877e-06  -13.22   <2e-16 ***
## VehicleAge   2.569e-01  6.867e-03   37.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54421  on 72982  degrees of freedom
## Residual deviance: 52264  on 72980  degrees of freedom
## AIC: 52270
## 
## Number of Fisher Scoring iterations: 5