Part 1

1.1

Load the packages needed in this lab. Then, read in the following child mortality data by country. Assign it to the “mort” variable. Use clean_names() from the janitor package to reformat some column names, then change the first column name from x1 to country. You can find the data here: https://jhudatascience.org/intro_to_r/data/mortality.csv.

Note that the data has lots of NA values - don’t worry if you see that.

install.packages("janitor", repos='http://cran.us.r-project.org')
library(tidyverse)
library(broom)
library(janitor)
mort <- read_csv("https://jhudatascience.org/intro_to_r/data/mortality.csv") %>%
  clean_names()
## New names:
## Rows: 197 Columns: 255
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (254): 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768,
## 1769, 1770,...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
mort <- mort %>%
  rename(country = x1)

1.2

Compute the correlation (with cor) between the x2006 and x2007 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the pull() function to first extract these columns. Then, use the cor function.

x <- pull(mort, x2006)
y <- pull(mort, x2007)

cor(x, y)
## [1] 0.9995124

1.3

Compute the correlation (with cor) between the x1980, x1990, x2000, and x2010 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use select() function to first subset the data frame to keep the four columns only. How does this change when we use the use = "complete.obs" argument?

mort_sub <-
  mort %>%
  select(x1980, x1990, x2000, x2010)

cor(mort_sub)
##           x1980     x1990     x2000 x2010
## x1980 1.0000000 0.9601540 0.8888411    NA
## x1990 0.9601540 1.0000000 0.9613842    NA
## x2000 0.8888411 0.9613842 1.0000000    NA
## x2010        NA        NA        NA     1
cor(mort_sub, use = "complete.obs")
##           x1980     x1990     x2000     x2010
## x1980 1.0000000 0.9596846 0.8877433 0.8468284
## x1990 0.9596846 1.0000000 0.9610269 0.9247192
## x2000 0.8877433 0.9610269 1.0000000 0.9862345
## x2010 0.8468284 0.9247192 0.9862345 1.0000000

1.4

Perform a t-test to determine if there is evidence of a difference between child mortality in x1987 versus x2007. Use the pull() function to extract these columns. Print the results using the tidy function from the broom package.

x <- pull(mort, x1987)
y <- pull(mort, x2007)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 4.8283, df = 348.16, p-value = 2.065e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1991780 0.4729767
## sample estimates:
## mean of x mean of y 
## 0.7700633 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1    0.336     0.770     0.434      4.83 0.00000207      348.    0.199     0.473
## # ℹ 2 more variables: method <chr>, alternative <chr>

Practice on Your Own!

P.1

Perform a t-test to determine if there is evidence of a difference between child mortality in x2006 versus x2007. Use the pull() function to extract these columns. Print the results using the tidy function. How do these results compare to those in question 1.4?

x <- pull(mort, x2006)
y <- pull(mort, x2007)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0.30527, df = 391.32, p-value = 0.7603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09485069  0.12972006
## sample estimates:
## mean of x mean of y 
## 0.4514207 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   0.0174     0.451     0.434     0.305   0.760      391.  -0.0949     0.130
## # ℹ 2 more variables: method <chr>, alternative <chr>

Part 2

2.1

Read in the Kaggle used car auction dataset (https://www.kaggle.com/datasets/tunguz/used-car-auction-prices). Assign it to the “cars” variable. You can find the data here: http://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv.

cars <- read_csv("http://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv")
## Rows: 72983 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): PurchDate, Auction, Make, Model, Trim, SubModel, Color, Transmissi...
## dbl (10): RefId, IsBadBuy, VehYear, VehicleAge, VehOdo, BYRNO, VNZIP1, VehBC...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2

Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and whether it’s an online sale (“IsOnlineSale”) as the predictor. Save the model fit in an object called “lmfit_cars” and display the summary table with summary().

# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_cars <- glm(VehBCost ~ IsOnlineSale, data = cars)
summary(lmfit_cars)
## 
## Call:
## glm(formula = VehBCost ~ IsOnlineSale, data = cars)
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6721.220      6.624 1014.622   <2e-16 ***
## IsOnlineSale  384.260     41.664    9.223   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3121685)
## 
##     Null deviance: 2.2809e+11  on 72982  degrees of freedom
## Residual deviance: 2.2782e+11  on 72981  degrees of freedom
## AIC: 1298500
## 
## Number of Fisher Scoring iterations: 2

2.3

Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and vehicle age (“VehicleAge”) and whether it’s an online sale (“IsOnlineSale”) as predictors. Save the model fit in an object called “lmfit_cars_2” and display the summary table.

lmfit_cars_2 <- glm(VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
summary(lmfit_cars_2)
## 
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8067.47      16.44 490.705  < 2e-16 ***
## VehicleAge    -321.80       3.63 -88.639  < 2e-16 ***
## IsOnlineSale   297.31      39.60   7.508 6.07e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2818312)
## 
##     Null deviance: 2.2809e+11  on 72982  degrees of freedom
## Residual deviance: 2.0568e+11  on 72980  degrees of freedom
## AIC: 1291040
## 
## Number of Fisher Scoring iterations: 2

Practice on Your Own!

P.2

Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome with predictors: (1) vehicle age (“VehicleAge”), (2) whether it’s an online sale (“IsOnlineSale”), and interaction between “VehicleAge” and “IsOnlineSale”.

  • To include the interaction, use VehicleAge * IsOnlineSale in the formula.
  • Save the model fit in an object called “lmfit_cars_3” and display the summary table with summary().
lmfit_cars_3 <- glm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge * IsOnlineSale, data = cars)
summary(lmfit_cars_3)
## 
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge * 
##     IsOnlineSale, data = cars)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8062.702     16.587 486.085  < 2e-16 ***
## VehicleAge              -320.662      3.668 -87.413  < 2e-16 ***
## IsOnlineSale             514.308    107.711   4.775  1.8e-06 ***
## VehicleAge:IsOnlineSale  -55.373     25.561  -2.166   0.0303 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2818169)
## 
##     Null deviance: 2.2809e+11  on 72982  degrees of freedom
## Residual deviance: 2.0567e+11  on 72979  degrees of freedom
## AIC: 1291037
## 
## Number of Fisher Scoring iterations: 2

P.3

Fit a logistic regression model where the outcome is “bad buy” (“IsBadBuy”) status and predictors are the cost (“VehBCost”) and vehicle age (“VehicleAge”).

  • Save the model fit in an object called “logfit_cars” and display the summary table with summary().
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
logfit_cars <- glm(IsBadBuy ~ VehBCost + VehicleAge, data = cars, family = binomial(link = "logit"))
summary(logfit_cars)
## 
## Call:
## glm(formula = IsBadBuy ~ VehBCost + VehicleAge, family = binomial(link = "logit"), 
##     data = cars)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.525e+00  6.379e-02  -39.58   <2e-16 ***
## VehBCost    -9.091e-05  6.877e-06  -13.22   <2e-16 ***
## VehicleAge   2.569e-01  6.867e-03   37.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54421  on 72982  degrees of freedom
## Residual deviance: 52264  on 72980  degrees of freedom
## AIC: 52270
## 
## Number of Fisher Scoring iterations: 5