Load the packages needed in this lab. Then, read in the following child mortality data by country. Assign it to the “mort” variable. Change its first column name from “...1” to “country”. You can find the data here: https://jhudatascience.org/intro_to_r/data/mortality.csv.
Note that the data has lots of NA values - don’t worry if you see that.
library(tidyverse)
library(broom)
mort <- read_csv("https://jhudatascience.org/intro_to_r/data/mortality.csv")
## New names:
## • `` -> `...1`
## Rows: 197 Columns: 255
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): ...1
## dbl (254): 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mort <- mort %>%
  rename(country = `...1`)
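The “...1” name appears because the first header field in the CSV is blank, so read_csv() generates a placeholder name. The sketch below reproduces that behavior on a tiny inline CSV. The country names and values are made up for illustration and are not taken from the mortality data.

```r
library(readr)
library(dplyr)

# Hypothetical two-row CSV whose first header field is blank,
# mimicking the layout of the mortality file.
toy_csv <- ",2006,2007\nAlphaland,0.5,0.4\nBetaland,0.7,0.6\n"

# I() tells read_csv() to treat the string as literal data, not a file path.
toy <- read_csv(I(toy_csv), show_col_types = FALSE)
names(toy)

# Backticks are required because ...1 is not a syntactic R name.
toy <- toy %>%
  rename(country = `...1`)
names(toy)
```

The same backtick rule applies to the year columns, whose names start with a digit.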
Compute the correlation (with cor) between the 2006 and 2007 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the pull() function to first extract these columns. To use a column name in pull() that starts with a number, surround it with backticks. Then, use the cor function.
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
cor(x, y)
## [1] 0.9995124
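By default, cor() returns NA when either input contains a missing value, so the non-missing result above suggests that the 2006 and 2007 columns are complete. The sketch below shows the same pull() and cor() pattern on a made-up tibble (hypothetical values) where the answer is known in advance.

```r
library(dplyr)

# Hypothetical toy data: the second column is exactly twice the first.
toy <- tibble(`2006` = c(1, 2, 3, 4), `2007` = c(2, 4, 6, 8))

x <- pull(toy, `2006`)   # a plain numeric vector, not a one-column tibble
y <- pull(toy, `2007`)

is.numeric(x)
cor(x, y)   # a perfectly linear relationship gives a correlation of 1
```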
Compute the correlation (with cor) between the 1980, 1990, 2000, and 2010 mortality variables. (No need to save this in an object. Just display the result to the screen.) Use the select() function to first subset the data frame to keep only these four columns. To use a column name in select() that starts with a number, surround it with backticks. How does this change when we use the use = "complete.obs" argument?
mort_sub <-
  mort %>%
  select(`1980`, `1990`, `2000`, `2010`)
cor(mort_sub)
##           1980      1990      2000 2010
## 1980 1.0000000 0.9601540 0.8888411   NA
## 1990 0.9601540 1.0000000 0.9613842   NA
## 2000 0.8888411 0.9613842 1.0000000   NA
## 2010        NA        NA        NA    1
cor(mort_sub, use = "complete.obs")
##           1980      1990      2000      2010
## 1980 1.0000000 0.9596846 0.8877433 0.8468284
## 1990 0.9596846 1.0000000 0.9610269 0.9247192
## 2000 0.8877433 0.9610269 1.0000000 0.9862345
## 2010 0.8468284 0.9247192 0.9862345 1.0000000
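With use = "complete.obs", any row that has an NA in one of the selected columns is dropped before every correlation is computed. That would explain why the 1980 and 1990 correlation shifts slightly (0.9601540 to 0.9596846) even though only the 2010 column produced NA results. The sketch below illustrates this on a made-up data frame (hypothetical values) and contrasts it with "pairwise.complete.obs", which drops rows only for the pairs that need it.

```r
# Hypothetical toy data: only column c has a missing value (row 2).
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 6),
  c = c(1, NA, 2, 5, 4)
)

cor(toy)                                  # pairs involving c are NA
cor(toy, use = "complete.obs")            # row 2 dropped for every pair
cor(toy, use = "pairwise.complete.obs")   # row 2 dropped only for pairs with c

# Under "complete.obs" the a-b correlation matches a by-hand
# calculation that removes row 2 first.
all.equal(cor(toy, use = "complete.obs")["a", "b"],
          cor(toy$a[-2], toy$b[-2]))
```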
Perform a t-test to determine if there is evidence of a difference between child mortality in 1987 versus 2007. Use the pull() function to extract these columns. Print the results using the tidy function from the broom package.
x <- pull(mort, `1987`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 4.8283, df = 348.16, p-value = 2.065e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1991780 0.4729767
## sample estimates:
## mean of x mean of y
## 0.7700633 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic    p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>      <dbl>     <dbl>    <dbl>     <dbl>
## 1    0.336     0.770     0.434      4.83 0.00000207      348.    0.199     0.473
## # ℹ 2 more variables: method <chr>, alternative <chr>
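The value of tidy() is that each piece of the printed test becomes an ordinary column of a one-row tibble. In the output above, “estimate” is the difference between “estimate1” and “estimate2” (0.770 minus 0.434). The sketch below shows how to pull individual values out, using made-up vectors (hypothetical values) in place of the mortality columns.

```r
library(broom)

# Hypothetical toy vectors standing in for two years of data.
x <- c(0.9, 0.8, 0.7, 0.6, 0.5)
y <- c(0.5, 0.4, 0.45, 0.3, 0.35)

res <- tidy(t.test(x, y))   # one-row tibble

# Individual results are now regular columns.
res$p.value
res$conf.low
res$conf.high

# estimate is the difference in sample means, x minus y.
all.equal(as.numeric(res$estimate), mean(x) - mean(y))
```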
Perform a t-test to determine if there is evidence of a difference between child mortality in 2006 versus 2007. Use the pull() function to extract these columns. Print the results using the tidy function. How do these results compare to those in question 1.4?
x <- pull(mort, `2006`)
y <- pull(mort, `2007`)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 0.30527, df = 391.32, p-value = 0.7603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09485069 0.12972006
## sample estimates:
## mean of x mean of y
## 0.4514207 0.4339860
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   0.0174     0.451     0.434     0.305   0.760      391.  -0.0949     0.130
## # ℹ 2 more variables: method <chr>, alternative <chr>
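In the outputs above, the 1987 versus 2007 test has a p-value of 2.065e-06 and a confidence interval that excludes 0, while the 2006 versus 2007 test has a p-value of 0.7603 and an interval that spans 0. One way to make such a comparison easier to read is to stack the tidy results in one tibble. The sketch below does this with made-up vectors. The names “early”, “recent”, and “late” and their values are hypothetical.

```r
library(dplyr)
library(broom)

# Hypothetical toy vectors: "early" clearly differs from "late",
# while "recent" is almost identical to "late".
early  <- c(0.9, 0.8, 0.7, 0.6, 0.5)
recent <- c(0.52, 0.41, 0.46, 0.31, 0.36)
late   <- c(0.5, 0.4, 0.45, 0.3, 0.35)

# Stack the two one-row results; .id records which test each row came from.
both <- bind_rows(
  list(
    "early vs late"  = tidy(t.test(early, late)),
    "recent vs late" = tidy(t.test(recent, late))
  ),
  .id = "comparison"
)

both %>%
  select(comparison, estimate, statistic, p.value, conf.low, conf.high)
```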
Read in the Kaggle used car auction dataset (https://www.kaggle.com/datasets/tunguz/used-car-auction-prices). Assign it to the “cars” variable. You can find the data here: http://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv.
cars <- read_csv("http://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv")
## Rows: 72983 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): PurchDate, Auction, Make, Model, Trim, SubModel, Color, Transmissi...
## dbl (10): RefId, IsBadBuy, VehYear, VehicleAge, VehOdo, BYRNO, VNZIP1, VehBC...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and whether it’s an online sale (“IsOnlineSale”) as the predictor. Save the model fit in an object called “lmfit_cars” and display the summary table with summary().
# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_cars <- glm(VehBCost ~ IsOnlineSale, data = cars)
summary(lmfit_cars)
##
## Call:
## glm(formula = VehBCost ~ IsOnlineSale, data = cars)
##
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)
## (Intercept)  6721.220      6.624 1014.622   <2e-16 ***
## IsOnlineSale  384.260     41.664    9.223   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 3121685)
##
## Null deviance: 2.2809e+11 on 72982 degrees of freedom
## Residual deviance: 2.2782e+11 on 72981 degrees of freedom
## AIC: 1298500
##
## Number of Fisher Scoring iterations: 2
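With its default gaussian family, glm() fits the same model as lm(). With a single 0/1 predictor, the intercept is the mean outcome in the 0 group and the slope is the difference in group means. Assuming “IsOnlineSale” is coded 0/1, the output above would read as a mean cost of about 6721 for non-online sales and about 384 more for online sales. The sketch below checks both facts on the built-in mtcars data, which is a stand-in and not part of this lab.

```r
# mtcars ships with R; am is a 0/1 indicator (0 = automatic, 1 = manual).
fit_glm <- glm(mpg ~ am, data = mtcars)   # default family is gaussian
fit_lm  <- lm(mpg ~ am, data = mtcars)

# Both functions give the same coefficient estimates.
all.equal(coef(fit_glm), coef(fit_lm))

# Intercept = mean of mpg when am == 0;
# slope     = mean of mpg when am == 1 minus mean when am == 0.
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
coef(fit_glm)
group_means
```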
Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome and vehicle age (“VehicleAge”) and whether it’s an online sale (“IsOnlineSale”) as predictors. Save the model fit in an object called “lmfit_cars_2” and display the summary table.
lmfit_cars_2 <- glm(VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
summary(lmfit_cars_2)
##
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale, data = cars)
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   8067.47      16.44 490.705  < 2e-16 ***
## VehicleAge    -321.80       3.63 -88.639  < 2e-16 ***
## IsOnlineSale   297.31      39.60   7.508 6.07e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2818312)
##
## Null deviance: 2.2809e+11 on 72982 degrees of freedom
## Residual deviance: 2.0568e+11 on 72980 degrees of freedom
## AIC: 1291040
##
## Number of Fisher Scoring iterations: 2
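After adjusting for vehicle age, the “IsOnlineSale” estimate drops from 384.260 to 297.31. Comparisons like this are easier when the coefficients are in a data frame. The tidy() function used earlier for t-tests also accepts model fits. The sketch below shows this on the built-in mtcars data, which is a stand-in and not part of this lab.

```r
library(broom)

# Two-predictor linear model on a built-in dataset.
fit <- glm(mpg ~ wt + am, data = mtcars)

# One row per coefficient: term, estimate, std.error, statistic, p.value.
coefs <- tidy(fit)
coefs

# Pull a single estimate by term name.
coefs$estimate[coefs$term == "wt"]
```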
Fit a linear regression model with vehicle cost (“VehBCost”) as the outcome with predictors: (1) vehicle age (“VehicleAge”), (2) whether it’s an online sale (“IsOnlineSale”), and (3) the interaction between “VehicleAge” and “IsOnlineSale”. To include the interaction, use VehicleAge * IsOnlineSale in the formula. Save the model fit in an object called “lmfit_cars_3” and display the summary table with summary().
lmfit_cars_3 <- glm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge * IsOnlineSale, data = cars)
summary(lmfit_cars_3)
##
## Call:
## glm(formula = VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge *
## IsOnlineSale, data = cars)
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)             8062.702     16.587 486.085  < 2e-16 ***
## VehicleAge              -320.662      3.668 -87.413  < 2e-16 ***
## IsOnlineSale             514.308    107.711   4.775  1.8e-06 ***
## VehicleAge:IsOnlineSale  -55.373     25.561  -2.166   0.0303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2818169)
##
## Null deviance: 2.2809e+11 on 72982 degrees of freedom
## Residual deviance: 2.0567e+11 on 72979 degrees of freedom
## AIC: 1291037
##
## Number of Fisher Scoring iterations: 2
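In an R formula, a * b expands to a + b + a:b, so writing the main effects and the * term together fits the same model as the * term alone. The interaction coefficient is the change in the age slope between the two groups. Assuming “IsOnlineSale” is coded 0/1, the output above implies a slope of -320.662 per year for non-online sales and -320.662 + (-55.373) = -376.035 for online sales. The sketch below checks both points on the built-in mtcars data, which is a stand-in and not part of this lab.

```r
# The long and short spellings describe the same model.
fit_long  <- glm(mpg ~ wt + am + wt * am, data = mtcars)
fit_short <- glm(mpg ~ wt * am, data = mtcars)

names(coef(fit_long))
all.equal(coef(fit_long), coef(fit_short))

# Slope of wt within each am group, implied by the interaction model.
b <- coef(fit_short)
slope_am0 <- b[["wt"]]                  # slope when am == 0
slope_am1 <- b[["wt"]] + b[["wt:am"]]   # slope when am == 1

# The am == 1 slope matches a separate regression on that subgroup.
sub_fit <- lm(mpg ~ wt, data = subset(mtcars, am == 1))
all.equal(slope_am1, coef(sub_fit)[["wt"]])
```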
Fit a logistic regression model where the outcome is “bad buy” (“IsBadBuy”) status and predictors are the cost (“VehBCost”) and vehicle age (“VehicleAge”). Save the model fit in an object called “logfit_cars” and display the summary table with summary().
# General format
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
logfit_cars <- glm(IsBadBuy ~ VehBCost + VehicleAge, data = cars, family = binomial(link = "logit"))
summary(logfit_cars)
##
## Call:
## glm(formula = IsBadBuy ~ VehBCost + VehicleAge, family = binomial(link = "logit"),
## data = cars)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.525e+00  6.379e-02  -39.58   <2e-16 ***
## VehBCost    -9.091e-05  6.877e-06  -13.22   <2e-16 ***
## VehicleAge   2.569e-01  6.867e-03   37.41   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54421 on 72982 degrees of freedom
## Residual deviance: 52264 on 72980 degrees of freedom
## AIC: 52270
##
## Number of Fisher Scoring iterations: 5
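Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. For the output above, exp(0.2569) is roughly 1.29, meaning each additional year of vehicle age multiplies the estimated odds of a bad buy by about 1.29 at a fixed cost. The sketch below shows the conversion and how to get predicted probabilities. It uses the built-in mtcars data as a stand-in, and the two weights in “new_cars” are hypothetical.

```r
# Logistic regression on a built-in dataset: am is a 0/1 outcome.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Coefficients are log-odds; exponentiate to get odds ratios.
exp(coef(fit))

# Predicted probabilities for two hypothetical car weights.
new_cars <- data.frame(wt = c(2.5, 3.5))
p   <- predict(fit, newdata = new_cars, type = "response")  # probability scale
eta <- predict(fit, newdata = new_cars, type = "link")      # log-odds scale

# The inverse logit (plogis) maps log-odds to probabilities.
all.equal(p, plogis(eta))
```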