Please submit your filled out .Rmd
file to the assignment drop box on CoursePlus: https://courseplus.jhu.edu/core/index.cfm/go/db:assignment.submit/assignmentID/40988/coID/16733/
All assignments are due by the end of the grading period for this term (Wednesday, Jan 26, 2022 at 11:59pm EST). Please reach out to us if you need help before this time!
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
1. (a) Make a variable “bday”. Assign it your birthday in day-month format (1-Jan). (b) Make a variable “name”. Assign it your name as a character string.
bday <- "19-Feb"
name <- "Bruce Wayne"
2. Make a variable “me” that is “bday” and “name” combined.
me <- c(bday, name)
3. Determine the data class for “me”.
class(me)
## [1] "character"
The class for “me” is character.
4. If I want to do me / 2
I get the following error: Error in me/2 : non-numeric argument to binary operator
. Why? Write your answer as a comment inside the R chunk below.
# R cannot perform math functions on character data.
The following questions involve an outside dataset.
We will be working with a dataset from the “Kaggle” website, which hosts competitions for prediction and machine learning. More details on this dataset are here: https://www.kaggle.com/c/DontGetKicked/overview/background.
5. Bring the dataset into R. The dataset is located at: https://jhudatascience.org/intro_to_R_class/data/kaggleCarAuction.csv. You can use the link, download it, or use whatever method you like for getting the file. Once you get the file, read the dataset in using read_csv()
and assign it the name “cars”.
cars <- read_csv("https://jhudatascience.org/intro_to_R_class/data/kaggleCarAuction.csv")
## Rows: 72983 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): PurchDate, Auction, Make, Model, Trim, SubModel, Color, Transmissi...
## dbl (10): RefId, IsBadBuy, VehYear, VehicleAge, VehOdo, BYRNO, VNZIP1, VehBC...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# OR
url <- "https://jhudatascience.org/intro_to_R_class/data/kaggleCarAuction.csv"
cars <- read_csv(url)
## Rows: 72983 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): PurchDate, Auction, Make, Model, Trim, SubModel, Color, Transmissi...
## dbl (10): RefId, IsBadBuy, VehYear, VehicleAge, VehOdo, BYRNO, VNZIP1, VehBC...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# OR
download.file("https://jhudatascience.org/intro_to_R_class/data/kaggleCarAuction.csv",
destfile = "cars_data.csv",
overwrite = TRUE,
mode = "wb")
cars <- read_csv("cars_data.csv")
## Rows: 72983 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): PurchDate, Auction, Make, Model, Trim, SubModel, Color, Transmissi...
## dbl (10): RefId, IsBadBuy, VehYear, VehicleAge, VehOdo, BYRNO, VNZIP1, VehBC...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
6. Import the data “dictionary” from https://jhudatascience.org/intro_to_R_class/data/Carvana_Data_Dictionary_formatted.txt. Use the read_tsv()
function and assign it the name “key”.
key <- read_tsv("https://jhudatascience.org/intro_to_R_class/data/Carvana_Data_Dictionary_formatted.txt")
## Rows: 36 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Field Name, Definition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# OR
download.file("https://jhudatascience.org/intro_to_R_class/data/Carvana_Data_Dictionary_formatted.txt",
destfile = "dict.txt",
overwrite = TRUE,
mode = "wb")
key <- read_tsv("dict.txt")
## Rows: 36 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): Field Name, Definition
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
7. R can save individual variables as .rds
files that can be imported again later. Save the “cars” data in an .rds file using the write_rds()
function.
write_rds(cars, file = "kaggle.rds")
8. You should now be ready to work with the “cars” dataset.
spec()
or typeof()
. Write your answer as a comment inside the R chunk below.spec(cars)
## cols(
## RefId = col_double(),
## IsBadBuy = col_double(),
## PurchDate = col_character(),
## Auction = col_character(),
## VehYear = col_double(),
## VehicleAge = col_double(),
## Make = col_character(),
## Model = col_character(),
## Trim = col_character(),
## SubModel = col_character(),
## Color = col_character(),
## Transmission = col_character(),
## WheelTypeID = col_character(),
## WheelType = col_character(),
## VehOdo = col_double(),
## Nationality = col_character(),
## Size = col_character(),
## TopThreeAmericanName = col_character(),
## MMRAcquisitionAuctionAveragePrice = col_character(),
## MMRAcquisitionAuctionCleanPrice = col_character(),
## MMRAcquisitionRetailAveragePrice = col_character(),
## MMRAcquisitonRetailCleanPrice = col_character(),
## MMRCurrentAuctionAveragePrice = col_character(),
## MMRCurrentAuctionCleanPrice = col_character(),
## MMRCurrentRetailAveragePrice = col_character(),
## MMRCurrentRetailCleanPrice = col_character(),
## PRIMEUNIT = col_character(),
## AUCGUART = col_character(),
## BYRNO = col_double(),
## VNZIP1 = col_double(),
## VNST = col_character(),
## VehBCost = col_double(),
## IsOnlineSale = col_double(),
## WarrantyCost = col_double()
## )
# RefId = col_double()
# IsBadBuy = col_double()
# PurchDate = col_character()
9. How many cars (rows) are in the dataset? How many variables (columns) are recorded for each car?
dim(cars)
## [1] 72983 34
nrow(cars)
## [1] 72983
There are 72983 cars in the dataset and 34 variables.
10. Filter out (i.e., remove) any vehicles that cost less than or equal to $5000 (“VehBCost”) or that have missing values. Reassign the new filtered dataset to “cars”. How many vehicles are left after filtering?
Hint: The filter()
function also removes missing values.
cars <- filter(cars, VehBCost > 5000)
nrow(cars)
## [1] 59957
# OR
cars <- cars %>% filter(VehBCost > 5000)
nrow(cars)
## [1] 59957
There are 59957 cars after filtering by price.
11. From this point on, work with the filtered “cars” dataset from the above question. Given the average car loan today is 70 months, create a new variable (column) called “MonthlyPrice” that shows the monthly cost for each car (Divide “VehBCost” by 70). Check to make sure the new column is there.
Hint: use the mutate()
function.
cars <- cars %>% mutate(MonthlyPrice = VehBCost / 70)
# OR
cars <- mutate(cars, MonthlyPrice = VehBCost / 70)
colnames(cars)
## [1] "RefId" "IsBadBuy"
## [3] "PurchDate" "Auction"
## [5] "VehYear" "VehicleAge"
## [7] "Make" "Model"
## [9] "Trim" "SubModel"
## [11] "Color" "Transmission"
## [13] "WheelTypeID" "WheelType"
## [15] "VehOdo" "Nationality"
## [17] "Size" "TopThreeAmericanName"
## [19] "MMRAcquisitionAuctionAveragePrice" "MMRAcquisitionAuctionCleanPrice"
## [21] "MMRAcquisitionRetailAveragePrice" "MMRAcquisitonRetailCleanPrice"
## [23] "MMRCurrentAuctionAveragePrice" "MMRCurrentAuctionCleanPrice"
## [25] "MMRCurrentRetailAveragePrice" "MMRCurrentRetailCleanPrice"
## [27] "PRIMEUNIT" "AUCGUART"
## [29] "BYRNO" "VNZIP1"
## [31] "VNST" "VehBCost"
## [33] "IsOnlineSale" "WarrantyCost"
## [35] "MonthlyPrice"
12. What is the range of the manufacture year (“VehYear”) of the vehicles?
range( cars %>% pull(VehYear) )
## [1] 2001 2010
# OR
car_range <- pull(cars, VehYear)
range(car_range)
## [1] 2001 2010
# OR
range(cars$VehYear)
## [1] 2001 2010
table(cars$VehYear)
##
## 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## 532 1741 3859 7262 12296 15494 11079 6871 822 1
The range is 2001, 2010.
13. How many cars were from before 2004? What percent/proportion do these represent? Use:
filter()
and nrow()
group_by()
and summarize()
orsum()
# How many cars were from before 2004?
nrow(cars %>% filter(VehYear < 2004))
## [1] 6132
# OR
cars %>% group_by(VehYear < 2004) %>% summarize(total = n())
## # A tibble: 2 x 2
## `VehYear < 2004` total
## <lgl> <int>
## 1 FALSE 53825
## 2 TRUE 6132
# OR
sum(cars$VehYear < 2004)
## [1] 6132
# OR
table(cars$VehYear < 2004)
##
## FALSE TRUE
## 53825 6132
# what percent/proportion do these represent?
nrow(cars %>% filter(VehYear < 2004)) / nrow(cars)
## [1] 0.1022733
# OR
mean(cars$VehYear < 2004)
## [1] 0.1022733
The percent of cars were from before 2004 is 10.2273296.
14. How many different vehicle manufacturers/makes (“Make”) are there?
Hint: use length()
with unique()
or table()
. Remember to pull()
the right column.
cars %>% pull(Make) %>% unique() %>% length()
## [1] 32
# OR
length(unique(cars %>% pull(Make)))
## [1] 32
# OR
length(unique(cars$Make))
## [1] 32
# OR
length(table(cars$Make))
## [1] 32
32 unique makes.
15. How many different vehicle models (“Model”) are there?
cars %>% pull(Model) %>% unique() %>% length()
## [1] 985
985 unique models.
16. Which vehicle color group had the highest mean acquisition cost paid for the vehicle at time of purchase, and what was this cost?
Hint: Use group_by()
with summarize()
. To determine the column that corresponds to “acquisition cost paid for the vehicle at time of purchase”, check the “key” corresponding to the data dictionary that you imported above.
cars %>% group_by(Color) %>%
summarize(mean = mean(VehBCost))
## # A tibble: 17 x 2
## Color mean
## <chr> <dbl>
## 1 BEIGE 7317.
## 2 BLACK 7538.
## 3 BLUE 7182.
## 4 BROWN 7509.
## 5 GOLD 7052.
## 6 GREEN 7089.
## 7 GREY 7551.
## 8 MAROON 7220.
## 9 NOT AVAIL 7151.
## 10 NULL 5860
## 11 ORANGE 7135.
## 12 OTHER 7429.
## 13 PURPLE 6889.
## 14 RED 7279.
## 15 SILVER 7175.
## 16 WHITE 7201.
## 17 YELLOW 6922.
17. Extend on the code you wrote for question 16. Use the arrange()
function to sort the output by mean acquisition cost.
cars %>% group_by(Color) %>%
summarize(mean = mean(VehBCost)) %>%
arrange(desc(mean))
## # A tibble: 17 x 2
## Color mean
## <chr> <dbl>
## 1 GREY 7551.
## 2 BLACK 7538.
## 3 BROWN 7509.
## 4 OTHER 7429.
## 5 BEIGE 7317.
## 6 RED 7279.
## 7 MAROON 7220.
## 8 WHITE 7201.
## 9 BLUE 7182.
## 10 SILVER 7175.
## 11 NOT AVAIL 7151.
## 12 ORANGE 7135.
## 13 GREEN 7089.
## 14 GOLD 7052.
## 15 YELLOW 6922.
## 16 PURPLE 6889.
## 17 NULL 5860
18. How many vehicles were red and have fewer than 30,000 miles? To determine the column that corresponds to mileage (odometer reading), check the “key” corresponding to the data dictionary that you imported above. use:
filter()
and count()
filter()
and tally()
orsum()
cars %>% filter(Color == "RED" & VehOdo < 30000) %>% tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 24
# OR
cars %>% filter(Color == "RED" & VehOdo < 30000) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 24
# OR
sum(cars$Color == "RED" & cars$VehOdo < 30000)
## [1] 24
A total of 24 cars.
19. How many vehicles are blue or red? use:
filter()
and count()
filter()
and tally()
orsum()
cars %>% filter(Color == "BLUE" | Color == "RED" ) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 13777
# OR
cars %>% filter(Color %in% c("BLUE","RED")) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 13777
# OR
sum(cars$Color == "BLUE" | cars$Color == "RED" )
## [1] 13777
# OR
sum(cars$Color %in% c("BLUE","RED"))
## [1] 13777
A total of 13777 cars.
20. Select all columns in “cars” where the column names starts with “Veh” (using select()
and starts_with()
. Then, use colMeans()
to summarize across these columns.
cars %>% select( starts_with("Veh") ) %>% colMeans()
## VehYear VehicleAge VehOdo VehBCost
## 2005.654252 3.896176 70336.967210 7264.971979
A. Using “cars”, create a new binary (TRUEs and FALSEs) column to indicate if the car has an automatic transmission. Call the new column “is_automatic”.
cars <- cars %>% mutate(is_automatic = (Transmission == "AUTO"))
B. What is the average vehicle odometer reading for cars that are both RED and NISSANs? How does this compare with vehicles that do NOT fit this criteria?
mean_red_nissan <- cars %>%
filter(Color == "RED" & Make == "NISSAN") %>%
summarize(mean = mean(VehOdo)) %>% pull()
mean_not_red_nissan <- cars %>%
filter(Color != "RED" | Make != "NISSAN") %>%
summarize(mean = mean(VehOdo)) %>% pull()
Red Nissans have an average odometer of 7.5117316^{4} while cars not fitting this criteria are 7.0324337^{4}
C. Among red Nissans, what is the distribution of vehicle ages?
red_nissan <- cars %>% filter(Color == "RED" & Make == "NISSAN")
red_nissan %>% group_by(VehicleAge) %>% select(VehicleAge) %>% table()
## .
## 2 3 4 5 6 7 8 9
## 8 35 48 34 13 14 4 2
# OR
red_nissan <- cars %>% filter(Color == "RED" & Make == "NISSAN")
red_nissan %>% group_by(VehicleAge) %>% count()
## # A tibble: 8 x 2
## # Groups: VehicleAge [8]
## VehicleAge n
## <dbl> <int>
## 1 2 8
## 2 3 35
## 3 4 48
## 4 5 34
## 5 6 13
## 6 7 14
## 7 8 4
## 8 9 2
# OR
red_nissan <- cars %>% filter(Color == "RED" & Make == "NISSAN")
red_nissan %>% group_by(VehicleAge) %>% tally()
## # A tibble: 8 x 2
## VehicleAge n
## <dbl> <int>
## 1 2 8
## 2 3 35
## 3 4 48
## 4 5 34
## 5 6 13
## 6 7 14
## 7 8 4
## 8 9 2
D. How many vehicles (using filter()
or sum()
) are made by Chrysler or Nissan and are white or silver?
sum((cars$Make == "CHRYSLER" | cars$Make=="NISSAN") & (cars$Color == "WHITE" | cars$Color == "SILVER" ))
## [1] 3718
# OR
sum(cars$Make %in% c("CHRYSLER","NISSAN") & cars$Color %in% c("WHITE","SILVER" ))
## [1] 3718
# OR
cars %>% filter(Make %in% c("CHRYSLER","NISSAN") & cars$Color %in% c("WHITE","SILVER" )) %>% count()
## # A tibble: 1 x 1
## n
## <int>
## 1 3718
A total of 3718 cars.
E. Make a boxplot (boxplot()
) that looks at vehicle age (“VehicleAge”) on the x-axis and odometer reading (“VehOdo”) on the y-axis.
boxplot( cars %>% pull(VehOdo) ~ cars %>% pull(VehicleAge) )
F. Knit your document into a report.
You use the knit button to do this. Make sure all your code is working first!