KEY Introduction to R: Homework 3

Problem Set

1. Bring the dataset into R. The dataset is located at: https://jhudatascience.org/intro_to_R_class/data/mortality.csv. You can use the link, download it, or use whatever method you like for getting the file. Once you get the file, read the dataset in using read_csv() and assign it the name “mort”.

library(tidyverse) # loads readr (read_csv), dplyr, and tidyr, which the code below relies on
mort <- read_csv("https://jhudatascience.org/intro_to_R_class/data/mortality.csv")

## New names:
## * `` -> ...1

## Rows: 197 Columns: 255

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): ...1
## dbl (254): 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770,...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Run the colnames() function to look at the dataset's column names. The first column originally had no name, so R replaced it with “...1”. Rename the first column of “mort” to “country” using the rename() function from dplyr.

colnames(mort)

##   [1] "...1" "1760" "1761" "1762" "1763" "1764" "1765" "1766" "1767" "1768"
##  [11] "1769" "1770" "1771" "1772" "1773" "1774" "1775" "1776" "1777" "1778"
##  [21] "1779" "1780" "1781" "1782" "1783" "1784" "1785" "1786" "1787" "1788"
##  [31] "1789" "1790" "1791" "1792" "1793" "1794" "1795" "1796" "1797" "1798"
##  [41] "1799" "1800" "1801" "1802" "1803" "1804" "1805" "1806" "1807" "1808"
##  [51] "1809" "1810" "1811" "1812" "1813" "1814" "1815" "1816" "1817" "1818"
##  [61] "1819" "1820" "1821" "1822" "1823" "1824" "1825" "1826" "1827" "1828"
##  [71] "1829" "1830" "1831" "1832" "1833" "1834" "1835" "1836" "1837" "1838"
##  [81] "1839" "1840" "1841" "1842" "1843" "1844" "1845" "1846" "1847" "1848"
##  [91] "1849" "1850" "1851" "1852" "1853" "1854" "1855" "1856" "1857" "1858"
## [101] "1859" "1860" "1861" "1862" "1863" "1864" "1865" "1866" "1867" "1868"
## [111] "1869" "1870" "1871" "1872" "1873" "1874" "1875" "1876" "1877" "1878"
## [121] "1879" "1880" "1881" "1882" "1883" "1884" "1885" "1886" "1887" "1888"
## [131] "1889" "1890" "1891" "1892" "1893" "1894" "1895" "1896" "1897" "1898"
## [141] "1899" "1900" "1901" "1902" "1903" "1904" "1905" "1906" "1907" "1908"
## [151] "1909" "1910" "1911" "1912" "1913" "1914" "1915" "1916" "1917" "1918"
## [161] "1919" "1920" "1921" "1922" "1923" "1924" "1925" "1926" "1927" "1928"
## [171] "1929" "1930" "1931" "1932" "1933" "1934" "1935" "1936" "1937" "1938"
## [181] "1939" "1940" "1941" "1942" "1943" "1944" "1945" "1946" "1947" "1948"
## [191] "1949" "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957" "1958"
## [201] "1959" "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967" "1968"
## [211] "1969" "1970" "1971" "1972" "1973" "1974" "1975" "1976" "1977" "1978"
## [221] "1979" "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988"
## [231] "1989" "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998"
## [241] "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008"
## [251] "2009" "2010" "2030" "2050" "2099"

mort <- mort %>% rename(country = `...1`)

3. Select only the numeric type columns (select()). Then, create the variable “year” from column names by using the colnames() function to extract them.

year <- mort %>% select( -country ) %>% colnames()
# OR 
year <- mort %>% select( starts_with( c("1", "2") ) ) %>% colnames()
# OR 
year <- mort %>% select( where(is.numeric) ) %>% colnames()

4. What is the typeof() for “year”? If it’s not an integer, turn it into integer form with as.integer().

typeof(year)

## [1] "character"

year <- as.integer(year)

After the conversion, “year” is of type integer.
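As a small illustration of why the conversion matters, the sketch below uses a made-up vector (year_chr, year_int, and span are hypothetical names, not part of the homework) to show that integer years support arithmetic while character years do not.

```r
# Hypothetical stand-in for the column names pulled from "mort".
year_chr <- c("1760", "2010", "2099")
typeof(year_chr)   # character

# Convert the text labels to whole numbers.
year_int <- as.integer(year_chr)
typeof(year_int)   # integer

# Arithmetic works on the integer version; the same expression on
# year_chr would stop with an error.
span <- max(year_int) - min(year_int)
span
```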

5. Use the pct_complete() function in the naniar package to determine the percentage of complete (non-missing) data in “mort”. You might need to install and load naniar first!

library(naniar)
pct_complete(mort)

## [1] 66.95332

“mort” is 66.9533194 percent complete.
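To make the meaning of "percent complete" concrete, here is a minimal base R sketch on a made-up table (toy and pct are hypothetical names). It counts non-missing cells by hand; as far as I know this should agree with what pct_complete() reports, but treat that as an assumption rather than a statement about naniar's internals.

```r
# Made-up table: 2 rows x 3 columns = 6 cells, 2 of them missing.
toy <- data.frame(
  country = c("A", "B"),
  y2000   = c(1.5, NA),
  y2001   = c(NA, 2.0)
)

# is.na() returns a logical matrix; the mean of its negation is the
# share of cells that are present.
pct <- mean(!is.na(toy)) * 100
pct   # 4 of 6 cells are present
```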

6. Are there any countries that have a complete record in “mort” across all years? Just look at the output here; don’t reassign it. Hint: look for complete records by dropping all rows that contain NAs using drop_na().

drop_na(mort)

## # A tibble: 2 x 255
##   country  `1760` `1761` `1762` `1763` `1764` `1765` `1766` `1767` `1768` `1769`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Sweden     2.21   2.30   2.79   2.94   2.44   2.35   2.23   2.34   2.44   2.40
## 2 United …   2.20   2.35   2.32   2.32   2.37   2.39   2.27   2.29   2.28   2.32
## # … with 244 more variables: 1770 <dbl>, 1771 <dbl>, 1772 <dbl>, 1773 <dbl>,
## #   1774 <dbl>, 1775 <dbl>, 1776 <dbl>, 1777 <dbl>, 1778 <dbl>, 1779 <dbl>,
## #   1780 <dbl>, 1781 <dbl>, 1782 <dbl>, 1783 <dbl>, 1784 <dbl>, 1785 <dbl>,
## #   1786 <dbl>, 1787 <dbl>, 1788 <dbl>, 1789 <dbl>, 1790 <dbl>, 1791 <dbl>,
## #   1792 <dbl>, 1793 <dbl>, 1794 <dbl>, 1795 <dbl>, 1796 <dbl>, 1797 <dbl>,
## #   1798 <dbl>, 1799 <dbl>, 1800 <dbl>, 1801 <dbl>, 1802 <dbl>, 1803 <dbl>,
## #   1804 <dbl>, 1805 <dbl>, 1806 <dbl>, 1807 <dbl>, 1808 <dbl>, 1809 <dbl>,
## #   1810 <dbl>, 1811 <dbl>, 1812 <dbl>, 1813 <dbl>, 1814 <dbl>, 1815 <dbl>,
## #   1816 <dbl>, 1817 <dbl>, 1818 <dbl>, 1819 <dbl>, 1820 <dbl>, 1821 <dbl>,
## #   1822 <dbl>, 1823 <dbl>, 1824 <dbl>, 1825 <dbl>, 1826 <dbl>, 1827 <dbl>,
## #   1828 <dbl>, 1829 <dbl>, 1830 <dbl>, 1831 <dbl>, 1832 <dbl>, 1833 <dbl>,
## #   1834 <dbl>, 1835 <dbl>, 1836 <dbl>, 1837 <dbl>, 1838 <dbl>, 1839 <dbl>,
## #   1840 <dbl>, 1841 <dbl>, 1842 <dbl>, 1843 <dbl>, 1844 <dbl>, 1845 <dbl>,
## #   1846 <dbl>, 1847 <dbl>, 1848 <dbl>, 1849 <dbl>, 1850 <dbl>, 1851 <dbl>,
## #   1852 <dbl>, 1853 <dbl>, 1854 <dbl>, 1855 <dbl>, 1856 <dbl>, 1857 <dbl>,
## #   1858 <dbl>, 1859 <dbl>, 1860 <dbl>, 1861 <dbl>, 1862 <dbl>, 1863 <dbl>,
## #   1864 <dbl>, 1865 <dbl>, 1866 <dbl>, 1867 <dbl>, 1868 <dbl>, 1869 <dbl>, …
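The behavior of drop_na() is easier to see on a tiny table. The sketch below uses made-up values (toy and complete_rows are hypothetical names): any row with at least one NA is removed, so only fully observed countries remain.

```r
library(tidyr)
library(tibble)

# Made-up data: only country "A" has a value in every year.
toy <- tibble(
  country = c("A", "B", "C"),
  `2000`  = c(1.0, NA, 3.0),
  `2001`  = c(1.1, 2.1, NA)
)

# With no columns named, drop_na() checks every column for NA.
complete_rows <- drop_na(toy)
complete_rows$country
```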

7. Reshape the full “mort” dataset to long form.

There should be a column for country (“country”), a column for year (“year”), and a column for the mortality value (“mortality”).
Use pivot_longer().
You should pivot all columns except “country”.
Hint: listing !COLUMN or -COLUMN means everything except COLUMN.
Assign the reshaped data to “long”.

long <- 
  pivot_longer(mort, -country, names_to = "year", values_to = "mortality")
# OR
long <- 
  pivot_longer(mort, !country, names_to = "year", values_to = "mortality")
# OR
long <- 
  pivot_longer(mort, starts_with( c("1", "2") ), names_to = "year", values_to = "mortality")
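The reshaping can be checked on a small made-up table (wide_toy and long_toy are hypothetical names). This sketch also passes names_transform, an optional pivot_longer() argument that the key above does not use, to convert the new “year” column to integer during the pivot; it assumes a reasonably recent tidyr (1.1.0 or later).

```r
library(tidyr)
library(tibble)

# Made-up wide table: one row per country, one column per year.
wide_toy <- tibble(
  country = c("A", "B"),
  `2000`  = c(1.0, 2.0),
  `2001`  = c(1.1, 2.1)
)

# Pivot every column except "country"; year labels become integers.
long_toy <- pivot_longer(
  wide_toy,
  !country,
  names_to        = "year",
  values_to       = "mortality",
  names_transform = list(year = as.integer)
)
long_toy   # 2 countries x 2 years = 4 rows
```

Without names_transform the “year” column stays character, which is what the key's version of “long” contains.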

8. Bring an additional dataset into R. The dataset is tab-delimited and located at: https://jhudatascience.org/intro_to_R_class/data/country_pop.txt. You can use the link, download it, or use whatever method you like for getting the file. Once you get the file, read the dataset in using read_tsv() and assign it the name “pop”.

pop <- read_tsv("https://jhudatascience.org/intro_to_R_class/data/country_pop.txt")

## Rows: 242 Columns: 6

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): Country (or dependent territory), Date, % of world population, Source
## dbl (1): Rank

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

9. Rename the second column in “pop” to “country” and the column “% of world population” to “percent”. Use the rename() function. Don’t forget to reassign the renamed data to “pop”.

pop <- pop %>% 
  rename(country = `Country (or dependent territory)`,
         percent = `% of world population`)
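Column names that contain spaces or symbols must be wrapped in backticks inside rename(). A minimal sketch on a made-up table (toy_pop and its values are hypothetical) shows the new_name = old_name pattern:

```r
library(dplyr)
library(tibble)

# Made-up table with the same awkward column names as "pop".
toy_pop <- tibble(
  Rank = c(1, 2),
  `Country (or dependent territory)` = c("A", "B"),
  `% of world population` = c("60%", "40%")
)

# rename() takes new_name = old_name; backticks quote non-syntactic names.
toy_pop <- toy_pop %>%
  rename(country = `Country (or dependent territory)`,
         percent = `% of world population`)

colnames(toy_pop)
```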

10. Sort the data in “pop” by “Population” from largest to smallest using arrange() and desc(). After sorting, select() “country” to create a one-column tibble of countries ordered by population. Assign this data the name “country_ordered”.

country_ordered <- pop %>% 
  arrange(desc(Population)) %>% 
  select(country)
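The same sort-then-select pattern on a made-up table (toy_pop and ordered_toy are hypothetical names) makes the resulting order easy to verify by eye:

```r
library(dplyr)
library(tibble)

# Made-up populations, deliberately out of order.
toy_pop <- tibble(
  country    = c("A", "B", "C"),
  Population = c(5, 50, 20)
)

# desc() reverses the sort so the largest population comes first.
ordered_toy <- toy_pop %>%
  arrange(desc(Population)) %>%
  select(country)

ordered_toy$country   # largest to smallest: B, C, A
```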

11. Subset “long” based on years 2000-2010, including 2000 and 2010 and call this “long_sub” using & or the between() function. Confirm your filtering worked by looking at the range of “year”. If you’re getting a strange error, make sure you created the “year” column in problem #7.

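# Note: "year" is still a character column here, so R compares it as text.
# That gives the right rows only because every year label has four digits.
long_sub <- long %>% filter(year >= 2000 & year <= 2010)
long_sub %>% pull(year) %>% range() # confirm it worked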

## [1] "2000" "2010"
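The range above is printed with quotes because “year” is stored as text. A more defensive variant, sketched here on made-up data (toy_long and toy_sub are hypothetical names), converts “year” to integer first and then uses between(), which includes both endpoints:

```r
library(dplyr)
library(tibble)

# Made-up long table with years stored as text, as in "long".
toy_long <- tibble(
  country   = "A",
  year      = c("1999", "2000", "2005", "2010", "2030"),
  mortality = c(1.5, 1.4, 1.2, 1.0, 0.8)
)

# Convert to integer so the comparison is numeric, then keep 2000-2010.
toy_sub <- toy_long %>%
  mutate(year = as.integer(year)) %>%
  filter(between(year, 2000L, 2010L))

range(toy_sub$year)
```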

12. Further subset long_sub. You will filter for specific countries using filter() and the %in% operator. Only include countries in this list: c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Canada"). Make sure to reassign to “long_sub”.

long_sub <- long_sub %>% 
  filter(country %in% c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Canada"))

13. Use pivot_wider() to turn the “year” column of “long_sub” into multiple columns, each representing a different year. Fill values (values_from=) with “mortality”. Assign this pivoted dataset the name “mort_sub”.

mort_sub <- long_sub %>% 
  pivot_wider(id_cols = country, names_from = year, values_from = mortality)
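pivot_wider() is the inverse of the earlier pivot_longer() step. The sketch below, on made-up values (long_toy and wide_toy are hypothetical names), shows each distinct “year” becoming its own column:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up long table: one row per country-year pair.
long_toy <- tibble(
  country   = c("A", "A", "B", "B"),
  year      = c("2000", "2001", "2000", "2001"),
  mortality = c(1.0, 1.1, 2.0, 2.1)
)

# One row per country; column names come from "year", cells from "mortality".
wide_toy <- long_toy %>%
  pivot_wider(id_cols = country, names_from = year, values_from = mortality)

colnames(wide_toy)
```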

14. Using “country_ordered” and “mort_sub”, right_join() the two datasets by “country”. Then use the pipe %>% to left_join() the result to “pop”, keeping only the rows on the left-hand side of the join. Call this “joined”.

joined <- country_ordered %>% 
  right_join(mort_sub, by = "country") %>% 
  left_join(pop, by = "country")
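To see what the two joins do, here is a sketch on made-up tables (ordered_toy, mort_toy, pop_toy, and joined_toy are hypothetical names). The right_join() keeps only countries present in the mortality table, and with dplyr 1.0 or later the matched rows should follow the order of the left table, which is why the final result stays sorted by population.

```r
library(dplyr)
library(tibble)

# Countries already sorted by population (largest first).
ordered_toy <- tibble(country = c("B", "C", "A"))

# Mortality data exists for only two of them.
mort_toy <- tibble(country = c("A", "B"), `2010` = c(1.0, 2.0))

# Population lookup table.
pop_toy <- tibble(country = c("A", "B", "C"), Population = c(5, 50, 20))

joined_toy <- ordered_toy %>%
  right_join(mort_toy, by = "country") %>%   # keep every row of mort_toy
  left_join(pop_toy, by = "country")         # add Population where it matches

joined_toy
```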

15. The values in the table are percentages of the total population (not proportions). Create a new column called “mort_count” that estimates the total number of child deaths per year based on the total population. You can use any year, or an average of all of them, to make your calculation. Whatever you choose, justify your choice. Finally, select() only “country”, “Population”, and “mort_count” and view the data.

# Justification is just for fun. The main point is that decisions in your analysis should depend on your reasoning, not on how many lines of code it takes :)

# Use 2010: There appears to be a downward trend in mortality rates, so using 2010 could be the most accurate for future years.
joined %>% mutate( mort_count = Population * `2010` / 100) %>% select(country, Population, mort_count)

## # A tibble: 6 x 3
##   country   Population mort_count
##   <chr>          <dbl>      <dbl>
## 1 Iran        77056000     79995.
## 2 Thailand    65926261     58939.
## 3 Canada      35002447     17119.
## 4 Venezuela   28946101     40455.
## 5 Estonia      1294455      1122.
## 6 Bahrain      1234571      1609.

# OR

# Use an average 2000-2010: Using the average is more robust to fluctuations and uncertainty in the coming years.
avg_mort <- rowMeans(joined %>% select( starts_with("2") ))
joined <- joined %>% mutate(avg_pct_2000 = avg_mort)
joined %>% mutate( mort_count = Population * avg_pct_2000 / 100) %>% select(country, Population, mort_count)

## # A tibble: 6 x 3
##   country   Population mort_count
##   <chr>          <dbl>      <dbl>
## 1 Iran        77056000    107019.
## 2 Thailand    65926261     65944.
## 3 Canada      35002447     16886.
## 4 Venezuela   28946101     46433.
## 5 Estonia      1294455      1265.
## 6 Bahrain      1234571      1772.
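The arithmetic behind both versions is count = Population * percent / 100. A worked sketch with made-up numbers (toy, avg_pct, and the values are hypothetical, chosen so the result is easy to check by hand):

```r
library(dplyr)
library(tibble)

# Made-up populations and mortality percentages for two years.
toy <- tibble(
  country    = c("A", "B"),
  Population = c(1000000, 200000),
  `2009`     = c(0.5, 1.0),
  `2010`     = c(0.3, 0.8)
)

# Average the year columns row by row: A -> 0.4, B -> 0.9.
avg_pct <- rowMeans(toy %>% select(starts_with("2")))

# Divide by 100 because the values are percentages, not proportions.
toy <- toy %>% mutate(mort_count = Population * avg_pct / 100)

toy %>% select(country, Population, mort_count)   # about 4000 and 1800
```

Forgetting the division by 100 would inflate every estimate a hundredfold, which is the mistake the percentage-versus-proportion note in the problem statement guards against.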
