# Final Project

## Data

The data that I am using for my project is from the World Happiness Report.

I obtained the data from Kaggle.com at this link: https://www.kaggle.com/ajaypalsinghlo/world-happiness-report-2021/download.


The data comes from an annual report from the Gallup World Poll. Here is a link for more information about this poll: https://www.gallup.com/analytics/247355/gallup-world-happiness-report.aspx. This report has gained global recognition and is used to inform policy making decisions. 

The variables in the data include score estimates for various factors that have been shown to be related to happiness. The estimate for happiness variable is called "Ladder Score". There is data related to a fake country called "Dystopia" which simulates a country that has the worlds lowest values for all the factors related to happiness. The factors related to happiness are sometimes used to explain why some countries report higher levels of happiness.

I attempted to find more about how people were surveyed for the poll, but did not find much information. The poll however occurs annually through a survey of adults around the world.


Now I will load packages that I may need in my analysis:
```{r}
library(tidyverse)
```


## Data Import


I will then import my data by using the `read_csv()` function of the readr package as the data is in csv format. I will name the data Happiness.

```{r}
Happiness <- read_csv("world-happiness-report-2021.csv")
```

I then take a look at my data using the `head()` function. I use the `dim()` function to see how many rows/columns there are.

```{r}
head(Happiness)
dim(Happiness)
```

To determine how many unique countries there are I used the `unqiue()` function and the `length()` function.

```{r}
pull(Happiness, "Country name") %>%
  unique() %>%
  length()
```

This shows that there are 149 Countries.


## Data Cleaning/Wrangling

I decided to rename the first variable called `Country name` so that it would not have a space - this makes it easier to work with as I will not need to use quotes around it for most functions. I used the `rename()` function to do this, listing the new name first.

```{r}

Happiness <- Happiness %>%
  rename("Country_name" = "Country name")
colnames(Happiness)
```

I can see that this column is renamed, but I then realized that many columns have names with spaces. I decided to do the same thing for the `"Ladder score"` variable. I renamed it `"Happiness_score"`, as it is easier to tell what this means.


```{r}
Happiness <- Happiness %>%
  rename("Happiness_score" = "Ladder score")
colnames(Happiness)
```

I then decided to arrange the data by `Happiness_score` using the `arrange()` function with the largest score shown first, thus I needed to use `desc()`.

```{r}
Happiness <- Happiness %>% arrange(desc(Happiness_score))
head(Happiness)
```

I could see from this that Finland is the top scoring country and that all of the top 6 countries are in Western Europe.


I then wanted to know about the US so I filtered for `United States` for the `Country_name`.

```{R}
Happiness %>% filter(Country_name == "United States")
```

Looks like the score was fairly high for the United States - much higher than Dystopia.

I wondered how the US ranked, so I decided to make a variable for rank. To do so I used the `seq()` function to create a sequence of numbers from 1 to 149 the number of countries or rows in the data.

```{r}
Happiness <- Happiness %>% mutate(rank = seq(1:149))
Happiness$rank
```


I then looked again at the US using `filter()`. I used the `pull()` function to get the rank for the US.

```{r}
Happiness %>%
  filter(Country_name == "United States") %>%
  pull(rank)
```

The US ranked 19 in 2021 for happiness out of the 149 countries.

## Data Visualization

I decided to make a plot of the top 5 countries using `geom_point()` and `ggplot()`.
First I created a new dataset of just the top 5 ranking countries using the `filter()` function to filter for countries with a `rank` of less than 5.

```{r}
top5 <- Happiness %>% filter(rank < 5)

ggplot(
  data = top5,
  aes(
    x = Country_name,
    y = Happiness_score
  )
) +
  geom_point()
```
I felt that this plot could be improved. So first I used I made `Country_name` a factor and then reordered it by the `Happiness_score` variable. 

```{r}
top5 <- top5 %>%
  mutate(
    Country_name = as_factor(Country_name),
    Country_name = fct_reorder(Country_name, Happiness_score)
  )

ggplot(
  data = top5,
  aes(
    x = Country_name,
    y = Happiness_score
  )
) +
  geom_point()
```


I then decided to make a plot of a few countries of interest. Since I again wanted to have countries ordered by the `Happiness_score`, I figured I might look at additional countries, so I decided to use the full dataset `Happiness`. 

```{r}


Happiness <- Happiness %>%
  mutate(
    Country_name = as_factor(Country_name),
    Country_name = fct_reorder(Country_name, Happiness_score)
  )
```

Now I check out these other countries that I was interested in by filtering for them from the  `Happiness` data set and creating another plot.

```{r}
int_Country <- Happiness %>% filter(Country_name %in% c("China", "United States", "India", "Mexico"))

ggplot(
  data = int_Country,
  aes(
    x = Country_name,
    y = Happiness_score
  ),
  group = Country_name
) +
  geom_point()
```

I can see from this plot that the United States ranked highest. Mexico ranked second, China third, and India fourth.


I then decided to compare different regions. Using `count()` and `tally()` I determined that there were 10 different regions.

```{r}
count(Happiness, `Regional indicator`)
count(Happiness, `Regional indicator`) %>% tally()
```


I then used `ggplot()`  and `geom_boxplot()` to create a boxplot of the happiness scores for the countries in these different regions. I decided to use `coord_flip()` to flip the coordinates to make it easier to read the different region names. I colored the boxplots but using the `fill` argument within the `aes` specifications. I changed the style of the plot with `theme_linedraw()` and I removed the legend using `theme(legend.position = "none")`.

```{r}

ggplot(
  data = Happiness,
  aes(
    x = `Regional indicator`,
    y = Happiness_score,
    fill = `Regional indicator`
  )
) +
  geom_boxplot() +
  theme_linedraw() +
  theme(legend.position = "none") +
  coord_flip()
```


# Data Analysis

I then decided to perform a simple analysis to compare countries in different regions. I decided to compare countries in `Western Europe` with those of `East Asia`.

I first filtered the data to create two new data sets for these regions only and then performed a t test without assuming equal variance between the scores for the countries for each region.

```{r}
West_Eur <- Happiness %>%
  filter(`Regional indicator` == "Western Europe")
NAmer <- Happiness %>%
  filter(`Regional indicator` == "North America and ANZ")

t.test(
  x = pull(West_Eur, Happiness_score),
  y = pull(NAmer, Happiness_score),
  var.equal = FALSE
)
```

The p-value was greater than 0.05, suggesting that there might not be a difference in happiness scores between Western Europe and North America. 

Disclaimer - this was intended to demonstrate how one might perform such an analysis. We are not claiming that there is a difference or is not a difference in happiness scores between the countries in these two regions. 

## Bonus:

If I wanted to perform this analysis on multiple aspects about my data more simply, I might think about making a function like the following:

```{r}
test_reg_diff <- function(aspect) {
  t.test(
    x = pull(West_Eur, aspect),
    y = pull(NAmer, aspect),
    var.equal = FALSE
  )
}
```

This allows me to just specify what aspect (or variable) I would like to compare between the two regions. I can apply it by supplying the name of the column to pull from the datasets.

```{r}
test_reg_diff(aspect = c("Social support"))
test_reg_diff(aspect = c("Healthy life expectancy"))
```

This tests if there is a column name that is the same as what is supplied as `aspect`. This then gets pulled from each dataset for the t test.

There appears to be no significant difference according to the p-value for `Social Support` or `Healthy life expectancy`, as they are less than 0.5.

Disclaimer - this was intended to demonstrate how one might perform such an analysis. We are not claiming that there is a difference or is not a difference in social support or healthy life expectancy between the countries in these two regions. 

```{r}
sessionInfo()
```

## How this might get evaluated
According to the guidelines: 

1. The student described **where** the data came from including how the instructor could access the data. (0.25 / 0.25 points)
2. The student described information about the data (what the variables mean) and how the data was originally created. (0.25 / 0.25 points)
3. The student performed more than 3 data wrangling methods described in the course, such as arranging data, filtering data, renaming variables, and creating a new variable (`rank`). (3 / 3 points)
4. The student described what steps they did to wrangle the data and why.  (1/1 points)
5. The student made two different kinds of visualizations (scatter/point and boxplot). (2 / 2 points)
6. The student performed a simple analysis on the data to compare two regions with a t-test.  (2 / 2  points)
7. The student described the analysis and provided some interpretation of the results. (1 / 1 points)
8. The student had `session_info()` print at the end of the analysis (0.5 / 0.5 points)
9. The student made sure that the document knit to create an html file.
Bonus: A function was also created worth 1.5 points, but this is not required.

Thus student would get full points (10 out of 10). The extra points from the bonus would not be used to account for less points for other aspects of the project, as the student fulfilled all the other aspects required.