1. Life Expectancy by Region

data %>% 
  janitor::clean_names() %>% 
  group_by(region) %>% 
  summarise(avg_by_region = mean(une_life,na.rm = T)) %>% 
  arrange(avg_by_region) %>% 
  knitr::kable(digits = 3)
| region | avg_by_region |
|---|---|
| Africa | 57.070 |
| South-East Asia | 68.754 |
| Eastern Mediterranean | 70.070 |
| Western Pacific | 71.604 |
| Americas | 73.347 |
| Europe | 75.700 |
  • The table shows the average life expectancy of each region from 2000 to 2016, sorted in ascending order.

  • The regional averages span about 18.6 years, from 57.07 in Africa to 75.70 in Europe.

  • Europe has the highest average life expectancy while Africa has the lowest.
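
The gap between the highest and lowest regional means can be computed directly from the table. A minimal sketch, with the means re-typed by hand from the table rather than recomputed from `data` (the names `avg_by_region` and `spread` are chosen here for illustration):

```r
# Regional means copied from the table above (not recomputed from `data`)
avg_by_region <- c(
  "Africa" = 57.070,
  "South-East Asia" = 68.754,
  "Eastern Mediterranean" = 70.070,
  "Western Pacific" = 71.604,
  "Americas" = 73.347,
  "Europe" = 75.700
)

# Gap between the highest and the lowest regional mean
spread <- diff(range(avg_by_region))
spread

# Regions with the lowest and the highest mean
names(avg_by_region)[c(which.min(avg_by_region), which.max(avg_by_region))]
```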

 
data %>% 
  janitor::clean_names() %>% 
  group_by(region) %>% 
  ggplot(aes(x = fct_reorder(region,une_life), y = une_life, fill = region)) +
  geom_boxplot() +
  labs(title = "Boxplot of Life Expectancy by Regions") +
  xlab("Region") +
  ylab("Life Expectancy") + 
  theme(axis.text.x = element_text(hjust = 1, angle = 10,size = 8))

  • The boxplot shows the distribution of life expectancy in each region.

  • The variance of life expectancy is higher in Eastern Mediterranean and Africa.

  • The Americas has many outliers with low life expectancy.

  • The plot suggests that the variances of life expectancy are not equal across regions. We therefore perform a statistical test to check for heteroscedasticity.

 

test of equal variances

\[H_0: \text{equal variances among regions} \ \text{ vs } \ H_1: \text{unequal variances among regions} \]

bartlett.test(une_life ~ factor(region),data = data) %>% 
  broom::tidy() %>% 
  knitr::kable()
| statistic | p.value | parameter | method |
|---|---|---|---|
| 318.8595 | 0 | 5 | Bartlett test of homogeneity of variances |

The null hypothesis of the Bartlett test is that the variances are equal. The p-value is less than 0.05 (it is displayed as 0 after rounding), so we reject the null and conclude that the variances of life expectancy are not equal across regions. Classical one-way ANOVA assumes equal variances, so it is not appropriate here for testing the difference in mean life expectancy across all six regions. Instead, we perform t-tests between selected pairs of regions.
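
For reference, the Bartlett statistic has the standard form below. The symbols are introduced here for illustration and do not appear in the original analysis: \(k\) is the number of groups, \(n_i\) and \(s_i^2\) are the size and sample variance of group \(i\), \(N = \sum_i n_i\), and \(s_p^2\) is the pooled variance.

\[T = \frac{(N-k)\ln s_p^2 - \sum_{i=1}^{k}(n_i-1)\ln s_i^2}{1 + \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}\frac{1}{n_i-1} - \frac{1}{N-k}\right)}, \qquad s_p^2 = \frac{1}{N-k}\sum_{i=1}^{k}(n_i-1)\,s_i^2 \]

Under \(H_0\), \(T\) is approximately \(\chi^2\) distributed with \(k-1\) degrees of freedom. With \(k = 6\) regions this gives 5 degrees of freedom, which matches the `parameter` column in the output.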



2. t.test : Compare Mean Life Expectancy Between Americas and Europe

 

From the boxplot above, life expectancy in the Americas and in Europe spans almost the same interval. The median in Europe is higher, while the variance in the Americas seems smaller. We therefore compare the mean life expectancy of the two regions with a formal test.

 

extract life expectancy in Americas and Europe

Americas <- data %>% 
  janitor::clean_names() %>% 
  filter(region == "Americas") %>% 
  pull(une_life)


Europe <- data %>% 
  janitor::clean_names() %>% 
  filter(region == "Europe") %>% 
  pull(une_life) 

To decide which type of t-test to perform, we first need to compare the variances in the Americas and Europe.

The boxplot suggests that the Americas has a smaller variance, but it also has many outliers with low life expectancy. We therefore use a statistical test to compare the two variances.

   

test equal variance

\[H_0: \sigma^2_{Americas} =\ \sigma^2_{Europe}\ \text { vs } \ H_1: \sigma^2_{Americas} \neq\ \sigma^2_{Europe} \]

var.test(Americas,Europe,alternative = "two.sided",conf.level = 0.95) %>% 
  broom::tidy() %>% 
  knitr::kable()
## Multiple parameters; naming those columns num.df, den.df
| estimate | num.df | den.df | statistic | p.value | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|
| 0.7259117 | 560 | 849 | 0.7259117 | 4.12e-05 | 0.6248981 | 0.8452553 | F test to compare two variances | two.sided |

The null hypothesis of the variance test is that the two variances are equal. The p-value (about 4.1e-05) is much less than 0.05, so we reject the null hypothesis and conclude that the variances are not equal. Next, we perform a two-sample t-test with unknown and unequal variances (Welch's t-test).
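
The reported p-value can be recovered from the F statistic and its degrees of freedom. The sketch below copies the values from the output above instead of re-running `var.test()`; the variable names are illustrative.

```r
# Values copied from the var.test output above
f_stat <- 0.7259117   # ratio of sample variances, Americas / Europe
df_num <- 560         # number of Americas observations minus 1
df_den <- 849         # number of Europe observations minus 1

# Two-sided p-value: twice the smaller tail probability of the F distribution
lower_tail <- pf(f_stat, df_num, df_den)
p_two_sided <- 2 * min(lower_tail, 1 - lower_tail)
p_two_sided
```

Because the statistic is below 1, the lower tail is the smaller one, which reflects the smaller sample variance in the Americas.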

   

2 sample t.test with unknown unequal variances

\[H_0: \text{mean life_exp}_{Americas} = \ \text{mean life_exp}_{Europe}\ \text { vs } \ H_1: \text{mean life_exp}_{Americas} < \text{mean life_exp}_{Europe} \]

t.test(Americas,Europe,alternative = "less",conf.level = 0.95,paired = F,var.equal = FALSE ) %>% 
  broom::tidy() %>% 
  knitr::kable()
| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| -2.352754 | 73.34729 | 75.70004 | -9.837228 | 0 | 1320.964 | -Inf | -1.959081 | Welch Two Sample t-test | less |

The null hypothesis of the t-test is that the two means are equal. Because the call uses alternative = "less", the alternative is one-sided: mean life expectancy in the Americas is lower than in Europe. The p-value is much less than 0.05 (it is displayed as 0 after rounding), so we reject the null hypothesis and conclude that mean life expectancy in the Americas is lower than in Europe. The estimated difference is about -2.35 years, and the one-sided 95% confidence interval has an upper limit of about -1.96 years.
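
The one-sided p-value and the upper confidence limit follow from the reported estimate, t statistic, and degrees of freedom. A minimal sketch using the values copied from the output above (the variable names are illustrative, and the standard error is back-calculated, not taken from the original analysis):

```r
# Values copied from the t.test output above
estimate  <- -2.352754   # mean(Americas) - mean(Europe)
statistic <- -9.837228   # Welch t statistic
df        <- 1320.964    # Welch-Satterthwaite degrees of freedom

# The standard error is implied by statistic = estimate / se
se <- estimate / statistic

# One-sided p-value for H1: mean(Americas) < mean(Europe)
p_less <- pt(statistic, df)

# Upper limit of the one-sided 95% confidence interval
upper <- estimate + qt(0.95, df) * se
c(se = se, p_less = p_less, upper = upper)
```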



3. prop.test : Compare the Proportion of Life Expectancy Over 65 Between Western Pacific and South-East Asia

 

We can tell from the boxplot above that the boxes of Western Pacific and South-East Asia almost overlap, and most values in both regions seem to be above 65 years. We are therefore interested in comparing the proportion of life expectancy values over 65 years in the two regions.

\[H_0: \text{Proportion}_{Western\ Pacific} = \ \text{Proportion}_{South-East\ Asia}\ \text { vs } \ H_1: \text{Proportion}_{Western\ Pacific} \neq \text{Proportion}_{South-East\ Asia} \]

data %>% 
  janitor::clean_names() %>% 
  filter(region == "Western Pacific") %>% 
  summarise(above_65 = sum(une_life > 65),
            total = n()) %>% 
  knitr::kable()
| above_65 | total |
|---|---|
| 304 | 357 |
data %>% 
  janitor::clean_names() %>% 
  filter(region == "South-East Asia")  %>% 
  summarise(above_65 = sum(une_life > 65),
            total = n()) %>% 
  knitr::kable()
| above_65 | total |
|---|---|
| 149 | 187 |
prop.test(c(304,149),n = c(357,187),correct = F) %>% 
  broom::tidy() %>% 
  knitr::kable()
| estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|
| 0.8515406 | 0.7967914 | 2.640731 | 0.1041556 | 1 | -0.0137086 | 0.1232069 | 2-sample test for equality of proportions without continuity correction | two.sided |

The null hypothesis of the prop.test is that the two proportions are equal. The p-value is approximately 0.104, so at the 0.05 significance level we fail to reject the null. We do not have sufficient evidence that the proportion of life expectancy values above 65 differs between Western Pacific and South-East Asia. Failing to reject the null does not prove that the two proportions are equal.
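
Without continuity correction, the chi-squared statistic reported by `prop.test()` is the square of the pooled two-sample z statistic. The sketch below recomputes it by hand from the counts above and compares it with `prop.test()`; the variable names are illustrative.

```r
# Counts copied from the two summary tables above
x <- c(304, 149)   # values above 65: Western Pacific, South-East Asia
n <- c(357, 187)   # total observations per region

p_hat  <- x / n
p_pool <- sum(x) / sum(n)   # pooled proportion under H0

# Pooled two-sample z statistic; its square is the chi-squared statistic
z <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n))
chi_sq <- z^2
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

# Compare with prop.test() without continuity correction
ref <- prop.test(x, n, correct = FALSE)
c(manual = chi_sq, prop_test = unname(ref$statistic), p_value = p_value)
```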



4. Average Life Expectancy by (region, year) Combination

data %>% 
  janitor::clean_names() %>% 
  group_by(region,year) %>% 
  summarise(avg_by_year_region = mean(une_life,na.rm = T)) %>% 
  pivot_wider(
    names_from = region,
    values_from = avg_by_year_region
  ) %>% 
  knitr::kable(digits = 3)
| year | Africa | Americas | Eastern Mediterranean | Europe | South-East Asia | Western Pacific |
|---|---|---|---|---|---|---|
| 2000 | 52.816 | 71.661 | 68.080 | 73.602 | 64.851 | 69.400 |
| 2001 | 53.040 | 71.900 | 68.350 | 73.923 | 65.463 | 69.706 |
| 2002 | 53.294 | 72.129 | 68.619 | 74.072 | 66.068 | 69.995 |
| 2003 | 53.681 | 72.354 | 68.889 | 74.252 | 66.644 | 70.304 |
| 2004 | 54.192 | 72.580 | 69.159 | 74.642 | 67.181 | 70.611 |
| 2005 | 54.744 | 72.789 | 69.425 | 74.784 | 67.677 | 70.899 |
| 2006 | 55.385 | 72.994 | 69.684 | 75.063 | 68.134 | 71.175 |
| 2007 | 56.106 | 73.201 | 69.930 | 75.287 | 68.566 | 71.439 |
| 2008 | 56.840 | 73.395 | 70.161 | 75.613 | 68.983 | 71.692 |
| 2009 | 57.594 | 73.601 | 70.378 | 75.957 | 69.388 | 71.970 |
| 2010 | 58.343 | 73.796 | 70.585 | 76.255 | 69.780 | 72.185 |
| 2011 | 59.065 | 73.985 | 70.788 | 76.672 | 70.161 | 72.409 |
| 2012 | 59.807 | 74.170 | 70.993 | 76.863 | 70.526 | 72.657 |
| 2013 | 60.445 | 74.343 | 71.203 | 77.166 | 70.873 | 72.892 |
| 2014 | 61.062 | 74.512 | 71.420 | 77.483 | 71.204 | 73.113 |
| 2015 | 61.641 | 74.669 | 71.645 | 77.493 | 71.516 | 73.316 |
| 2016 | 62.131 | 74.826 | 71.875 | 77.773 | 71.809 | 73.504 |
data %>% 
  janitor::clean_names() %>% 
  group_by(region,year) %>% 
  summarise(avg_by_year_region = mean(une_life,na.rm = T)) %>% 
  ggplot(aes(x = year, y = avg_by_year_region, color = region)) +
  geom_line() +
  geom_point() +
  labs(title = "Average Life Expectancy by Region, Year") +
  ylab("Average Life Expectancy")

  • The table and line graph above show that average life expectancy rose in every region between 2000 and 2016.
  • The plot clearly shows that Africa has a much lower life expectancy than the other regions throughout the period.
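
The change over the period can be quantified from the first and last rows of the table. A minimal sketch with the 2000 and 2016 means re-typed by hand from the table (the names `avg_2000`, `avg_2016`, and `gain` are illustrative):

```r
# Regional means for 2000 and 2016, copied from the table above
avg_2000 <- c(Africa = 52.816, Americas = 71.661,
              "Eastern Mediterranean" = 68.080, Europe = 73.602,
              "South-East Asia" = 64.851, "Western Pacific" = 69.400)
avg_2016 <- c(Africa = 62.131, Americas = 74.826,
              "Eastern Mediterranean" = 71.875, Europe = 77.773,
              "South-East Asia" = 71.809, "Western Pacific" = 73.504)

# Change in average life expectancy over the period, largest first
gain <- sort(avg_2016 - avg_2000, decreasing = TRUE)
round(gain, 3)
```

Sorting the gains puts the region with the fastest improvement first, which complements the level comparison in the line graph.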



5. Life Expectancy by Income Level

data %>% 
  janitor::clean_names()  %>% 
  filter(!is.na(income_group)) %>%  
  group_by(income_group) %>% 
  summarise(avg_by_income = mean(une_life,na.rm = T)) %>% 
  arrange(avg_by_income) %>% 
  knitr::kable(digits = 3)
| income_group | avg_by_income |
|---|---|
| Low income | 56.798 |
| Lower middle income | 64.725 |
| Upper middle income | 70.861 |
| High income | 77.818 |
data %>% 
  janitor::clean_names() %>% 
  filter(!is.na(income_group)) %>% 
  group_by(income_group) %>% 
  ggplot(aes(x = fct_reorder(income_group,une_life), y = une_life,fill = income_group)) +
  geom_boxplot() +
  labs(title = "Boxplot of Life Expectancy by Income Groups") +
  xlab("Income Groups") +
  ylab("Life Expectancy")

  • The boxes of the different income groups barely overlap.
  • The pattern is clear: countries in higher income groups tend to have higher life expectancy.

\[H_0: \sigma^2_{group \ i} =\ \sigma^2_{group \ j}\ \text { vs } \ H_1: \sigma^2_{group\ i} \neq\ \sigma^2_{group\ j} \]

  • We performed variance tests and concluded that the variances differ between every pair of groups. Because the method is the same as the one shown for the regional differences, we omit the details here.

\[H_0: \text{mean life_exp}_{group \ i} = \ \text{mean life_exp}_{group \ j}\ \text { vs } \ H_1: \text{mean life_exp}_{group \ i} \neq \text{mean life_exp}_{group\ j} \]

  • We also performed two-sample t-tests to compare the mean life expectancy between groups and concluded that the means differ between every pair of groups. Because the method is the same as the one shown for the regional differences, we omit the details here.
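
The pairwise comparisons are not shown in the original, so the following is only a sketch of one way to run them, not the authors' code. It uses simulated values (not the real data) in a hypothetical data frame `toy` whose column names mirror those used above. The Holm adjustment for multiple comparisons is an addition made here and is not mentioned in the original analysis.

```r
set.seed(1)

# Simulated stand-in for the real data: NOT the values analysed above
groups <- c("Low income", "Lower middle income",
            "Upper middle income", "High income")
toy <- data.frame(
  income_group = factor(rep(groups, each = 50), levels = groups),
  une_life     = rnorm(200,
                       mean = rep(c(57, 65, 71, 78), each = 50),
                       sd   = rep(c(8, 7, 5, 3), each = 50))
)

# pool.sd = FALSE runs a separate Welch t-test for each pair of groups;
# Holm's method adjusts the p-values for the six comparisons
res <- pairwise.t.test(toy$une_life, toy$income_group,
                       pool.sd = FALSE, p.adjust.method = "holm")
res$p.value
```

With four groups there are six pairwise tests, so adjusting the p-values keeps the overall false-positive rate under control when every pair is compared.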



6. Life Expectancy by Development Status

data %>% 
  janitor::clean_names() %>% 
  group_by(developed_developing_countries) %>% 
  summarise(avg_by_dev = mean(une_life,na.rm = T)) %>% 
  knitr::kable(digits = 3)
| developed_developing_countries | avg_by_dev |
|---|---|
| Developed | 77.403 |
| Developing | 66.122 |
data %>% 
  janitor::clean_names() %>% 
  group_by(developed_developing_countries) %>% 
  ggplot(aes(x = developed_developing_countries, y = une_life,fill = developed_developing_countries)) +
  geom_boxplot() +
  labs(title = "Boxplot of Life Expectancy by Development Status") +
  xlab("Development Status") +
  ylab("Life Expectancy")

  • The boxes of the two development statuses barely overlap.
  • The pattern is clear: developed countries tend to have higher life expectancy.

\[H_0: \sigma^2_{developing} =\ \sigma^2_{developed}\ \text { vs } \ H_1: \sigma^2_{developing} \neq\ \sigma^2_{developed} \]

  • We performed a variance test and concluded that the variances of the two categories are not equal. Because the method is the same as the one shown for the regional differences, we omit the details here.

\[H_0: \text{mean life_exp}_{developing} = \ \text{mean life_exp}_{developed}\ \text { vs } \ H_1: \text{mean life_exp}_{developing} \neq \text{mean life_exp}_{developed} \]

  • We also performed a two-sample t-test to compare the mean life expectancy between the two groups and concluded that the means are not equal. Because the method is the same as the one shown for the regional differences, we omit the details here.



Summary

In this part, we used statistical analysis to study how life expectancy differs across regions, income groups, and development status. We first examined the distributions and then chose appropriate tests.

We draw the following conclusions from the tests and plots:

  • Mean life expectancy is significantly lower in the Americas than in Europe.

  • Europe has the highest mean life expectancy of the six regions.

  • Africa has the lowest mean life expectancy of the six regions.

  • We found no significant difference between Western Pacific and South-East Asia in the proportion of life expectancy values over 65.

  • Average life expectancy differs significantly among income groups and between development statuses. Higher-income countries and developed countries tend to have higher life expectancy.