Final Project

For the final project, the goal is to practice data analysis with the statistical tools learned in this course. The dataset is used to examine life expectancy across U.S. states in 2021 for males and females, and to identify the states with the highest and lowest overall life expectancy. The analysis also asks which sex lives longer, in which states, and by how much. It combines descriptive statistics, inferential testing, and visualizations to draw conclusions from the data.

The dataset comes from the CDC's National Center for Health Statistics. The table below lists each variable in the dataset and what it represents.

Variable   Description
state      U.S. state name
sex        “Male” or “Female”
le         Life expectancy (years)
se         Standard error of the estimate
quartile   Quartile rank of the estimate

> library(tidyverse)
> library(janitor)
> 
> # 1. LOAD DATA AND CLEAN COLUMN NAMES
> df <- read_csv("~/U.S._State_Life_Expectancy_by_Sex,_2021_20251130.csv") %>%
+   clean_names()

The dataset was loaded with read_csv(), and janitor::clean_names() was applied to the column names. This function standardizes spacing and capitalization (converting names to lowercase snake_case), which makes later data manipulation less error-prone and prepares the dataset for reshaping and analysis.
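
As a small illustration of what clean_names() does, the sketch below applies it to a made-up tibble. The column names and values here are hypothetical and are not the actual headers of the CDC file; the object names raw and cleaned are also just for this example.

```r
library(tibble)
library(janitor)

# Hypothetical raw headers with spaces, capitals, and punctuation
raw <- tibble(
  `State Name` = c("State A", "State B"),
  `Life Expectancy (Years)` = c(78.0, 75.5),
  `SE` = c(0.2, 0.1)
)

# clean_names() converts the headers to lowercase snake_case
cleaned <- clean_names(raw)
names(cleaned)
```

After cleaning, the columns can be referenced without backticks, for example cleaned$life_expectancy_years.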


> # 3. RENAME COLUMNS CORRECTLY AND CALCULATE OVERALL
> df_clean <- df_clean %>%
+   rename(
+     life_female = `Female`,
+     life_male = `Male`
+   ) %>%
+   mutate(
+     life_total = (life_male + life_female)/2,  # average overall life expectancy
+     gender_gap = life_female - life_male        # female minus male
+   )


Before this step, the data is reshaped from long to wide format so that each state has separate columns for male and female life expectancy (the reshaping code, step 2, is not shown in the transcript). Overall life expectancy for each state is then calculated as the simple average of the male and female values, and the gender gap is computed as female minus male life expectancy. The result is a dataset suitable for both descriptive and inferential analysis.
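
A minimal sketch of how the reshaping might look is given below, assuming the cleaned data has the columns state, sex, and le from the variable table. The presence of a Total category is inferred from the printed output further down, the names df_long and df_wide are hypothetical, and the code actually used may differ. The values for the two states are copied from the printed tables.

```r
library(dplyr)
library(tidyr)

# Toy long-format data: values for two states copied from the printed tables
df_long <- tibble(
  state = rep(c("Hawaii", "Mississippi"), each = 3),
  sex   = rep(c("Total", "Male", "Female"), times = 2),
  le    = c(79.9, 77.0, 83.1, 70.9, 67.7, 74.3)
)

# Assumed reshaping step: one row per state, one column per sex category
df_wide <- df_long %>%
  select(state, sex, le) %>%
  pivot_wider(names_from = sex, values_from = le)

# Rename and derive the new columns as in the transcript
df_wide <- df_wide %>%
  rename(life_female = Female, life_male = Male) %>%
  mutate(
    life_total = (life_male + life_female) / 2,  # simple average of the two sexes
    gender_gap = life_female - life_male         # female minus male
  )
df_wide
```

Note that life_total, the simple average of the male and female values, is not the same quantity as the Total column that comes with the data (for Hawaii, 80.05 versus 79.9).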


> # 4. SUMMARY STATISTICS
> summary_stats <- df_clean %>%
+   summarize(
+     avg_life = mean(life_total, na.rm = TRUE),
+     min_life = min(life_total, na.rm = TRUE),
+     max_life = max(life_total, na.rm = TRUE)
+   )
> summary_stats
# A tibble: 1 × 3
  avg_life min_life max_life
     <dbl>    <dbl>    <dbl>
1     76.1       71     80.0

The mean of the state-level values is approximately 76.1 years. Life expectancy is highest in Hawaii (80.0 years) and lowest in Mississippi (71.0 years), a range of about nine years that highlights substantial geographic variation in health outcomes across the country.


> # 5. TOP 10 STATES (HIGHEST LIFE EXPECTANCY)
> top_states <- df_clean %>% slice_max(life_total, n = 10)
> top_states
# A tibble: 10 × 6
   state      Total life_male life_female
   <chr>      <dbl>     <dbl>       <dbl>
 1 Hawaii      79.9      77          83.1
 2 Massachus…  79.6      76.9        82.2
 3 Connectic…  79.2      76.3        82  
 4 New Jersey  79        76.3        81.6
 5 New York    79        76.3        81.6
 6 Minnesota   78.8      76.3        81.4
 7 New Hamps…  78.5      76.1        81.1
 8 Vermont     78.4      75.7        81.2
 9 Rhode Isl…  78.5      75.9        81  
10 California  78.3      75.3        81.4
# ℹ 2 more variables: life_total <dbl>,
#   gender_gap <dbl>
> 
> # 6. BOTTOM 10 STATES (LOWEST LIFE EXPECTANCY)
> bottom_states <- df_clean %>% slice_min(life_total, n = 10)
> bottom_states
# A tibble: 10 × 6
   state      Total life_male life_female
   <chr>      <dbl>     <dbl>       <dbl>
 1 Mississip…  70.9      67.7        74.3
 2 West Virg…  71        68.1        74.2
 3 Alabama     72        68.9        75.3
 4 Louisiana   72.2      68.8        75.9
 5 Kentucky    72.3      69.6        75.3
 6 Tennessee   72.4      69.4        75.5
 7 Arkansas    72.5      69.7        75.6
 8 Oklahoma    72.7      70          75.6
 9 New Mexico  73        69.4        77  
10 South Car…  73.5      70.4        76.7
# ℹ 2 more variables: life_total <dbl>,
#   gender_gap <dbl>

Three states had life expectancy above 79 years: Hawaii, Massachusetts, and Connecticut. The lowest values were in Mississippi, West Virginia, and Alabama, which may reflect disparities in healthcare, lifestyle, and socioeconomic conditions.


> # 7. GENDER GAP STATISTICS
> gender_gap_stats <- df_clean %>%
+   summarize(
+     avg_gap = mean(gender_gap, na.rm = TRUE),
+     max_gap = max(gender_gap, na.rm = TRUE),
+     min_gap = min(gender_gap, na.rm = TRUE)
+   )
> gender_gap_stats
# A tibble: 1 × 3
  avg_gap max_gap min_gap
    <dbl>   <dbl>   <dbl>
1    5.68    7.60    3.90

On average, females live approximately 5.7 years longer than males across states. The largest gap is 7.6 years, which matches New Mexico in the bottom-10 table (77.0 versus 69.4 years). States such as Oklahoma, Colorado, Texas, and Maryland, with gaps of about 5.6 years, sit close to the average rather than at the extreme. The smallest gap is found in Utah, where females live about 3.9 years longer. This consistent female advantage is in line with broader demographic trends.
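
One way to check which states sit at the extremes of the gap is to sort by gender_gap directly. The sketch below uses only four states whose male and female values appear in the printed tables, so it illustrates the technique rather than reproducing the full result; gap_demo, largest, and smallest are hypothetical names.

```r
library(dplyr)

# Four states with values copied from the top-10 and bottom-10 tables
gap_demo <- tibble(
  state       = c("New Mexico", "Oklahoma", "Hawaii", "Kentucky"),
  life_male   = c(69.4, 70.0, 77.0, 69.6),
  life_female = c(77.0, 75.6, 83.1, 75.3)
) %>%
  mutate(gender_gap = life_female - life_male)

# slice_max()/slice_min() keep ties by default, so every state
# sharing the extreme value would be returned
largest  <- gap_demo %>% slice_max(gender_gap, n = 1)
smallest <- gap_demo %>% slice_min(gender_gap, n = 1)

largest
smallest
```

Running the same two slice calls on the full df_clean would list the states behind max_gap and min_gap.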


> t.test(df_clean$life_female, df_clean$life_male, paired = TRUE)

	Paired t-test

data:  df_clean$life_female and df_clean$life_male
t = 66.688, df = 50, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 5.511207 5.853499
sample estimates:
mean difference 
       5.682353 

The paired t-test confirms that the difference between female and male life expectancy is statistically significant (t = 66.69, df = 50, p-value < 2.2e-16). The mean difference is 5.68 years, with a 95% confidence interval of 5.51 to 5.85 years, which is strong evidence that females have higher life expectancy than males across U.S. states.
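
A paired t-test is equivalent to a one-sample t-test on the per-state differences, so the reported numbers can be cross-checked from the output alone. The sketch below backs out the standard error from the reported t statistic and mean difference and rebuilds the confidence interval. It uses only figures printed in the output and base R; the variable names are chosen for this example.

```r
mean_diff <- 5.682353   # sample estimate from the output
t_stat    <- 66.688     # reported t statistic
dof       <- 50         # reported degrees of freedom
n         <- dof + 1    # number of paired observations

# t = mean_diff / se, so se = mean_diff / t
se <- mean_diff / t_stat

# Implied standard deviation of the female-minus-male differences
sd_diff <- se * sqrt(n)

# 95% confidence interval: mean_diff +/- critical t value * se
t_crit <- qt(0.975, dof)
ci <- mean_diff + c(-1, 1) * t_crit * se

round(sd_diff, 2)
round(ci, 2)
```

The implied standard deviation of the differences is about 0.61 years, which is why a mean gap of 5.68 years produces such a large t statistic.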


> # 8. BAR CHART OF LIFE EXPECTANCY
> ggplot(df_clean, aes(x = reorder(state, life_total), y = life_total)) +
+   geom_col(fill = "steelblue") +
+   coord_flip() +
+   labs(
+     title = "Life Expectancy by State (2021)",
+     x = "",
+     y = "Life Expectancy (Years)"
+   ) +
+   theme_minimal()

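[Figure: horizontal bar chart, "Life Expectancy by State (2021)", states ordered by life expectancy in years]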

The bar chart shows the variation in life expectancy across states. States like Hawaii and Massachusetts are placed at the top, while states like West Virginia and Mississippi are placed at the bottom. The chart makes the geographic disparities in overall life expectancy easy to see.


> # 9. BOXPLOT OF LIFE EXPECTANCY DISTRIBUTION
> ggplot(df_clean, aes(y = life_total)) +
+   geom_boxplot(fill = "lightgreen") +
+   labs(
+     title = "Distribution of Life Expectancy Across U.S. States (2021)",
+     y = "Life Expectancy (Years)"
+   ) +
+   theme_minimal()

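[Figure: boxplot, "Distribution of Life Expectancy Across U.S. States (2021)", y-axis: Life Expectancy (Years)]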

The boxplot shows that most states cluster near the average, with a few outliers. Although regional variation is present, most states have a life expectancy between roughly 75 and 77 years.
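
geom_boxplot() draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. A rough sketch of that rule in base R follows. The vector x is made up for illustration and is not the project data, so the quartiles and fences it produces say nothing about the actual states.

```r
# Made-up life expectancy values (years), for illustration only
x <- c(70.9, 74.8, 75.2, 75.6, 76.0, 76.3, 76.7, 77.1, 79.9)

# First and third quartiles and the interquartile range
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Points beyond these fences are drawn as individual outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Applying the same steps to df_clean$life_total would identify which states the boxplot marks as outliers.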


> # 10. GENDER GAP PLOT
> ggplot(df_clean, aes(x = gender_gap, y = reorder(state, gender_gap))) +
+   geom_col(fill = "purple") +
+   labs(
+     title = "Female–Male Life Expectancy Gap by State (2021)",
+     x = "Years (Female – Male)",
+     y = ""
+   ) +
+   theme_minimal()

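[Figure: horizontal bar chart, "Female–Male Life Expectancy Gap by State (2021)", x-axis: Years (Female – Male)]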

The gender gap chart shows that females live longer than males in every state (the smallest gap is 3.9 years), although the magnitude of the advantage varies. States in the western U.S. generally show higher overall life expectancy, but the gender gap is more pronounced in certain states in the South and West. This points to both sex-based and regional disparities in health outcomes.

Conclusion

This analysis shows that females have higher life expectancy than males in every U.S. state, with an average gap of about 5.7 years. There is also substantial geographic variation, with states in the North and West generally outperforming states in the South. The graphs and statistical tests give strong evidence of these patterns and provide an objective summary of state-level life expectancy differences.

By combining descriptive and inferential statistics with effective visuals, this project offers insights into how life expectancy varies by sex and region in the U.S. and shows how statistical analysis can inform public health understanding.

