EPPS Math Coding Camp

Advanced Data Manipulation using dplyr Package

Home

Published

August 19, 2025

Instructor: Dongeun Kim

Check in form

Advanced Data Manipulation using dplyr Package

Introduction


dplyr is a powerful R package for data manipulation. It provides intuitive and efficient functions for transforming data frames, making it a staple in data analysis workflows. This session will cover key dplyr functions, including group_by(), summarize(), joining data frames, and reshaping data with tidyr.

Loading the Data

Code
# Load necessary packages
# install.packages("dplyr")
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
# Set the working directory
setwd("C:/Users/yohoh/OneDrive - The University of Texas at Dallas/2025EPPSBootCamp/Data")

# Reading the CSV file
life_expectancy_data <- read.csv("US_Life_expectancy.csv")

# View the first few rows of the data
head(life_expectancy_data)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate
1 1900 Female                    48.3                  2410.4
2 1901 Female                    50.6                  2350.5
3 1902 Female                    53.4                  2162.8
4 1903 Female                    52.0                  2250.6
5 1904 Female                    49.1                  2358.8
6 1905 Female                    50.2                  2287.7


The read.csv() function loads the dataset, and head() displays the first few rows, providing an overview of the data structure and variables.

Grouping and Summarizing Data

Grouping Data with group_by()

Code
# Grouping data by Gender
grouped_data <- life_expectancy_data %>% group_by(Gender)


The group_by() function groups the dataset by the specified variable (Gender).
This operation is essential for performing group-wise computations, such as calculating averages or counts.

Aggregate Grouped Data with summarize()

Code
# Summarizing Average Life Expectancy by Gender
summary_data <- grouped_data %>%
  summarize(Average_Life_Expectancy = mean(Average_Life_Expectancy, na.rm = TRUE),
            Average_Death_Rate = mean(Age_adjusted_Death_Rate, na.rm = TRUE))

# Viewing the summarized data
summary_data
# A tibble: 2 × 3
  Gender Average_Life_Expectancy Average_Death_Rate
  <chr>                    <dbl>              <dbl>
1 Female                    69.6              1288.
2 Male                      64.4              1662.


The summarize() function calculates summary statistics for each group created by group_by().
Here, it computes the average life expectancy and average death rate for each gender, ignoring NA values with na.rm = TRUE.
This function is highly customizable, allowing for a wide range of summary calculations.

Exercise:

Group the data by Year, Aggregate by sum the Age_adjusted_Death_Rate

Code
year_summary <- life_expectancy_data %>%
  group_by(Year) %>%
  summarize(Total_Death_Rate = sum(Age_adjusted_Death_Rate, na.rm = TRUE))
head(year_summary)
# A tibble: 6 × 2
   Year Total_Death_Rate
  <int>            <dbl>
1  1900            5041.
2  1901            4951 
3  1902            4611.
4  1903            4764.
5  1904            5012.
6  1905            4853.

Joining Data Frames

Introduction to Joins


Joining data frames is crucial when working with related datasets.
dplyr provides several join functions, such as inner_join(), left_join(), right_join(), and full_join(). These functions merge datasets based on common keys.

Code
# Creating a sample dataset for joining based on the year 2000

data_to_join <- data.frame(
  Year = 1800:2010,
  Year_Category = ifelse(1800:2010 <2010, "Before 2000", "After 2000")
)

# Joining data frames
joined_data_l <- left_join(life_expectancy_data, data_to_join, by = c('Year'))

joined_data_r <- right_join(life_expectancy_data, data_to_join, by = c('Year'))

joined_data_i <- inner_join(life_expectancy_data, data_to_join, by = "Year")

joined_data_o <- full_join(life_expectancy_data, data_to_join, by = "Year")

# Viewing the joined data
head(joined_data_l)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000
Code
head(joined_data_r)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000
Code
head(joined_data_i)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000
Code
head(joined_data_o)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4   Before 2000
2 1901 Female                    50.6                  2350.5   Before 2000
3 1902 Female                    53.4                  2162.8   Before 2000
4 1903 Female                    52.0                  2250.6   Before 2000
5 1904 Female                    49.1                  2358.8   Before 2000
6 1905 Female                    50.2                  2287.7   Before 2000


additional_data_2000 specifies categories for “Before 2000” and “After 2000” for each gender.
left_join() merges life_expectancy_data and additional_data_2000 based on the Gender and Year_Category columns.
In a left join, all rows from the left data frame are kept, with matching rows from the right data frame added.

merge() Or you can use merge().

Parameters: merge(x, y, by = “common_column”, all = FALSE, all.x = FALSE, all.y = FALSE,…)


Exercise:

Create a new dataset with a column specifying if the year is before or after 1945, and join it ON the original dataset, keeping the original dataset intact and without na values. Hint: which side of join you should use?

Code
additional_data_war <- data.frame(
  Year = 1900:2020,
  Year_Category = ifelse(1900:2020 <1945, "Before WW2", "After WW2")
)

# Joining data frames
joined_data_ww2 <- left_join(life_expectancy_data, additional_data_war, by = 'Year')

# Viewing the joined data
head(joined_data_ww2)
  Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female                    48.3                  2410.4    Before WW2
2 1901 Female                    50.6                  2350.5    Before WW2
3 1902 Female                    53.4                  2162.8    Before WW2
4 1903 Female                    52.0                  2250.6    Before WW2
5 1904 Female                    49.1                  2358.8    Before WW2
6 1905 Female                    50.2                  2287.7    Before WW2

Adding data using function with mutate()

Add a column of calculated values using Mutate

Code
life_expectancy_data_dev <- mutate(
  life_expectancy_data, 
  deviation=Average_Life_Expectancy - mean(Average_Life_Expectancy))

Exercise:

  • Normalize the Average_Life_Expectancy between 0 and 1 and store these in a new column named accordingly.

hint:

\[ \text{normalized value} = \frac{\text{value} - \min(\text{value})}{\max(\text{value}) - \min(\text{value})} \]

Code
life_expectancy_data_norm <- mutate(
  life_expectancy_data, 
  life_expectancy_normalized = (Average_Life_Expectancy - min(Average_Life_Expectancy)) / 
                               (max(Average_Life_Expectancy) - min(Average_Life_Expectancy))
)

Wrapping up

In this session, we’ve explored advanced data manipulation techniques using dplyr, including grouping and summarizing data, joining data frames, and reshaping data with tidyr. These skills are crucial for preparing and analyzing data efficiently.

Continue practicing these techniques with different datasets to solidify your understanding and explore new data analysis possibilities.

Reference