We learned access elements of matrix and vectors in the previous sessions. Indexing in Dataframes is similar. Remember, R uses 1-based indexing (the first element is at position 1, not 0).
Examples:
A quick review of the DataFrame
Code
# Load data ev_data <- readxl::read_excel("Electric_Vehicle_Population_Data.xlsx")head(ev_data)
# A tibble: 6 × 17
`VIN (1-10)` County City State Postal_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRAN…
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODE…
4 1G1RC6S5XH Kitsap Port Orchard WA 98367 2017 CHEVRO… VOLT
5 5UXTA6C09P Snohomish Monroe WA 98272 2023 BMW X5
6 1FMCU0EZXN Yakima Moxee WA 98936 2022 FORD ESCA…
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>
Observe what is the return of the following codes:
Code
ev_data[1, 1]
# A tibble: 1 × 1
`VIN (1-10)`
<chr>
1 1C4RJXN66R
Code
ev_data[, 2]
# A tibble: 194,232 × 1
County
<chr>
1 Snohomish
2 King
3 King
4 Kitsap
5 Snohomish
6 Yakima
7 Thurston
8 Snohomish
9 Snohomish
10 Snohomish
# ℹ 194,222 more rows
Code
ev_data[, c(2,3)]
# A tibble: 194,232 × 2
County City
<chr> <chr>
1 Snohomish Everett
2 King Renton
3 King Seattle
4 Kitsap Port Orchard
5 Snohomish Monroe
6 Yakima Moxee
7 Thurston Olympia
8 Snohomish Monroe
9 Snohomish Everett
10 Snohomish Everett
# ℹ 194,222 more rows
Code
ev_data[1, ]
# A tibble: 1 × 17
`VIN (1-10)` County City State Postal_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRANGLER
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>
Code
ev_data[1:3, ]
# A tibble: 3 × 17
`VIN (1-10)` County City State Postal_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRANGLER
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODEL 3
# ℹ 9 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>
Mind the type of the output. Observe what is the return of the following codes and see the difference:
Code
ev_data[[1, 1]]
[1] "1C4RJXN66R"
Code
class(ev_data[[1, 1]])
[1] "character"
Code
class(ev_data[1, 1])
[1] "tbl_df" "tbl" "data.frame"
What will happen if you run this line and why?
Code
# ev_data[[, 2]]
Exercise:
Load the Electric Vehicle Population Data (Excel) on your machine
# Subsetting specific columnsev_data[, c("Make", "Model", "Electric_Range")]
# A tibble: 194,232 × 3
Make Model Electric_Range
<chr> <chr> <dbl>
1 JEEP WRANGLER 21
2 KIA SOUL 93
3 TESLA MODEL 3 266
4 CHEVROLET VOLT 53
5 BMW X5 30
6 FORD ESCAPE 38
7 TESLA MODEL S 208
8 NISSAN LEAF 84
9 PORSCHE CAYENNE 14
10 NISSAN LEAF 0
# ℹ 194,222 more rows
Exercise:
Get the 10th row’s ‘City’ Column value.
Note
Code
ev_data[10, 'City']
# A tibble: 1 × 1
City
<chr>
1 Everett
1.1.3 Subsetting Data With condition
Subsetting is the process of extracting a portion of a dataset based on specific conditions. This can be useful for focusing on particular groups or conditions within your data.
Detailed Explanation:
Row and Column Subsetting: Use indexing to subset specific rows and columns.
Logical Conditions: Subset based on logical conditions using operators like ==, !=, <, >, &, |.
Examples:
Code
# Subsetting rows based on a conditionev_subset <- ev_data[ev_data$Model_Year ==2020, ]# Subsetting rows and columnsev_subset_cols <- ev_data[ev_data$Model_Year ==2020, c("Make", "Electric_Range")]
Using logic operator to do complicated selection
Code
# Find new cars with high electric rangeev_data[ev_data$Model_Year >2018& ev_data$Electric_Range >=100 , c("Make", "Model")]
# A tibble: 19,761 × 2
Make Model
<chr> <chr>
1 TESLA MODEL 3
2 TESLA MODEL 3
3 TESLA MODEL Y
4 CHEVROLET BOLT EV
5 TESLA MODEL 3
6 TESLA MODEL 3
7 TESLA MODEL 3
8 NISSAN LEAF
9 TESLA MODEL 3
10 TESLA MODEL 3
# ℹ 19,751 more rows
Code
# Find cars that is near me (Richardson is a city in Dallas and Collin counties)ev_data[(ev_data$County =='Collin'| ev_data$County =='Dallas'), c("Make", "Model")]
# A tibble: 10 × 2
Make Model
<chr> <chr>
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> <NA>
6 TESLA MODEL 3
7 <NA> <NA>
8 <NA> <NA>
9 <NA> <NA>
10 <NA> <NA>
Code
# Find cars outside TX that are made by Tesla or is expensiveev_data[ev_data$State !='TX'& (ev_data$Make =='TESLA'| ev_data$`Base MSRP`>=10000), c("Make", "Model")]
# A tibble: 87,104 × 2
Make Model
<chr> <chr>
1 KIA SOUL
2 TESLA MODEL 3
3 TESLA MODEL S
4 TESLA MODEL 3
5 TESLA MODEL S
6 TESLA MODEL 3
7 TESLA MODEL Y
8 TESLA MODEL X
9 TESLA MODEL S
10 TESLA MODEL 3
# ℹ 87,094 more rows
ev_data$Model_Year == 2020: Subsets rows where the year is 2020.
Try subsetting the data using different conditions and explore the resulting subsets.
Exercise:
I want a car that is cheap or is cheap to run. Find the vehicles that are in Texas but not in Richardson, if their Electric_Range is larger than 20 or the Base MSRP is lower than 10000. Return the Make, Model and Electric Vehicle Type.
# A tibble: 22 × 3
Make Model `Electric Vehicle Type`
<chr> <chr> <chr>
1 TESLA MODEL X Battery Electric Vehicle (BEV)
2 TESLA MODEL Y Battery Electric Vehicle (BEV)
3 TESLA MODEL 3 Battery Electric Vehicle (BEV)
4 FORD MUSTANG MACH-E Battery Electric Vehicle (BEV)
5 TOYOTA PRIUS PRIME Plug-in Hybrid Electric Vehicle (PHEV)
6 TESLA MODEL Y Battery Electric Vehicle (BEV)
7 TESLA MODEL Y Battery Electric Vehicle (BEV)
8 TESLA MODEL 3 Battery Electric Vehicle (BEV)
9 TESLA MODEL S Battery Electric Vehicle (BEV)
10 FORD FUSION Plug-in Hybrid Electric Vehicle (PHEV)
# ℹ 12 more rows
1.2 Basic Data Transformation
Data transformation involves modifying data to fit a certain format or to prepare it for analysis. This includes operations like renaming columns, creating new variables, and changing data types.
Detailed Explanation:
Renaming Columns: Use the names() function or dplyr::rename().
Creating New Variables: Add new columns based on calculations or conditions.
Changing Data Types: Convert columns to appropriate data types using functions like as.numeric(), as.factor(), etc.
Code
# Renaming columnsnames(ev_data)[names(ev_data) =="Postal_Code"] <-"Zip_Code"# Creating new variablesev_data$Vehicle_Age <-2025- ev_data$Model_Year# Changing data typesev_data$Make <-as.factor(ev_data$Make)summary(ev_data)
VIN (1-10) County City State
Length:194232 Length:194232 Length:194232 Length:194232
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Zip_Code Model_Year Make Model
Min. : 1731 Min. :1997 TESLA :85320 Length:194232
1st Qu.:98052 1st Qu.:2019 CHEVROLET:14578 Class :character
Median :98125 Median :2022 NISSAN :14360 Mode :character
Mean :98176 Mean :2021 FORD :10266
3rd Qu.:98372 3rd Qu.:2023 KIA : 8497
Max. :99577 Max. :2025 BMW : 8150
NA's :9 (Other) :53061
Electric Vehicle Type Clean Alternative Fuel Vehicle (CAFV) Eligibility
Length:194232 Length:194232
Class :character Class :character
Mode :character Mode :character
Electric_Range Base MSRP Legislative District DOL Vehicle ID
Min. : 0.00 Min. : 0.0 Min. : 1.00 Min. : 4385
1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:17.00 1st Qu.:187225086
Median : 0.00 Median : 0.0 Median :33.00 Median :233940241
Mean : 54.84 Mean : 978.7 Mean :29.01 Mean :224892309
3rd Qu.: 68.00 3rd Qu.: 0.0 3rd Qu.:42.00 3rd Qu.:260115926
Max. :337.00 Max. :845000.0 Max. :49.00 Max. :479254772
NA's :2 NA's :2 NA's :432
Vehicle Location Electric Utility 2020 Census Tract Vehicle_Age
Length:194232 Length:194232 Min. :1.001e+09 Min. : 0.000
Class :character Class :character 1st Qu.:5.303e+10 1st Qu.: 2.000
Mode :character Mode :character Median :5.303e+10 Median : 3.000
Mean :5.298e+10 Mean : 4.218
3rd Qu.:5.305e+10 3rd Qu.: 6.000
Max. :5.602e+10 Max. :28.000
NA's :9
- names(): Retrieves or sets the names of an object’s elements. - as.factor(): Converts a variable to factor
Exercise:
Normalize the scores to a Electric_Range between 0 and 1 and store these in a new column named normalized_range.
VIN (1-10)
0
County
9
City
9
State
0
Zip_Code
9
Model_Year
0
Make
0
Model
0
Electric Vehicle Type
0
Clean Alternative Fuel Vehicle (CAFV) Eligibility
0
Electric_Range
0
Base MSRP
0
Legislative District
432
DOL Vehicle ID
0
Vehicle Location
13
Electric Utility
9
2020 Census Tract
9
Vehicle_Age
0
normalized_range
0
Code
# Indicates which elements are missing within a specific columnev_data[is.na(ev_data$County ),]
# A tibble: 9 × 19
`VIN (1-10)` County City State Zip_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <fct> <chr>
1 1G1RB6S53J <NA> <NA> BC NA 2018 CHEVROLET VOLT
2 7SAYGDEE4N <NA> <NA> CA NA 2022 TESLA MODEL Y
3 7SAYGAEE3P <NA> <NA> CA NA 2023 TESLA MODEL Y
4 3FA6P0PU8H <NA> <NA> KS NA 2017 FORD FUSION
5 5YJXCAE24H <NA> <NA> BC NA 2017 TESLA MODEL X
6 5YJ3E1EA5K <NA> <NA> BC NA 2019 TESLA MODEL 3
7 WBAJA9C50K <NA> <NA> AE NA 2019 BMW 530E
8 YV4H60CL5N <NA> <NA> GA NA 2022 VOLVO XC90
9 5YJ3E1EA1J <NA> <NA> AL NA 2018 TESLA MODEL 3
# ℹ 11 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>, Vehicle_Age <dbl>, normalized_range <dbl>
Code
# Indicates which elements are not missing within a specific columnev_data[!is.na(ev_data$City ),]
# A tibble: 194,221 × 19
`VIN (1-10)` County City State Zip_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <fct> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRAN…
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODE…
4 1G1RC6S5XH Kitsap Port Orchard WA 98367 2017 CHEVROLET VOLT
5 5UXTA6C09P Snohomish Monroe WA 98272 2023 BMW X5
6 1FMCU0EZXN Yakima Moxee WA 98936 2022 FORD ESCA…
7 5YJSA1DNXD Thurston Olympia WA 98506 2013 TESLA MODE…
8 1N4AZ0CP8F Snohomish Monroe WA 98272 2015 NISSAN LEAF
9 WP1AE2A21J Snohomish Everett WA 98208 2018 PORSCHE CAYE…
10 1N4BZ1BV4N Snohomish Everett WA 98208 2022 NISSAN LEAF
# ℹ 194,211 more rows
# ℹ 11 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>, Vehicle_Age <dbl>, normalized_range <dbl>
Code
# Removes missing values of a specific column by subsettingev_data <- ev_data[!is.na(ev_data$City),]# Check na Removalsum(is.na(ev_data))
[1] 427
Code
# Removes ALL missing values in data framena.omit(ev_data)
# A tibble: 193,794 × 19
`VIN (1-10)` County City State Zip_Code Model_Year Make Model
<chr> <chr> <chr> <chr> <dbl> <dbl> <fct> <chr>
1 1C4RJXN66R Snohomish Everett WA 98204 2024 JEEP WRAN…
2 KNDJX3AEXG King Renton WA 98058 2016 KIA SOUL
3 5YJ3E1EA3L King Seattle WA 98125 2020 TESLA MODE…
4 1G1RC6S5XH Kitsap Port Orchard WA 98367 2017 CHEVROLET VOLT
5 5UXTA6C09P Snohomish Monroe WA 98272 2023 BMW X5
6 1FMCU0EZXN Yakima Moxee WA 98936 2022 FORD ESCA…
7 5YJSA1DNXD Thurston Olympia WA 98506 2013 TESLA MODE…
8 1N4AZ0CP8F Snohomish Monroe WA 98272 2015 NISSAN LEAF
9 WP1AE2A21J Snohomish Everett WA 98208 2018 PORSCHE CAYE…
10 1N4BZ1BV4N Snohomish Everett WA 98208 2022 NISSAN LEAF
# ℹ 193,784 more rows
# ℹ 11 more variables: `Electric Vehicle Type` <chr>,
# `Clean Alternative Fuel Vehicle (CAFV) Eligibility` <chr>,
# Electric_Range <dbl>, `Base MSRP` <dbl>, `Legislative District` <dbl>,
# `DOL Vehicle ID` <dbl>, `Vehicle Location` <chr>, `Electric Utility` <chr>,
# `2020 Census Tract` <dbl>, Vehicle_Age <dbl>, normalized_range <dbl>
Code
# Check na Removalsum(is.na(ev_data))
[1] 427
Exercise:
Omit the missing value in MSRP column and run the exercises again, tell me if there is any difference.
1.4 Introduction to the dplyr Package
dplyr is a powerful package for data manipulation in R. It provides a set of functions (verbs) that help in transforming and summarizing data easily.
dplyr is a powerful R package for data manipulation. It provides intuitive and efficient functions for transforming data frames, making it a staple in data analysis workflows. This session will cover key dplyr functions, including group_by(), summarize(), joining data frames, and reshaping data with tidyr.
# Grouping data by Gendergrouped_data <- life_expectancy_data %>%group_by(Gender)
The group_by() function groups the dataset by the specified variable (Gender). This operation is essential for performing group-wise computations, such as calculating averages or counts.
Code
# Summarizing Average Life Expectancy by Gendersummary_data <- grouped_data %>%summarize(Average_Life_Expectancy =mean(Average_Life_Expectancy, na.rm =TRUE),Average_Death_Rate =mean(Age_adjusted_Death_Rate, na.rm =TRUE))# Viewing the summarized datasummary_data
# A tibble: 2 × 3
Gender Average_Life_Expectancy Average_Death_Rate
<chr> <dbl> <dbl>
1 Female 69.6 1288.
2 Male 64.4 1662.
The summarize() function calculates summary statistics for each group created by group_by().
Here, it computes the average life expectancy and average death rate for each gender, ignoring NA values with na.rm = TRUE.
This function is highly customizable, allowing for a wide range of summary calculations.
Joining data frames is crucial when working with related datasets.
dplyr provides several join functions, such as inner_join(), left_join(), right_join(), and full_join(). These functions merge datasets based on common keys.
Code
# Creating a sample dataset for joining based on the year 2000data_to_join <-data.frame(Year =1800:2010,Year_Category =ifelse(1800:2010<2010, "Before 2000", "After 2000"))# Joining data framesjoined_data_l <-left_join(life_expectancy_data, data_to_join, by =c('Year'))joined_data_r <-right_join(life_expectancy_data, data_to_join, by =c('Year'))joined_data_i <-inner_join(life_expectancy_data, data_to_join, by ="Year")joined_data_o <-full_join(life_expectancy_data, data_to_join, by ="Year")# Viewing the joined datahead(joined_data_l)
Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female 48.3 2410.4 Before 2000
2 1901 Female 50.6 2350.5 Before 2000
3 1902 Female 53.4 2162.8 Before 2000
4 1903 Female 52.0 2250.6 Before 2000
5 1904 Female 49.1 2358.8 Before 2000
6 1905 Female 50.2 2287.7 Before 2000
Code
head(joined_data_r)
Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female 48.3 2410.4 Before 2000
2 1901 Female 50.6 2350.5 Before 2000
3 1902 Female 53.4 2162.8 Before 2000
4 1903 Female 52.0 2250.6 Before 2000
5 1904 Female 49.1 2358.8 Before 2000
6 1905 Female 50.2 2287.7 Before 2000
Code
head(joined_data_i)
Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female 48.3 2410.4 Before 2000
2 1901 Female 50.6 2350.5 Before 2000
3 1902 Female 53.4 2162.8 Before 2000
4 1903 Female 52.0 2250.6 Before 2000
5 1904 Female 49.1 2358.8 Before 2000
6 1905 Female 50.2 2287.7 Before 2000
Code
head(joined_data_o)
Year Gender Average_Life_Expectancy Age_adjusted_Death_Rate Year_Category
1 1900 Female 48.3 2410.4 Before 2000
2 1901 Female 50.6 2350.5 Before 2000
3 1902 Female 53.4 2162.8 Before 2000
4 1903 Female 52.0 2250.6 Before 2000
5 1904 Female 49.1 2358.8 Before 2000
6 1905 Female 50.2 2287.7 Before 2000
additional_data_2000 specifies categories for “Before 2000” and “After 2000” for each gender.
left_join() merges life_expectancy_data and additional_data_2000 based on the Gender and Year_Category columns.
In a left join, all rows from the left data frame are kept, with matching rows from the right data frame added.
merge() Or you can use merge().
Parameters: merge(x, y, by = “common_column”, all = FALSE, all.x = FALSE, all.y = FALSE,…)
Session 2: Hands-on Exercises
2.1 Subsetting and Filtering Data
Subset the US_Life_expectancy data for a specific state and year.
Select only the columns “Year” and “Age_adjusted_Death_Rate” from US_Life_expectancy data.
Create a new column in the Electric Vehicle data representing the vehicle’s age.
Note
Code
# Your code herenames(life_data)[names(life_data) =="Death_Rate"] <-"Age_adjusted_Death_Rate"ev_data$Vehicle_Age <-2025- ev_data$Model_Year
Session 3: Wrapping Up and Q&A
In this session, we’ve covered essential data manipulation techniques, including subsetting, transforming data, and using the dplyr package. These skills are critical for preparing and analyzing data effectively.
Continue exploring different data manipulation techniques and apply them to various datasets to enhance your proficiency.
Q&A: Feel free to ask questions or share your experiences with data manipulation in R. Let’s discuss any challenges or interesting findings from your exercises.
---title: "EPPS Math Coding Camp"subtitle: "Basic Data Manipulation"author: "Brennan Stout"date: "August 5, 2025"format: html: toc: true toc-depth: 3 code-fold: show code-tools: true highlight-style: githubeditor: markdown: wrap: 72---```{r setup, include=FALSE}knitr::opts_chunk$set(echo = TRUE)```# Session 1: Basic Data Manipulation## 1.1.1 Slicing/IndexWe learned access elements of matrix and vectors in the previoussessions. Indexing in Dataframes is similar. Remember, R uses 1-basedindexing (the first element is at position 1, not 0).### Examples:A quick review of the DataFrame```{r}# Load data ev_data <- readxl::read_excel("Electric_Vehicle_Population_Data.xlsx")head(ev_data)```Observe what is the return of the following codes:```{r}ev_data[1, 1]ev_data[, 2]ev_data[, c(2,3)]ev_data[1, ]ev_data[1:3, ]```Mind the type of the output. Observe what is the return of the followingcodes and see the difference:```{r}ev_data[[1, 1]]class(ev_data[[1, 1]])class(ev_data[1, 1])```What will happen if you run this line and why?```{r}# ev_data[[, 2]]```### Exercise:- 1. Load the Electric Vehicle Population Data (Excel) on your machine- 2. Return the Model_Year Columns using only indexing::: {.callout-note collapse="true"}```{r}ev_data[,6]```:::## 1.1.2 Using Variable Names```{r}# Subsetting specific columnhead(ev_data$Make)# Subsetting specific columnsev_data[, c("Make", "Model", "Electric_Range")]```### Exercise:- Get the 10th row's 'City' Column value.::: {.callout-note collapse="true"}```{r}ev_data[10, 'City']```:::## 1.1.3 Subsetting Data With conditionSubsetting is the process of extracting a portion of a dataset based onspecific conditions. This can be useful for focusing on particulargroups or conditions within your data.### Detailed Explanation:- **Row and Column Subsetting**: Use indexing to subset specific rows and columns.- **Logical Conditions**: Subset based on logical conditions using operators like `==`, `!=`, `<`, `>`, `&`, `|`.### Examples:```{r}# Subsetting rows based on a conditionev_subset <- ev_data[ev_data$Model_Year ==2020, ]# Subsetting rows and columnsev_subset_cols <- ev_data[ev_data$Model_Year ==2020, c("Make", "Electric_Range")]```# Using logic operator to do complicated selection```{r}# Find new cars with high electric rangeev_data[ev_data$Model_Year >2018& ev_data$Electric_Range >=100 , c("Make", "Model")]# Find cars that is near me (Richardson is a city in Dallas and Collin counties)ev_data[(ev_data$County =='Collin'| ev_data$County =='Dallas'), c("Make", "Model")]# Find cars outside TX that are made by Tesla or is expensiveev_data[ev_data$State !='TX'& (ev_data$Make =='TESLA'| ev_data$`Base MSRP`>=10000), c("Make", "Model")]```<br>- ev_data\$Model_Year == 2020: Subsets rows where the year is 2020. <br>- <br> Try subsetting the data using different conditions and explore the resulting subsets. <br>### Exercise:- 1. I want a car that is cheap or is cheap to run. Find the vehicles that are in Texas but not in Richardson, if their Electric_Range is larger than 20 or the Base MSRP is lower than 10000. Return the Make, Model and Electric Vehicle Type.::: {.callout-note collapse="true"}```{r}ev_data[(ev_data$State =='TX'& ev_data$City !='Richardson') & (ev_data$Electric_Range >=200| ev_data$`Base MSRP`<=10000),c("Make", "Model", "Electric Vehicle Type")]```:::## 1.2 Basic Data TransformationData transformation involves modifying data to fit a certain format orto prepare it for analysis. This includes operations like renamingcolumns, creating new variables, and changing data types. <br>#### Detailed Explanation:- **Renaming Columns**: Use the names() function or dplyr::rename(). <br>- **Creating New Variables**: Add new columns based on calculations or conditions. <br>- **Changing Data Types**: Convert columns to appropriate data types using functions like as.numeric(), as.factor(), etc. <br>```{r}# Renaming columnsnames(ev_data)[names(ev_data) =="Postal_Code"] <-"Zip_Code"# Creating new variablesev_data$Vehicle_Age <-2025- ev_data$Model_Year# Changing data typesev_data$Make <-as.factor(ev_data$Make)summary(ev_data)```<br> - **names()**: Retrieves or sets the names of an object's elements.<br> - **as.factor()**: Converts a variable to factor<br> <br>### Exercise:- Normalize the scores to a Electric_Range between 0 and 1 and store these in a new column named normalized_range.hint:$$\text{normalized value} = \frac{\text{value} - \min(\text{value})}{\max(\text{value}) - \min(\text{value})}$$::: {.callout-note collapse="true"}```{r}class(ev_data$Electric_Range)ev_data <- ev_data[!is.na(ev_data$Electric_Range),]ev_data$normalized_range <- (ev_data$Electric_Range -min(ev_data$Electric_Range)) / (max(ev_data$Electric_Range) -min(ev_data$Electric_Range))ev_data$normalized_range```:::## 1.3 Missing Values```{r}# Count number na in the dfsum(is.na(ev_data))# Count na of each colcolSums(is.na(ev_data))# Indicates which elements are missing within a specific columnev_data[is.na(ev_data$County ),]# Indicates which elements are not missing within a specific columnev_data[!is.na(ev_data$City ),]# Removes missing values of a specific column by subsettingev_data <- ev_data[!is.na(ev_data$City),]# Check na Removalsum(is.na(ev_data))# Removes ALL missing values in data framena.omit(ev_data)# Check na Removalsum(is.na(ev_data))```### Exercise:- Omit the missing value in MSRP column and run the exercises again, tell me if there is any difference.## 1.4 Introduction to the dplyr Packagedplyr is a powerful package for data manipulation in R. It provides aset of functions (verbs) that help in transforming and summarizing dataeasily. <br>**dplyr** is a powerful R package for data manipulation. It providesintuitive and efficient functions for transforming data frames, makingit a staple in data analysis workflows. This session will cover key`dplyr` functions, including `group_by()`, `summarize()`, joining dataframes, and reshaping data with `tidyr`.```{r, warning=FALSE}# Load necessary packages# install.packages("dplyr")library(dplyr)# Reading the CSV filelife_expectancy_data <- read.csv("US_Life_expectancy.csv")# View the first few rows of the datahead(life_expectancy_data)``````{r}# Grouping data by Gendergrouped_data <- life_expectancy_data %>%group_by(Gender)```The **group_by()** function groups the dataset by the specified variable(Gender). This operation is essential for performing group-wisecomputations, such as calculating averages or counts.```{r}# Summarizing Average Life Expectancy by Gendersummary_data <- grouped_data %>%summarize(Average_Life_Expectancy =mean(Average_Life_Expectancy, na.rm =TRUE),Average_Death_Rate =mean(Age_adjusted_Death_Rate, na.rm =TRUE))# Viewing the summarized datasummary_data```The **summarize()** function calculates summary statistics for eachgroup created by **group_by()**.Here, it computes the average life expectancy and average death rate foreach gender, ignoring NA values with **na.rm = TRUE.**This function is highly customizable, allowing for a wide range ofsummary calculations.Joining data frames is crucial when working with related datasets.dplyr provides several join functions, such as **inner_join()**,**left_join()**, **right_join()**, and **full_join()**. These functionsmerge datasets based on common keys.```{r}# Creating a sample dataset for joining based on the year 2000data_to_join <-data.frame(Year =1800:2010,Year_Category =ifelse(1800:2010<2010, "Before 2000", "After 2000"))# Joining data framesjoined_data_l <-left_join(life_expectancy_data, data_to_join, by =c('Year'))joined_data_r <-right_join(life_expectancy_data, data_to_join, by =c('Year'))joined_data_i <-inner_join(life_expectancy_data, data_to_join, by ="Year")joined_data_o <-full_join(life_expectancy_data, data_to_join, by ="Year")# Viewing the joined datahead(joined_data_l)head(joined_data_r)head(joined_data_i)head(joined_data_o)```**additional_data_2000** specifies categories for "Before 2000" and"After 2000" for each gender.**left_join()** merges life_expectancy_data and additional_data_2000based on the Gender and Year_Category columns.In a left join, all rows from the left data frame are kept, withmatching rows from the right data frame added.**merge()** Or you can use merge().Parameters: merge(x, y, by = "common_column", all = FALSE, all.x =FALSE, all.y = FALSE,...)# Session 2: Hands-on Exercises## 2.1 Subsetting and Filtering Data1) Subset the US_Life_expectancy data for a specific state and year.2) Select only the columns "Year" and "Age_adjusted_Death_Rate" from US_Life_expectancy data.::: {.callout-note collapse="true"}```{r}# Your code herelife_data <-read.csv("US_Life_expectancy.csv")life_gender_year <- life_data[life_data$Gender =="Gender"& life_data$Year ==2000, ]life_make_model <- life_data[, c("Year", "Age_adjusted_Death_Rate")]```<br>:::## 2.2 Data Transformation1) Rename a column in the US_Life_expectancy data. <br>2) Create a new column in the Electric Vehicle data representing the vehicle's age. <br>::: {.callout-note collapse="true"}```{r}# Your code herenames(life_data)[names(life_data) =="Death_Rate"] <-"Age_adjusted_Death_Rate"ev_data$Vehicle_Age <-2025- ev_data$Model_Year```<br>:::# Session 3: Wrapping Up and Q&AIn this session, we've covered essential data manipulation techniques,including subsetting, transforming data, and using the dplyr package.These skills are critical for preparing and analyzing data effectively.<br>Continue exploring different data manipulation techniques and apply themto various datasets to enhance your proficiency. <br>**Q&A**: Feel free to ask questions or share your experiences with datamanipulation in R. Let's discuss any challenges or interesting findingsfrom your exercises. <br>### Reference- [data programming with R](https://datageneration.io/dataprogrammingwithr/intro)- Chicago Harris School Coding Camp- [The U.S. government's open data portal](https://data.gov/)