Grouping data

In data analysis, we often need to group data based on certain variables and perform analysis at the group level. This could involve calculating group-level summary statistics, or applying more complex transformations or models to each group. The package ‘dplyr’ in R provides powerful and flexible tools for this purpose.

First, we need to load the ‘dplyr’ package. If it is not installed yet, run install.packages("dplyr") once beforehand.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Suppose we have a data frame with data on employees in a company:

# Create a data frame
df <- data.frame(
  department = c("Sales", "Marketing", "Sales", "HR", "Marketing", "HR"),
  employee = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  salary = c(60000, 65000, 70000, 75000, 55000, 80000)
)

We can group the data by department using the group_by() function:

df_grouped <- df %>%
  group_by(department)

Now we can perform group-level operations. For instance, we can calculate the average salary in each department:

df_summary <- df_grouped %>%
  summarise(avg_salary = mean(salary))

print(df_summary)

## # A tibble: 3 × 2
##   department avg_salary
##   <chr>           <dbl>
## 1 HR              77500
## 2 Marketing       60000
## 3 Sales           65000

In this example, df_summary contains the average salary for each department. Note the use of the pipe operator (%>%) to chain operations together, which makes the code more readable.
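summarise() is not limited to a single statistic. The sketch below, which reuses the example data from above, computes several group-level summaries in one call; the column names n_employees, min_salary and max_salary are chosen here for illustration only.

```r
library(dplyr)

# Same example data as above
df <- data.frame(
  department = c("Sales", "Marketing", "Sales", "HR", "Marketing", "HR"),
  employee = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  salary = c(60000, 65000, 70000, 75000, 55000, 80000)
)

# One row per department, several summary columns at once
df_stats <- df %>%
  group_by(department) %>%
  summarise(
    n_employees = n(),          # number of rows in the group
    avg_salary = mean(salary),
    min_salary = min(salary),
    max_salary = max(salary)
  )

print(df_stats)
```

Each argument to summarise() becomes one column of the result, so further statistics can be added in the same way.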

We could also perform more complex group-level analyses, like fitting separate regression models to each group.
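As a rough sketch of what a per-group regression could look like, the example below uses dplyr's group_modify() together with lm(). The data frame df_exp and its experience column are hypothetical (the employee data above has no predictor to regress on), and this is only one of several possible approaches.

```r
library(dplyr)

# Hypothetical data: 'experience' (in years) is not part of the original example
df_exp <- data.frame(
  department = rep(c("Sales", "HR"), each = 4),
  experience = c(1, 3, 5, 7, 2, 4, 6, 8),
  salary = c(52000, 58000, 66000, 71000, 60000, 67000, 72000, 80000)
)

# Fit salary ~ experience separately within each department
# and keep the estimated coefficients as one row per group
df_models <- df_exp %>%
  group_by(department) %>%
  group_modify(~ {
    fit <- lm(salary ~ experience, data = .x)
    data.frame(
      intercept = unname(coef(fit)[1]),
      slope = unname(coef(fit)[2])
    )
  }) %>%
  ungroup()

print(df_models)
```

Inside group_modify(), .x refers to the rows of the current group, and the data frame returned for each group is stacked into a single result with the grouping column added back.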

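Remember that group_by() does not change the type of the grouping variable: ‘department’ remains a character column, as the <chr> tag in the output above shows. The grouping is stored as metadata on the data frame and persists through later operations until it is removed with ungroup(); summarise() drops the last level of grouping by default.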
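One way to check how the grouping is stored is to inspect the grouped object directly. The following is a minimal sketch that reuses the example data; group_vars(), is_grouped_df() and ungroup() are dplyr functions that do not appear elsewhere in this section, and the comment on the column class assumes R 4.0 or later, where data.frame() no longer converts strings to factors by default.

```r
library(dplyr)

# Same example data as above
df <- data.frame(
  department = c("Sales", "Marketing", "Sales", "HR", "Marketing", "HR"),
  employee = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  salary = c(60000, 65000, 70000, 75000, 55000, 80000)
)

df_grouped <- df %>%
  group_by(department)

group_vars(df_grouped)        # names of the grouping columns
class(df_grouped$department)  # column type is unchanged by group_by()
is_grouped_df(df_grouped)     # TRUE while the grouping is active

# Remove the grouping once group-level work is done
df_ungrouped <- df_grouped %>%
  ungroup()

is_grouped_df(df_ungrouped)   # FALSE after ungroup()
```

Calling ungroup() before further processing avoids later verbs being applied per group by accident.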

Juan Andrés Cabral