In data analysis, we often need to group data based on certain variables and perform analysis at the group level. This could involve calculating group-level summary statistics, or applying more complex transformations or models to each group. The package ‘dplyr’ in R provides powerful and flexible tools for this purpose.
First, we load the ‘dplyr’ package. If it is not yet installed, run install.packages("dplyr") once beforehand.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Suppose we have a data frame with data on employees in a company:
# Create a data frame
df <- data.frame(
  department = c("Sales", "Marketing", "Sales", "HR", "Marketing", "HR"),
  employee = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  salary = c(60000, 65000, 70000, 75000, 55000, 80000)
)
We can group the data by department using the group_by() function:
df_grouped <- df %>%
  group_by(department)
Now we can perform group-level operations. For instance, we can calculate the average salary in each department:
df_summary <- df_grouped %>%
  summarise(avg_salary = mean(salary))
print(df_summary)
## # A tibble: 3 × 2
##   department avg_salary
##   <chr>           <dbl>
## 1 HR              77500
## 2 Marketing       60000
## 3 Sales           65000
In this example, df_summary contains the average salary for each department. Note the use of the pipe operator (%>%) to chain operations together, which makes the code more readable.
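The same pattern extends to several statistics at once. The following is a minimal sketch that rebuilds the example data frame and computes a head count, mean, and maximum per department in one summarise() call; the column names n_employees and max_salary are chosen here for illustration only.

```r
library(dplyr)

# Same example data as above, repeated so the snippet runs on its own
df <- data.frame(
  department = c("Sales", "Marketing", "Sales", "HR", "Marketing", "HR"),
  employee = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  salary = c(60000, 65000, 70000, 75000, 55000, 80000)
)

# Several group-level statistics in a single summarise() call
df_stats <- df %>%
  group_by(department) %>%
  summarise(
    n_employees = n(),          # number of rows in each group
    avg_salary  = mean(salary),
    max_salary  = max(salary)
  )

print(df_stats)
```

Each named argument to summarise() becomes one column of the result, with one row per department.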
We could also perform more complex group-level analyses, like fitting separate regression models to each group.
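As a rough sketch of what per-group model fitting might look like, the code below uses group_modify() to fit a linear model within each department and collect the coefficients. The example data has no predictor, so the data frame df_exp and its years column (years of experience) are hypothetical, with made-up values; this is one possible approach, not the only one.

```r
library(dplyr)

# Hypothetical data: 'years' is an assumed column that is not part of
# the original example; all values are made up for illustration.
df_exp <- data.frame(
  department = rep(c("Sales", "Marketing", "HR"), each = 4),
  years      = rep(c(1, 3, 5, 7), times = 3),
  salary     = c(50000, 56000, 62000, 68000,   # Sales
                 48000, 52000, 56000, 60000,   # Marketing
                 55000, 60000, 65000, 70000)   # HR
)

# Fit salary ~ years separately within each department and
# return the intercept and slope as one row per group.
df_models <- df_exp %>%
  group_by(department) %>%
  group_modify(~ {
    fit <- lm(salary ~ years, data = .x)
    data.frame(
      intercept = coef(fit)[["(Intercept)"]],
      slope     = coef(fit)[["years"]]
    )
  }) %>%
  ungroup()

print(df_models)
```

Inside group_modify(), .x is the subset of rows for the current group, and the function must return a data frame; dplyr binds the per-group results back together and adds the grouping column.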
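Remember, when working with grouped data frames, the grouping persists until it is removed with ungroup(), so later operations such as mutate() or filter() are also applied within each group. Note that group_by() does not convert the grouping variable to a factor: ‘department’ keeps its original type, which is character here, as the <chr> label in the output above shows.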
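It can be worth checking directly how dplyr stores the grouping. The short sketch below inspects the type of the grouping column and the grouping metadata, then removes the grouping with ungroup(). stringsAsFactors = FALSE is set explicitly as a precaution, since older R versions (before 4.0) converted strings to factors in data.frame() by default.

```r
library(dplyr)

# Same example data as above; strings are kept as character vectors
df <- data.frame(
  department = c("Sales", "Marketing", "Sales", "HR", "Marketing", "HR"),
  employee = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  salary = c(60000, 65000, 70000, 75000, 55000, 80000),
  stringsAsFactors = FALSE
)

df_grouped <- df %>%
  group_by(department)

# group_by() records the grouping as metadata; it does not change
# the type of the grouping column
print(class(df_grouped$department))
print(group_vars(df_grouped))
print(is_grouped_df(df_grouped))

# ungroup() drops the grouping metadata and returns a plain tibble
df_plain <- df_grouped %>%
  ungroup()
print(is_grouped_df(df_plain))
```

Calling ungroup() once the group-level work is done avoids later operations being silently applied per group.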