We start with a CSV file containing student ID, exercise names, and grades. The goal is to transform this data into a more accessible format: one row per student, with columns representing each exercise and the corresponding grades. The initial format of our student grade data had one row for each exercise a student completed, which can make analysis cumbersome if we want a comprehensive view of each student's performance across all exercises.
import pandas as pd
# Load the dataset
df = pd.read_csv('C:/Users/juanc/Downloads/ejercitaciones/grades.csv')
# Preview the first few rows of the dataset
df.head()
| ID | Grade | Exercise | |
|---|---|---|---|
| 123 | 1143843875 | 100 | 10 |
| 124 | 1140313900 | 90 | 10 |
| 125 | 1144375276 | 85 | 10 |
| 126 | 1144369034 | 60 | 10 |
| 127 | 1142733503 | 95 | 10 |
# Preview the last few rows of the dataset
df.tail()
| ID | Grade | Exercise | |
|---|---|---|---|
| 123 | 1143843875 | 100 | 10 |
| 124 | 1140313900 | 90 | 10 |
| 125 | 1144375276 | 85 | 10 |
| 126 | 1144369034 | 60 | 10 |
| 127 | 1142733503 | 95 | 10 |
# Pivot the dataframe
df_pivot = df.pivot_table(index=['ID'], columns='Exercise', values='Grade', fill_value='0')
# Reset the index
df_pivot.reset_index(inplace=True)
df_pivot.head()
| Exercise | ID | 4 | 5 | 8 | 10 |
|---|---|---|---|---|---|
| 0 | 1132123999 | 85.0 | 100.0 | 100.0 | 90.0 |
| 1 | 1134008840 | 90.0 | 85.0 | 80.0 | 100.0 |
| 2 | 1134170686 | 75.0 | 70.0 | 100.0 | 100.0 |
| 3 | 1135663952 | 90.0 | 50.0 | 100.0 | 75.0 |
| 4 | 1135974911 | 90.0 | 70.0 | 100.0 | 90.0 |
We use the pivot_table function in pandas to accomplish this. The parameters we pass to this function are:
index=['ID']: This is the column we want to keep as the identifier for each row. In our case, we use the student's ID. columns='Exercise': This is the column we want to pivot. Each unique value in this column will become a new column in our pivoted table. values='Grade': This is the column that contains the values we want to fill in our new columns. In our case, this is the grade the student received. fill_value='0': This is the value that will be used to fill in any cell for which we don't have data. In our case, if a student didn't participate in an exercise, a '0' is placed instead. After the pivot, each row represents a unique student, and each exercise has its own column with the student's grade for that exercise. This format is much more convenient for analyzing and comparing student performance across different exercises.
# Rename the columns
df_pivot.rename(columns={4: 'Exercise 4', 5: 'Exercise 5', 8: 'Exercise 8', 10: 'Exercise 10'}, inplace=True)
df_pivot.head()
| Exercise | ID | Exercise 4 | Exercise 5 | Exercise 8 | Exercise 10 |
|---|---|---|---|---|---|
| 0 | 1132123999 | 85.0 | 100.0 | 100.0 | 90.0 |
| 1 | 1134008840 | 90.0 | 85.0 | 80.0 | 100.0 |
| 2 | 1134170686 | 75.0 | 70.0 | 100.0 | 100.0 |
| 3 | 1135663952 | 90.0 | 50.0 | 100.0 | 75.0 |
| 4 | 1135974911 | 90.0 | 70.0 | 100.0 | 90.0 |
# Save
df_pivot.to_csv('C:/Users/juanc/Downloads/ejercitaciones/grades_processed.csv', index=False)