In this tutorial, we cover some basic commands for conducting an introductory analysis of a dataset. The goal of this tutorial is to provide me with handy snippets of code that I can use whenever I need to analyze a dataset. The dataset we will be analyzing here is about video games and was obtained from Kaggle: Popular Video Games Dataset by Matheus Fonseca Chaves.
import pandas as pd
# Load the data
data = pd.read_csv('C:/Users/juanc/Downloads/archive/backloggd_games.csv')
# Show the first few rows of the dataframe
data.head()
| Unnamed: 0 | Title | Release_Date | Developers | Summary | Platforms | Genres | Rating | Plays | Playing | Backlogs | Wishlist | Lists | Reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Elden Ring | Feb 25, 2022 | ['FromSoftware', 'Bandai Namco Entertainment'] | Elden Ring is a fantasy, action and open world... | ['Windows PC', 'PlayStation 4', 'Xbox One', 'P... | ['Adventure', 'RPG'] | 4.5 | 21K | 4.1K | 5.6K | 5.5K | 4.6K | 3K |
| 1 | 1 | The Legend of Zelda: Breath of the Wild | Mar 03, 2017 | ['Nintendo', 'Nintendo EPD Production Group No... | The Legend of Zelda: Breath of the Wild is the... | ['Wii U', 'Nintendo Switch'] | ['Adventure', 'Puzzle'] | 4.4 | 35K | 3.1K | 5.6K | 3K | 5.1K | 3K |
| 2 | 2 | Hades | Dec 07, 2018 | ['Supergiant Games'] | A rogue-lite hack and slash dungeon crawler in... | ['Windows PC', 'Mac', 'PlayStation 4', 'Xbox O... | ['Adventure', 'Brawler', 'Indie', 'RPG'] | 4.3 | 25K | 3.5K | 7.3K | 4K | 3.2K | 2.1K |
| 3 | 3 | Hollow Knight | Feb 24, 2017 | ['Team Cherry'] | A 2D metroidvania with an emphasis on close co... | ['Windows PC', 'Mac', 'Linux', 'Nintendo Switch'] | ['Adventure', 'Indie', 'Platform'] | 4.4 | 25K | 2.7K | 9.6K | 2.6K | 3.4K | 2.1K |
| 4 | 4 | Undertale | Sep 15, 2015 | ['tobyfox', '8-4'] | A small child falls into the Underground, wher... | ['Windows PC', 'Mac', 'Linux', 'PlayStation 4'... | ['Adventure', 'Indie', 'RPG', 'Turn Based Stra... | 4.2 | 32K | 728 | 5.7K | 2.1K | 3.9K | 2.5K |
# Shape of the dataset
print("Shape of the dataset:", data.shape)
Shape of the dataset: (60000, 14)
The dataset contains 60,000 rows and 14 columns.
# Data types of each column
print("Data types of each column:")
print(data.dtypes)
# Basic statistics
data.describe(include='all')
Data types of each column: Unnamed: 0 int64 Title object Release_Date object Developers object Summary object Platforms object Genres object Rating float64 Plays object Playing object Backlogs object Wishlist object Lists object Reviews object dtype: object
| Unnamed: 0 | Title | Release_Date | Developers | Summary | Platforms | Genres | Rating | Plays | Playing | Backlogs | Wishlist | Lists | Reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60000.000000 | 60000 | 60000 | 60000 | 55046 | 60000 | 60000 | 25405.000000 | 60000 | 60000 | 60000 | 60000 | 60000 | 60000 |
| unique | NaN | 40985 | 8956 | 18356 | 38411 | 3169 | 1751 | NaN | 1053 | 453 | 923 | 800 | 810 | 606 |
| top | NaN | Date A Live Twin Edition: Rio Reincarnation | TBD | [] | Date A Live Twin Edition: Rio Reincarnation is... | ['Windows PC'] | [] | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| freq | NaN | 24 | 8019 | 18091 | 24 | 13757 | 6741 | NaN | 13596 | 45518 | 19786 | 24847 | 17942 | 36890 |
| mean | 29999.500000 | NaN | NaN | NaN | NaN | NaN | NaN | 3.033171 | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 17320.652413 | NaN | NaN | NaN | NaN | NaN | NaN | 0.735573 | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | 0.300000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 14999.750000 | NaN | NaN | NaN | NaN | NaN | NaN | 2.600000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 29999.500000 | NaN | NaN | NaN | NaN | NaN | NaN | 3.100000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 44999.250000 | NaN | NaN | NaN | NaN | NaN | NaN | 3.500000 | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 59999.000000 | NaN | NaN | NaN | NaN | NaN | NaN | 5.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
The game "Date A Live Twin Edition: Rio Reincarnation" appears most frequently in the dataset.\ There are 8,956 unique release dates and 'TBD' (To Be Determined) is the most common value.\ There are 18,356 unique developers and an empty list appears most frequently.\ 'Windows PC' is the most common platform the games are available on.\ There are 1,751 unique genres and an empty list appears most frequently. \ The average game rating is approximately 3.03 with a standard deviation of about 0.74. Ratings range from 0.3 to 5.
# Remove the 'Unnamed: 0' column
data = data.drop(['Unnamed: 0'], axis=1)
# Convert 'Release_Date' to datetime
data['Release_Date'] = pd.to_datetime(data['Release_Date'], errors='coerce')
# Convert 'Plays', 'Playing', 'Backlogs', 'Wishlist', 'Lists', 'Reviews' to integers
# We first need to clean the 'K' suffix and convert the data to numeric type
def clean_number(number_str):
if 'K' in number_str:
return float(number_str.replace('K', '')) * 1000
else:
return float(number_str)
for column in ['Plays', 'Playing', 'Backlogs', 'Wishlist', 'Lists', 'Reviews']:
data[column] = data[column].apply(clean_number)
# Check the cleaned data
data.head()
| Title | Release_Date | Developers | Summary | Platforms | Genres | Rating | Plays | Playing | Backlogs | Wishlist | Lists | Reviews | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Elden Ring | 2022-02-25 | ['FromSoftware', 'Bandai Namco Entertainment'] | Elden Ring is a fantasy, action and open world... | ['Windows PC', 'PlayStation 4', 'Xbox One', 'P... | ['Adventure', 'RPG'] | 4.5 | 21000.0 | 4100.0 | 5600.0 | 5500.0 | 4600.0 | 3000.0 |
| 1 | The Legend of Zelda: Breath of the Wild | 2017-03-03 | ['Nintendo', 'Nintendo EPD Production Group No... | The Legend of Zelda: Breath of the Wild is the... | ['Wii U', 'Nintendo Switch'] | ['Adventure', 'Puzzle'] | 4.4 | 35000.0 | 3100.0 | 5600.0 | 3000.0 | 5100.0 | 3000.0 |
| 2 | Hades | 2018-12-07 | ['Supergiant Games'] | A rogue-lite hack and slash dungeon crawler in... | ['Windows PC', 'Mac', 'PlayStation 4', 'Xbox O... | ['Adventure', 'Brawler', 'Indie', 'RPG'] | 4.3 | 25000.0 | 3500.0 | 7300.0 | 4000.0 | 3200.0 | 2100.0 |
| 3 | Hollow Knight | 2017-02-24 | ['Team Cherry'] | A 2D metroidvania with an emphasis on close co... | ['Windows PC', 'Mac', 'Linux', 'Nintendo Switch'] | ['Adventure', 'Indie', 'Platform'] | 4.4 | 25000.0 | 2700.0 | 9600.0 | 2600.0 | 3400.0 | 2100.0 |
| 4 | Undertale | 2015-09-15 | ['tobyfox', '8-4'] | A small child falls into the Underground, wher... | ['Windows PC', 'Mac', 'Linux', 'PlayStation 4'... | ['Adventure', 'Indie', 'RPG', 'Turn Based Stra... | 4.2 | 32000.0 | 728.0 | 5700.0 | 2100.0 | 3900.0 | 2500.0 |
The unnecessary 'Unnamed: 0' column has been removed.\ The 'Release_Date' column has been converted to datetime format.\ The 'Plays', 'Playing', 'Backlogs', 'Wishlist', 'Lists', 'Reviews' columns have been converted from string to float. Strings with a 'K' suffix have been converted to their numerical equivalents (e.g., '21K' to 21000).
# Top 10 games with the highest ratings
top_rated_games = data.sort_values('Rating', ascending=False).head(10)
top_rated_games[['Title', 'Rating']]
| Title | Rating | |
|---|---|---|
| 9549 | 700,000 Games | 5.0 |
| 46118 | Etherlords | 5.0 |
| 46046 | Etherlords | 5.0 |
| 3928 | Ys X: Nordics | 5.0 |
| 4621 | Jack Jeanne | 5.0 |
| 25984 | Animals Fight | 5.0 |
| 10434 | Aventura Copilului Albastru si Urat | 5.0 |
| 19901 | Goalkeeper Training VR | 5.0 |
| 9656 | 700,000 Games | 5.0 |
| 46153 | Etherlords | 5.0 |
Note that some games appear more than once in the top 10 list. This is because they might be released on different platforms or have different versions.
# Top 10 games with the highest number of plays
most_played_games = data.sort_values('Plays', ascending=False).head(10)
most_played_games[['Title', 'Plays']]
| Title | Plays | |
|---|---|---|
| 5 | Minecraft | 38000.0 |
| 1 | The Legend of Zelda: Breath of the Wild | 35000.0 |
| 23 | Grand Theft Auto V | 35000.0 |
| 13 | Portal 2 | 34000.0 |
| 4 | Undertale | 32000.0 |
| 22 | Portal | 32000.0 |
| 35 | Super Smash Bros. Ultimate | 29000.0 |
| 11 | Among Us | 29000.0 |
| 21 | Super Mario Odyssey | 29000.0 |
| 56 | The Elder Scrolls V: Skyrim | 26000.0 |
These are some of the most popular games based on the number of plays.
Next, let's look at the most popular genres. For this, we need to first clean the 'Genres' column as it contains lists represented as strings. We will then count the occurrence of each genre.
import ast
# Clean the 'Genres' column
data['Genres'] = data['Genres'].apply(ast.literal_eval)
# Count the occurrence of each genre
genre_counts = data['Genres'].explode().value_counts()
# Top 10 most popular genres
genre_counts.head(10)
Adventure 19368 Indie 16816 RPG 9234 Simulator 8050 Strategy 7766 Shooter 7258 Puzzle 6935 Platform 6126 Arcade 5603 Sport 3912 Name: Genres, dtype: int64
Adventure and Indie are the most prevalent genres in the dataset, followed by RPG and Simulator.
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style of the plots
sns.set(style="whitegrid")
# Plot 1: Distribution of game ratings
plt.figure(figsize=(10, 6))
sns.histplot(data['Rating'], kde=False, bins=20)
plt.title('Distribution of Game Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
# Plot 2: Top 10 most popular genres
plt.figure(figsize=(10, 6))
sns.barplot(x=genre_counts.head(10), y=genre_counts.head(10).index)
plt.title('Top 10 Most Popular Genres')
plt.xlabel('Number of Games')
plt.ylabel('Genre')
plt.show()
Distribution of Game Ratings: This histogram shows the distribution of game ratings. Each bar represents the frequency of games that fall within a particular rating range. For example, the highest bar represents the rating range with the most games. From the graph, it appears that the ratings are roughly normally distributed around 3, with fewer games receiving very low or very high ratings.
Top 10 Most Popular Genres: This bar chart shows the 10 most popular genres, ranked by the number of games in each genre. The length of each bar represents the number of games in the genre. Adventure and Indie are the most popular genres, followed by RPG (Role-Playing Game) and Simulator.
# Plot 3: Number of games released each year
data['Release_Year'] = data['Release_Date'].dt.year
games_per_year = data['Release_Year'].value_counts().sort_index()
games_per_year = games_per_year[games_per_year.index <= 2021]
plt.figure(figsize=(10, 6))
plt.plot(games_per_year.index, games_per_year.values, marker='o')
plt.xlabel('Year')
plt.ylabel('Number of Games Released')
plt.title('Number of Games Released Each Year')
plt.show()
This line graph shows the number of games released each year. The x-axis represents the year, and the y-axis represents the number of games released that year. Each point on the line represents a specific year. From this graph, we can see the trend of the number of games released over time.