Basic data analysis with popular videogames dataset

Cabral Juan Andrés

This tutorial covers some basic commands for an introductory analysis of a dataset. Its goal is to collect handy code snippets that I can reuse whenever I need to analyze a dataset. The dataset analyzed here is about video games and was obtained from Kaggle: Popular Video Games Dataset by Matheus Fonseca Chaves.

Loading the data
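A minimal sketch of the loading step with pandas. The helper name `load_games`, the sample rows, and the 'Title', 'Rating' column names are assumptions made for illustration; only 'Unnamed: 0', 'Release_Date' and 'Plays' are named in this tutorial. The in-memory CSV stands in for the downloaded Kaggle file so the snippet runs on its own.

```python
import io

import pandas as pd


def load_games(source):
    """Read the CSV into a DataFrame; `source` may be a file path or a file-like object."""
    return pd.read_csv(source)


# Tiny in-memory stand-in for the real file (illustrative rows, not real data).
# The empty first header cell is what makes pandas create the 'Unnamed: 0' column.
sample_csv = io.StringIO(
    ',Title,Release_Date,Rating,Plays\n'
    '0,Game A,"Feb 25, 2022",4.5,21K\n'
    '1,Game B,TBD,,350\n'
)

games = load_games(sample_csv)
print(games.shape)             # (rows, columns)
print(games.columns.tolist())  # column names, including 'Unnamed: 0'
print(games.head())
```

With the real dataset, `sample_csv` would be replaced by the path to the downloaded CSV file.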

Basic description of the data

The dataset contains 60,000 rows and 14 columns.

The game "Date A Live Twin Edition: Rio Reincarnation" appears most frequently in the dataset.
There are 8,956 unique release dates, and 'TBD' (To Be Determined) is the most common value.
There are 18,356 unique values in the developers column, and an empty list is the most frequent one.
'Windows PC' is the most common platform the games are available on.
There are 1,751 unique values in the genres column (each value is a list of genres), and an empty list is the most frequent one.
The average game rating is approximately 3.03, with a standard deviation of about 0.74. Ratings range from 0.3 to 5.
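Figures like the ones above can be obtained with a few pandas calls. This is a sketch on a small made-up DataFrame; the column names 'Title' and 'Rating' are assumptions, and the sample values are not taken from the real dataset.

```python
import pandas as pd

# Made-up sample standing in for the loaded dataset.
games = pd.DataFrame({
    "Title": ["Game A", "Game A", "Game B"],
    "Release_Date": ["TBD", "TBD", "Feb 25, 2022"],
    "Rating": [3.0, 4.0, 2.0],
})

# Number of rows and columns.
n_rows, n_cols = games.shape

# One table with count/unique/top/freq for text columns
# and mean/std/min/max for numeric columns.
summary = games.describe(include="all")
print(summary)

# The same facts, column by column.
most_common_title = games["Title"].mode().iloc[0]
n_unique_dates = games["Release_Date"].nunique()
rating_stats = games["Rating"].agg(["mean", "std", "min", "max"])
print(most_common_title, n_unique_dates)
print(rating_stats)
```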

Data cleaning

The unnecessary 'Unnamed: 0' column has been removed.
The 'Release_Date' column has been converted to datetime format.
The 'Plays', 'Playing', 'Backlogs', 'Wishlist', 'Lists', 'Reviews' columns have been converted from string to float. Strings with a 'K' suffix have been converted to their numerical equivalents (e.g., '21K' to 21000).
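The cleaning steps described above could look roughly like this. The helper `parse_count` and the sample rows are hypothetical; the column names are the ones listed above. Values that cannot be parsed as dates, such as 'TBD', are assumed to become missing values (`NaT`).

```python
import pandas as pd


def parse_count(value):
    """Convert strings such as '21K' or '350' to float; missing values stay NaN."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip()
    if text.upper().endswith("K"):
        return float(text[:-1]) * 1000
    return float(text)


# Made-up sample standing in for the raw dataset.
games = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Release_Date": ["Feb 25, 2022", "TBD"],
    "Plays": ["21K", "350"],
    "Playing": ["4K", "12"],
    "Backlogs": ["5K", "40"],
    "Wishlist": ["5K", "30"],
    "Lists": ["5K", "25"],
    "Reviews": ["3K", "10"],
})

# 1. Drop the leftover index column.
games = games.drop(columns=["Unnamed: 0"])

# 2. Parse release dates; unparseable entries such as 'TBD' become NaT.
games["Release_Date"] = pd.to_datetime(games["Release_Date"], errors="coerce")

# 3. Convert the count columns from strings to floats.
count_columns = ["Plays", "Playing", "Backlogs", "Wishlist", "Lists", "Reviews"]
for column in count_columns:
    games[column] = games[column].map(parse_count)

print(games.dtypes)
```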

Basic analysis

Note that some games appear more than once in the top 10 list. This is because they might be released on different platforms or have different versions.

These are some of the most popular games based on the number of plays.
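One way to produce such a ranking is `DataFrame.nlargest`. The sketch below uses made-up rows and assumes a 'Title' column; it also shows how duplicated titles could be collapsed by keeping each title's highest play count, which is one possible choice rather than the method used in this tutorial.

```python
import pandas as pd

# Made-up sample; 'Plays' is assumed to be numeric already (see the cleaning step).
games = pd.DataFrame({
    "Title": ["Game A", "Game A", "Game B", "Game C"],
    "Plays": [30000.0, 25000.0, 21000.0, 350.0],
})

# Top rows by number of plays; a title can appear more than once.
top_by_plays = games.nlargest(10, "Plays")[["Title", "Plays"]]
print(top_by_plays)

# Optional: one row per title, keeping the highest play count.
top_unique = (
    games.groupby("Title", as_index=False)["Plays"]
    .max()
    .nlargest(10, "Plays")
)
print(top_unique)
```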

Next, let's look at the most popular genres. For this, we need to first clean the 'Genres' column as it contains lists represented as strings. We will then count the occurrence of each genre.
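A sketch of this step, assuming each 'Genres' entry is a string that looks like a Python list (for example "['Adventure', 'RPG']"). The sample rows are made up. `ast.literal_eval` turns each string into a real list, and `explode` gives every genre its own row so that it can be counted.

```python
import ast

import pandas as pd

# Made-up sample; the exact string format of the real column is an assumption.
games = pd.DataFrame({
    "Genres": ["['Adventure', 'Indie']", "['Adventure', 'RPG']", "[]"],
})

# Parse the string representation into actual Python lists.
genre_lists = games["Genres"].apply(ast.literal_eval)

# One row per (game, genre); empty lists turn into NaN and are dropped.
genre_counts = genre_lists.explode().dropna().value_counts()
print(genre_counts)
```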

Adventure and Indie are the most prevalent genres in the dataset, followed by RPG and Simulator.

Some graphics

Distribution of Game Ratings: This histogram shows the distribution of game ratings. Each bar represents the frequency of games that fall within a particular rating range. For example, the highest bar represents the rating range with the most games. From the graph, it appears that the ratings are roughly normally distributed around 3, with fewer games receiving very low or very high ratings.
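A histogram like the one described can be drawn with matplotlib. This is a minimal sketch on made-up ratings; the number of bins and the figure size are arbitrary choices, not the settings of the original figure.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings standing in for the 'Rating' column.
ratings = pd.Series([0.3, 2.5, 2.9, 3.0, 3.1, 3.4, 3.6, 5.0], name="Rating")

fig, ax = plt.subplots(figsize=(8, 5))
# Each bar counts the games whose rating falls inside one bin.
counts, bin_edges, _ = ax.hist(ratings.dropna(), bins=10, edgecolor="black")
ax.set_title("Distribution of Game Ratings")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of games")
```

In an interactive session, `plt.show()` displays the figure.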

Top 10 Most Popular Genres: This bar chart shows the 10 most popular genres, ranked by the number of games in each genre. The length of each bar represents the number of games in the genre. Adventure and Indie are the most popular genres, followed by RPG (Role-Playing Game) and Simulator.
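A horizontal bar chart of the top genres could be sketched as follows. The counts are made up for illustration and are not the real numbers from the dataset; `genre_counts` stands for the result of the genre-counting step.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up genre counts (illustrative values only).
genre_counts = pd.Series({"Adventure": 5, "Indie": 4, "RPG": 3, "Simulator": 2})

# Keep the 10 largest, then sort ascending so the largest bar ends up on top.
top_genres = genre_counts.nlargest(10).sort_values()

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(top_genres.index.tolist(), top_genres.values)
ax.set_title("Top 10 Most Popular Genres")
ax.set_xlabel("Number of games")
ax.set_ylabel("Genre")
```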

Number of Games Released per Year: This line graph shows the number of games released each year. The x-axis represents the year, the y-axis represents the number of games released that year, and each point on the line corresponds to a specific year. The graph shows how the number of releases has changed over time.
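The yearly counts behind such a graph can be computed by extracting the year from the release dates and counting occurrences. The sketch below uses made-up dates and assumes that games without a known release date (for example 'TBD') are simply left out.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up release dates; None stands for an unknown date such as 'TBD'.
release_dates = pd.to_datetime(
    pd.Series(["2020-01-15", "2020-06-01", "2021-03-10", None])
)

# Count games per year, ignoring missing dates, in chronological order.
games_per_year = (
    release_dates.dt.year.dropna().astype(int).value_counts().sort_index()
)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(games_per_year.index, games_per_year.values, marker="o")
ax.set_title("Number of Games Released per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Number of games")
```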