Basic data analysis with IMDB TV series dataset

Juan Andrés Cabral

In this tutorial, we'll explore a dataset of television series from IMDB. The data was obtained from kaggle.com (dataset author: Khushi Pitroda), where it is available for download.

This dataset is a rich collection of TV series information, including the series name, airing years, number of episodes, content rating, IMDB rating, an image URL, a brief description, and the URL of the IMDB page for each series. With over 250 entries, the dataset provides a comprehensive look at some of the most popular and highly rated TV series from the past few decades.

Throughout this tutorial, we will walk you through various stages of data analysis, including data cleaning, basic descriptive statistics, and visualization. We'll clean the data to ensure it's in a usable format, calculate basic statistics to understand the distribution of values, and create meaningful visualizations to uncover patterns and trends in the data.

In order to analyze this data, we will first need to clean it. This will involve:

Removing the ranking number from the 'Name' column.
Parsing the 'Year' column into start and end years.
Extracting the number of episodes from the 'Episodes' column.
Handling any missing or malformed data.
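The cleaning steps listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the exact code used for this tutorial: the raw string formats (a leading "1. " rank, years written as "2008–2013", episode counts written as "62 eps"), the toy rows, the helper name `clean_series`, and the new column names `Start_Year` and `End_Year` are all assumptions that may need adjusting to the actual file.

```python
import pandas as pd

# Toy rows in an ASSUMED raw format; the real file may be formatted differently.
raw = pd.DataFrame({
    "Name": ["1. Show A", "2. Show B", "3. Show C"],
    "Year": ["2008–2013", "2016", "2019– "],
    "Episodes": ["62 eps", "6 eps", None],
})

def clean_series(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the four cleaning steps."""
    out = df.copy()

    # 1. Remove a leading ranking such as "12. " from the name.
    out["Name"] = out["Name"].str.replace(r"^\s*\d+\.\s*", "", regex=True)

    # 2. First four-digit year is the start year; a second one, if present,
    #    is the end year. Ongoing or single-year entries get a missing end year.
    years = out["Year"].str.extract(
        r"(?P<Start_Year>\d{4})(?:\D+(?P<End_Year>\d{4}))?"
    )
    for col in ["Start_Year", "End_Year"]:
        out[col] = pd.to_numeric(years[col], errors="coerce").astype("Int64")

    # 3. Keep only the first run of digits from the episodes text.
    episodes = out["Episodes"].str.extract(r"(\d+)", expand=False)

    # 4. Missing or malformed values become <NA> instead of raising an error.
    out["Episodes"] = pd.to_numeric(episodes, errors="coerce").astype("Int64")
    return out

cleaned = clean_series(raw)
print(cleaned)
```

Using the nullable `Int64` dtype keeps years and episode counts as integers while still allowing missing values, which a plain `int` column cannot hold.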

Let's now proceed with a basic descriptive analysis of the data. We'll look at the distributions of ratings, the number of TV series per content type, and the distribution of the number of episodes. We'll also examine the distribution of TV series over years.

Basic Statistics for the Numerical Columns

Episodes:

Rating:

Start Year and End Year:
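Summary statistics for the numerical columns listed above could be produced along these lines. The sketch assumes a cleaned DataFrame with columns named `Episodes`, `Rating`, `Start_Year`, and `End_Year` (the last two names are assumptions), and the values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

# Made-up sample standing in for the cleaned data; values are illustrative only.
df = pd.DataFrame({
    "Episodes": [62, 6, 10, 236],
    "Rating": [9.5, 9.4, 9.3, 8.9],
    "Start_Year": [2008, 2016, 2019, 1994],
    "End_Year": [2013, 2016, None, 2004],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
# Missing values (such as the absent end year) are excluded from each statistic.
summary = df[["Episodes", "Rating", "Start_Year", "End_Year"]].describe()
print(summary)
```

The `count` row is a quick way to spot columns with missing values, since it only counts non-missing entries.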

Counts for Each Content Type:

The most common content rating is 'TV-MA', which stands for "Mature Audience". The least common ratings are 'TV-Y', 'PG-13', and 'TV-Y7', each with only one TV series.
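Counts like these are typically obtained with `value_counts`. The following sketch uses a small made-up Series, and the column name `Content_Type` is an assumption, so the numbers do not reflect the real dataset.

```python
import pandas as pd

# Made-up content ratings; the column name "Content_Type" is an assumption.
content_type = pd.Series(
    ["TV-MA", "TV-MA", "TV-14", "TV-MA", "TV-PG", "TV-14", "TV-Y"],
    name="Content_Type",
)

# Count how many series carry each rating, most frequent first.
counts = content_type.value_counts()

# The most common rating, and every rating tied for the lowest count.
most_common = counts.idxmax()
rarest = counts[counts == counts.min()].index.tolist()

print(counts)
print("Most common:", most_common)
print("Least common:", rarest)
```

Filtering on `counts.min()` rather than taking only the last row keeps all ratings that share the lowest count, which matters when several ratings appear just once.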

Visualizations Overview

Distribution of Ratings

This histogram shows that the majority of TV series have ratings between approximately 8.6 and 8.8. The distribution is roughly normal, but slightly skewed to the left.
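A histogram of this kind, together with a numerical check of the skew, might be produced as sketched below. The ratings are made up for illustration and the bin count is an arbitrary choice. A negative sample skewness corresponds to a longer tail toward lower ratings, which is what "skewed to the left" means.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings; the real analysis would use the dataset's rating column.
ratings = pd.Series([8.0, 8.5, 8.6, 8.7, 8.7, 8.7, 8.8, 8.8], name="Rating")

# Draw the histogram; the number of bins is a judgment call.
fig, ax = plt.subplots()
n, bin_edges, patches = ax.hist(ratings, bins=8, edgecolor="black")
ax.set_xlabel("IMDB rating")
ax.set_ylabel("Number of TV series")
ax.set_title("Distribution of Ratings")

# Sample skewness: negative means a longer left tail, positive a longer right tail.
skewness = ratings.skew()
print(f"Skewness: {skewness:.2f}")

# In a notebook the figure renders automatically; in a script, call plt.show().
```

Reporting the skewness alongside the plot makes the visual impression checkable, since the apparent shape of a histogram can change with the choice of bins.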

Number of TV Series per Content Type

This bar plot shows the count of TV series for each content type. The most common content type is 'TV-MA', followed by 'TV-14' and 'TV-PG'. The least common types are 'TV-Y', 'PG-13', and 'TV-Y7'.
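A bar plot like the one described can be drawn from precomputed counts. In this sketch the counts are made up and the labels are chosen only for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts per content type; real values come from value_counts() on the data.
counts = pd.Series({"TV-14": 2, "TV-MA": 3, "TV-PG": 1})

# Sort so the most common content type appears first on the x-axis.
counts = counts.sort_values(ascending=False)

fig, ax = plt.subplots()
bars = ax.bar(counts.index, counts.values)
ax.set_xlabel("Content type")
ax.set_ylabel("Number of TV series")
ax.set_title("Number of TV Series per Content Type")

# In a notebook the figure renders automatically; in a script, call plt.show().
```

Sorting before plotting makes the ranking of content types readable at a glance.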

Distribution of Number of Episodes

This histogram shows that most TV series have a relatively small number of episodes, with a sharp drop-off as the number of episodes increases. There are a few series with a very large number of episodes, as evidenced by the long tail of the distribution. This is characteristic of a positively skewed distribution.
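Positive skew can be checked numerically: a few very long-running series pull the mean well above the median. The episode counts below are made up purely to illustrate the pattern.

```python
import pandas as pd

# Made-up episode counts with a long right tail; not taken from the dataset.
episodes = pd.Series([6, 8, 10, 10, 12, 13, 24, 62, 236, 700], name="Episodes")

mean = episodes.mean()      # sensitive to the few very large values
median = episodes.median()  # robust to them
skewness = episodes.skew()  # positive for a long right tail

print(f"mean={mean:.1f}, median={median:.1f}, skewness={skewness:.2f}")
```

For a distribution like this, the median is usually the more representative "typical" value, and plotting the histogram on a logarithmic x-axis can make the bulk of the data easier to see.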

Distribution of TV Series over Years

This histogram shows the number of TV series that started airing in each year. The number of series has generally increased over time, with a particularly noticeable rise starting around 2000.
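The per-year counts behind such a histogram can be computed directly, and grouping by decade often makes the trend easier to read. The start years below are made up, and the column name `Start_Year` is an assumption.

```python
import pandas as pd

# Made-up start years; the column name "Start_Year" is an assumption.
start_year = pd.Series(
    [1994, 1999, 2005, 2008, 2008, 2011, 2016, 2016, 2019],
    name="Start_Year",
)

# Number of series starting in each year, in chronological order.
per_year = start_year.value_counts().sort_index()

# Floor each year to its decade (2008 -> 2000) and count again.
per_decade = (start_year // 10 * 10).value_counts().sort_index()

print(per_year)
print(per_decade)
```

Calling `sort_index()` matters here because `value_counts` orders by frequency by default, which would scramble the timeline.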