In this tutorial, we'll explore a dataset of television series from IMDB. The dataset was published on kaggle.com by Khushi Pitroda, where it is available for download.
This dataset is a rich collection of TV series information, including the series name, airing years, number of episodes, content rating, IMDB rating, an image URL, a brief description, and the URL of the IMDB page for each series. With 250 entries, the dataset provides a broad look at some of the most popular and highly rated TV series from the past few decades.
Throughout this tutorial, we will walk you through various stages of data analysis, including data cleaning, basic descriptive statistics, and visualization. We'll clean the data to ensure it's in a usable format, calculate basic statistics to understand the distribution of values, and create meaningful visualizations to uncover patterns and trends in the data.
import pandas as pd
# Load the dataset (adjust the path to wherever you saved IMDB.csv)
df = pd.read_csv('C:/Users/juanc/Downloads/archive/IMDB.csv')
# Display the first few rows of the dataset
df.head()
| | Name | Year | Episodes | Type | Rating | Image-src | Description | Name-href |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Breaking Bad | 2008–2013 | 62 eps | TV-MA | 9.5 | https://m.media-amazon.com/images/M/MV5BYmQ4YW... | A chemistry teacher diagnosed with inoperable ... | https://www.imdb.com/title/tt0903747/?ref_=cht... |
| 1 | 2. Planet Earth II | 2016 | 6 eps | TV-G | 9.5 | https://m.media-amazon.com/images/M/MV5BMGZmYm... | David Attenborough returns with a new wildlife... | https://www.imdb.com/title/tt5491994/?ref_=cht... |
| 2 | 3. Planet Earth | 2006 | 11 eps | TV-PG | 9.4 | https://m.media-amazon.com/images/M/MV5BMzMyYj... | A documentary series on the wildlife found on ... | https://www.imdb.com/title/tt0795176/?ref_=cht... |
| 3 | 4. Band of Brothers | 2001 | 10 eps | TV-MA | 9.4 | https://m.media-amazon.com/images/M/MV5BMTI3OD... | The story of Easy Company of the U.S. Army 101... | https://www.imdb.com/title/tt0185906/?ref_=cht... |
| 4 | 5. Chernobyl | 2019 | 5 eps | TV-MA | 9.4 | https://m.media-amazon.com/images/M/MV5BNTdkN2... | In April 1986, an explosion at the Chernobyl n... | https://www.imdb.com/title/tt7366338/?ref_=cht... |
In order to analyze this data, we will first need to clean it. This will involve:
- Removing the ranking number from the 'Name' column.
- Parsing the 'Year' column into start and end years.
- Extracting the number of episodes from the 'Episodes' column.
- Handling any missing or malformed data.
# Data Cleaning
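# Remove the leading rank (e.g. '1. ') from the 'Name' column
# n=1 splits only at the first period, so titles that contain periods stay intact
df['Name'] = df['Name'].str.split('.', n=1).str[1].str.strip()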
# Parse 'Year' column into start and end years
df['Start Year'] = df['Year'].str.split('–').str[0].astype(int)
df['End Year'] = df['Year'].str.split('–').str[1]
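# Replace empty strings in 'End Year' (produced when nothing follows the dash) with a missing value
df['End Year'] = df['End Year'].replace('', pd.NA)
# Fill missing 'End Year' values with 'Start Year', so single-year entries start and end in the same year
df['End Year'] = df['End Year'].fillna(df['Start Year']).astype(int)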
# Extract number of episodes from 'Episodes' column
df['Episodes'] = df['Episodes'].str.split(' ').str[0].astype(int)
# Display the cleaned dataset
df.head()
| | Name | Year | Episodes | Type | Rating | Image-src | Description | Name-href | Start Year | End Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Breaking Bad | 2008–2013 | 62 | TV-MA | 9.5 | https://m.media-amazon.com/images/M/MV5BYmQ4YW... | A chemistry teacher diagnosed with inoperable ... | https://www.imdb.com/title/tt0903747/?ref_=cht... | 2008 | 2013 |
| 1 | Planet Earth II | 2016 | 6 | TV-G | 9.5 | https://m.media-amazon.com/images/M/MV5BMGZmYm... | David Attenborough returns with a new wildlife... | https://www.imdb.com/title/tt5491994/?ref_=cht... | 2016 | 2016 |
| 2 | Planet Earth | 2006 | 11 | TV-PG | 9.4 | https://m.media-amazon.com/images/M/MV5BMzMyYj... | A documentary series on the wildlife found on ... | https://www.imdb.com/title/tt0795176/?ref_=cht... | 2006 | 2006 |
| 3 | Band of Brothers | 2001 | 10 | TV-MA | 9.4 | https://m.media-amazon.com/images/M/MV5BMTI3OD... | The story of Easy Company of the U.S. Army 101... | https://www.imdb.com/title/tt0185906/?ref_=cht... | 2001 | 2001 |
| 4 | Chernobyl | 2019 | 5 | TV-MA | 9.4 | https://m.media-amazon.com/images/M/MV5BNTdkN2... | In April 1986, an explosion at the Chernobyl n... | https://www.imdb.com/title/tt7366338/?ref_=cht... | 2019 | 2019 |
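The cleaning code above assumes that every row follows the same pattern, so the last step in the list (handling missing or malformed data) is not covered explicitly. A more defensive variant might look like the following minimal sketch. The sample rows and the helper name `clean_series_frame` are made up for illustration, and, unlike the tutorial's code, this version leaves a missing end year as `<NA>` instead of filling it with the start year.

```python
import pandas as pd

# Made-up sample rows that mimic the raw format, plus a few edge cases.
raw = pd.DataFrame({
    "Name": ["1. Breaking Bad", "2. Mr. Robot", "3. Ongoing Show"],
    "Year": ["2008–2013", "2015–2019", "2021–"],
    "Episodes": ["62 eps", "45 eps", None],
})


def clean_series_frame(frame):
    """Return a cleaned copy; values that cannot be parsed become <NA>."""
    out = frame.copy()
    # Strip only the leading "N. " rank, so periods inside titles survive.
    out["Name"] = out["Name"].str.replace(r"^\s*\d+\.\s*", "", regex=True)
    # Capture a four-digit start year and an optional four-digit end year.
    years = out["Year"].str.extract(r"(\d{4})(?:\D+(\d{4}))?")
    out["Start Year"] = pd.to_numeric(years[0], errors="coerce").astype("Int64")
    out["End Year"] = pd.to_numeric(years[1], errors="coerce").astype("Int64")
    # Keep the leading digits of the episode count; missing values stay <NA>.
    episodes = out["Episodes"].str.extract(r"(\d+)")[0]
    out["Episodes"] = pd.to_numeric(episodes, errors="coerce").astype("Int64")
    return out


cleaned = clean_series_frame(raw)
print(cleaned)
```

Using the nullable `Int64` dtype keeps the columns integer-valued even when some entries are missing, so problem rows can be inspected with `cleaned[cleaned["End Year"].isna()]` before deciding how to fill them.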
Let's now proceed with a basic descriptive analysis of the data. We'll look at the distributions of ratings, the number of TV series per content type, and the distribution of the number of episodes. We'll also examine the distribution of TV series over years.
# Basic Descriptive Analysis
# Statistics for numerical columns
numerical_stats = df[['Episodes', 'Rating', 'Start Year', 'End Year']].describe()
# Number of TV series per content type
type_counts = df['Type'].value_counts()
numerical_stats, type_counts
| | Episodes | Rating | Start Year | End Year |
|---|---|---|---|---|
| count | 250.000000 | 250.000000 | 250.000000 | 250.000000 |
| mean | 73.328000 | 8.762400 | 2006.996000 | 2010.800000 |
| std | 112.606631 | 0.230475 | 12.519703 | 12.173115 |
| min | 2.000000 | 8.400000 | 1955.000000 | 1962.000000 |
| 25% | 14.250000 | 8.600000 | 2001.000000 | 2004.250000 |
| 50% | 36.000000 | 8.700000 | 2010.000000 | 2015.000000 |
| 75% | 78.000000 | 8.900000 | 2016.000000 | 2019.750000 |
| max | 1076.000000 | 9.500000 | 2023.000000 | 2024.000000 |

| Type | Count |
|---|---|
| TV-MA | 108 |
| TV-14 | 77 |
| TV-PG | 47 |
| TV-G | 5 |
| Not Rated | 3 |
| TV-Y7-FV | 2 |
| TV-Y | 1 |
| PG-13 | 1 |
| TV-Y7 | 1 |
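The most common content rating is 'TV-MA', which stands for "Mature Audience", with 108 series. The least common ratings are 'TV-Y', 'PG-13', and 'TV-Y7', each with only one TV series. The counts add up to 245 rather than 250, which suggests that five series have no content rating recorded, since value_counts() skips missing values by default.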
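Raw counts are easier to compare once they are expressed as shares. The sketch below rebuilds the counts from the output above as a standalone Series (so it runs without the CSV file) and converts them to percentages. The variable name `type_share` is an illustrative choice.

```python
import pandas as pd

# Counts copied from the value_counts() output shown above.
type_counts = pd.Series(
    {"TV-MA": 108, "TV-14": 77, "TV-PG": 47, "TV-G": 5, "Not Rated": 3,
     "TV-Y7-FV": 2, "TV-Y": 1, "PG-13": 1, "TV-Y7": 1},
    name="Type",
)

# Share of each content rating, as a percentage of the series that have one.
type_share = (type_counts / type_counts.sum() * 100).round(1)
print(type_share)
```

With the full DataFrame loaded, `df['Type'].value_counts(normalize=True)` should give the same proportions directly, again computed over the rows that have a rating.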
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style of seaborn
sns.set(style="whitegrid")
# Create subplots
fig, ax = plt.subplots(2, 2, figsize=(18, 12))
# Plot distribution of ratings
sns.histplot(df['Rating'], kde=True, ax=ax[0, 0], color='skyblue', bins=20)
ax[0, 0].set_title('Distribution of Ratings')
# Plot number of TV series per content type
sns.countplot(y='Type', data=df, ax=ax[0, 1], order=df['Type'].value_counts().index, palette='viridis')
ax[0, 1].set_title('Number of TV Series per Content Type')
# Plot distribution of the number of episodes
sns.histplot(df['Episodes'], kde=False, ax=ax[1, 0], color='skyblue', bins=50)
ax[1, 0].set_title('Distribution of Number of Episodes')
# Plot distribution of TV series over years
sns.histplot(df['Start Year'], kde=False, ax=ax[1, 1], color='skyblue', bins=30)
ax[1, 1].set_title('Distribution of TV Series over Years')
# Adjust the layout
plt.tight_layout()
plt.show()
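The ratings histogram (top left) shows that most TV series have ratings between approximately 8.6 and 8.9, the range that holds the middle half of the data. The distribution is skewed to the right: the mean (8.76) sits above the median (8.7), and a thin tail reaches up to the maximum rating of 9.5.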
The bar plot (top right) shows the count of TV series for each content rating in the 'Type' column. The most common rating is 'TV-MA', followed by 'TV-14' and 'TV-PG'. The least common are 'TV-Y', 'PG-13', and 'TV-Y7'.
The episodes histogram (bottom left) shows that most TV series have a relatively small number of episodes: half of them have 36 or fewer. The count drops off sharply as the number of episodes increases, and a few series with a very large number of episodes (up to 1,076) produce the long right tail that is characteristic of a positively skewed distribution.
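One way to put a number on this kind of skew is to compare the mean with the median and to compute the sample skewness. The sketch below uses a handful of made-up episode counts rather than the real column. With the tutorial's data, the same calls would be applied to `df['Episodes']`.

```python
import pandas as pd

# Made-up episode counts: mostly short runs plus one very long-running show.
episodes = pd.Series([6, 10, 12, 24, 36, 62, 78, 1076], name="Episodes")

mean = episodes.mean()
median = episodes.median()
# Sample skewness; a positive value indicates a longer right tail.
skewness = episodes.skew()

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")
```

The summary statistics shown earlier follow the same pattern: the mean number of episodes (73.3) is roughly twice the median (36), which is what a long right tail produces.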
The start-year histogram (bottom right) shows how many TV series started airing in each period, with the start years from 1955 to 2023 grouped into 30 bins. The number of series has generally increased over time, with a particularly noticeable increase starting around the year 2000. Half of the series started in 2010 or later.
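The same trend can be read off as a table by grouping start years into decades. The following is a minimal sketch on made-up start years that stand in for `df['Start Year']`. The variable names are illustrative.

```python
import pandas as pd

# Made-up start years standing in for df['Start Year'].
start_years = pd.Series([1955, 1989, 1999, 2001, 2008, 2010, 2016, 2016, 2019, 2023])

# Integer division maps each year to the first year of its decade.
decades = (start_years // 10) * 10

# Count series per decade and list the decades in chronological order.
series_per_decade = decades.value_counts().sort_index()
print(series_per_decade)
```

Decades with no series simply do not appear in the result, which is worth keeping in mind when comparing it with the histogram, where empty bins are still drawn.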