In this tutorial, we'll explore a dataset of television series from IMDb. The data was obtained from kaggle.com (author: Khushi Pitroda), where the dataset is available for download.

This dataset contains information on TV series, including the series name, airing years, number of episodes, content rating, IMDb rating, an image URL, a brief description, and the URL of the IMDb page for each series. With 250 entries, it covers some of the most popular and highly rated TV series from the past few decades.

Throughout this tutorial, we will walk you through various stages of data analysis, including data cleaning, basic descriptive statistics, and visualization. We'll clean the data to ensure it's in a usable format, calculate basic statistics to understand the distribution of values, and create meaningful visualizations to uncover patterns and trends in the data.

In [12]:

import pandas as pd

# Load the dataset
df = pd.read_csv('C:/Users/juanc/Downloads/archive/IMDB.csv')

# Display the first few rows of the dataset
df.head()

Out[12]:

	Name	Year	Episodes	Type	Rating	Image-src	Description	Name-href
0	1. Breaking Bad	2008–2013	62 eps	TV-MA	9.5	https://m.media-amazon.com/images/M/MV5BYmQ4YW...	A chemistry teacher diagnosed with inoperable ...	https://www.imdb.com/title/tt0903747/?ref_=cht...
1	2. Planet Earth II	2016	6 eps	TV-G	9.5	https://m.media-amazon.com/images/M/MV5BMGZmYm...	David Attenborough returns with a new wildlife...	https://www.imdb.com/title/tt5491994/?ref_=cht...
2	3. Planet Earth	2006	11 eps	TV-PG	9.4	https://m.media-amazon.com/images/M/MV5BMzMyYj...	A documentary series on the wildlife found on ...	https://www.imdb.com/title/tt0795176/?ref_=cht...
3	4. Band of Brothers	2001	10 eps	TV-MA	9.4	https://m.media-amazon.com/images/M/MV5BMTI3OD...	The story of Easy Company of the U.S. Army 101...	https://www.imdb.com/title/tt0185906/?ref_=cht...
4	5. Chernobyl	2019	5 eps	TV-MA	9.4	https://m.media-amazon.com/images/M/MV5BNTdkN2...	In April 1986, an explosion at the Chernobyl n...	https://www.imdb.com/title/tt7366338/?ref_=cht...

In order to analyze this data, we will first need to clean it. This will involve:

Removing the ranking number from the 'Name' column.
Parsing the 'Year' column into start and end years.
Extracting the number of episodes from the 'Episodes' column.
Handling any missing or malformed data.

In [13]:

# Data Cleaning

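# Remove the leading ranking (e.g. "1. ") from the 'Name' column.
# Split only on the first period so titles that contain periods stay intact.
df['Name'] = df['Name'].str.split('.', n=1).str[1].str.strip()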

# Parse 'Year' column into start and end years
df['Start Year'] = df['Year'].str.split('–').str[0].astype(int)
df['End Year'] = df['Year'].str.split('–').str[1]

# Replace empty strings in 'End Year' with a missing value
# (pd.NA is the generic missing marker; pd.NaT is meant for datetimes)
df['End Year'] = df['End Year'].replace('', pd.NA)

# Fill missing 'End Year' values with 'Start Year'
df['End Year'] = df['End Year'].fillna(df['Start Year']).astype(int)

# Extract number of episodes from 'Episodes' column
df['Episodes'] = df['Episodes'].str.split(' ').str[0].astype(int)

# Display the cleaned dataset
df.head()

Out[13]:

	Name	Year	Episodes	Type	Rating	Image-src	Description	Name-href	Start Year	End Year
0	Breaking Bad	2008–2013	62	TV-MA	9.5	https://m.media-amazon.com/images/M/MV5BYmQ4YW...	A chemistry teacher diagnosed with inoperable ...	https://www.imdb.com/title/tt0903747/?ref_=cht...	2008	2013
1	Planet Earth II	2016	6	TV-G	9.5	https://m.media-amazon.com/images/M/MV5BMGZmYm...	David Attenborough returns with a new wildlife...	https://www.imdb.com/title/tt5491994/?ref_=cht...	2016	2016
2	Planet Earth	2006	11	TV-PG	9.4	https://m.media-amazon.com/images/M/MV5BMzMyYj...	A documentary series on the wildlife found on ...	https://www.imdb.com/title/tt0795176/?ref_=cht...	2006	2006
3	Band of Brothers	2001	10	TV-MA	9.4	https://m.media-amazon.com/images/M/MV5BMTI3OD...	The story of Easy Company of the U.S. Army 101...	https://www.imdb.com/title/tt0185906/?ref_=cht...	2001	2001
4	Chernobyl	2019	5	TV-MA	9.4	https://m.media-amazon.com/images/M/MV5BNTdkN2...	In April 1986, an explosion at the Chernobyl n...	https://www.imdb.com/title/tt7366338/?ref_=cht...	2019	2019
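The cleaning cell above assumes every row follows the expected pattern, and it does not yet cover the last item on the list (missing or malformed data). A more defensive variant is sketched below. It is an illustration rather than the method used in this tutorial: the helper name `clean_series` and the three sample rows (including the dotted title, the open-ended year range, and the missing episode count) are hypothetical. It relies on regular expressions and `pd.to_numeric(errors='coerce')` so that malformed values become missing values instead of raising an error.

```python
import pandas as pd


def clean_series(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: a defensive version of the cleaning steps."""
    out = raw.copy()

    # Strip only the leading "<rank>. " prefix, so periods inside titles survive
    out['Name'] = out['Name'].str.replace(r'^\s*\d+\.\s*', '', regex=True)

    # Extract the first four-digit year and, if present, a second one
    years = out['Year'].str.extract(r'(?P<start>\d{4})(?:\D+(?P<end>\d{4}))?')
    out['Start Year'] = pd.to_numeric(years['start'], errors='coerce').astype('Int64')
    end = pd.to_numeric(years['end'], errors='coerce').astype('Int64')

    # Same convention as the cell above: a missing end year falls back to the start year
    out['End Year'] = end.fillna(out['Start Year'])

    # Keep the first run of digits in 'Episodes'; anything unparsable becomes <NA>
    episodes = out['Episodes'].str.extract(r'(\d+)')[0]
    out['Episodes'] = pd.to_numeric(episodes, errors='coerce').astype('Int64')
    return out


# Hypothetical sample rows, not taken from the dataset
raw = pd.DataFrame({
    'Name': ['1. Breaking Bad', '2. Mr. Example', '3. Ongoing Show'],
    'Year': ['2008–2013', '2016', '2020– '],
    'Episodes': ['62 eps', '6 eps', None],
})

cleaned = clean_series(raw)
print(cleaned)
print("Rows with missing values:", int(cleaned.isna().any(axis=1).sum()))
```

The nullable `Int64` dtype is used here so that integer columns can hold missing values; whether to drop, fill, or keep such rows depends on the analysis.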

In [15]:

# Basic Descriptive Analysis

# Statistics for numerical columns
numerical_stats = df[['Episodes', 'Rating', 'Start Year', 'End Year']].describe()

# Number of TV series per content type
type_counts = df['Type'].value_counts()

numerical_stats, type_counts

Out[15]:

(          Episodes      Rating   Start Year     End Year
 count   250.000000  250.000000   250.000000   250.000000
 mean     73.328000    8.762400  2006.996000  2010.800000
 std     112.606631    0.230475    12.519703    12.173115
 min       2.000000    8.400000  1955.000000  1962.000000
 25%      14.250000    8.600000  2001.000000  2004.250000
 50%      36.000000    8.700000  2010.000000  2015.000000
 75%      78.000000    8.900000  2016.000000  2019.750000
 max    1076.000000    9.500000  2023.000000  2024.000000,
 TV-MA        108
 TV-14         77
 TV-PG         47
 TV-G           5
 Not Rated      3
 TV-Y7-FV       2
 TV-Y           1
 PG-13          1
 TV-Y7          1
 Name: Type, dtype: int64)

Basic Statistics for the Numerical Columns

Episodes:

Count: 250 TV series
Mean: On average, a TV series has about 73 episodes.
Standard Deviation: The number of episodes varies widely, with a standard deviation of about 113.
Minimum: The TV series with the fewest episodes has only 2 episodes.
25th Percentile (Q1): 25% of TV series have 14 or fewer episodes.
Median (Q2 / 50th Percentile): The median number of episodes for a TV series is 36.
75th Percentile (Q3): 75% of TV series have 78 or fewer episodes.
Maximum: The TV series with the most episodes has 1076 episodes.
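The gap between the mean (about 73) and the median (36) suggests that a few very long-running series pull the average up. One common rule of thumb for flagging such values, which this tutorial does not itself apply, is Tukey's 1.5 × IQR fence. The sketch below applies it to the quartiles reported in the `describe()` output; only the variable names are new.

```python
# Quartiles and maximum for 'Episodes', copied from the describe() output above
q1, q3 = 14.25, 78.0
max_episodes = 1076

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this would be flagged as low outliers
upper_fence = q3 + 1.5 * iqr   # values above this would be flagged as high outliers

print(f"IQR: {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print("Maximum lies above the upper fence:", max_episodes > upper_fence)
```

With the full DataFrame, the same fence could be used to list the affected series, for example with `df[df['Episodes'] > upper_fence]`.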

Rating:

Count: 250 ratings
Mean: The average rating is approximately 8.76.
Standard Deviation: The ratings have a relatively small standard deviation of about 0.23, indicating that the ratings are quite tightly clustered.
Minimum: The lowest rating is 8.4.
25th Percentile (Q1): 25% of TV series have a rating of 8.6 or lower.
Median (Q2 / 50th Percentile): The median rating is 8.7.
75th Percentile (Q3): 75% of TV series have a rating of 8.9 or lower.
Maximum: The highest rating is 9.5.

Start Year and End Year:

Start years in this dataset range from 1955 to 2023, and end years from 1962 to 2024.
The median start year is 2010, and the median end year is 2015.

Counts for Each Content Type:

TV-MA: 108
TV-14: 77
TV-PG: 47
TV-G: 5
Not Rated: 3
TV-Y7-FV: 2
TV-Y: 1
PG-13: 1
TV-Y7: 1

The most common content rating is 'TV-MA', which stands for "Mature Audience Only". The least common ratings are 'TV-Y', 'PG-13', and 'TV-Y7', each with only one TV series.
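Raw counts are easier to compare as percentages. The sketch below rebuilds the counts listed above as a pandas Series (the variable names are illustrative) and converts them to shares. It also adds the counts up: they total 245 rather than 250, which suggests that five rows have no value in 'Type', since `value_counts()` skips missing values by default.

```python
import pandas as pd

# Counts copied from the value_counts() output above
type_counts = pd.Series({
    'TV-MA': 108, 'TV-14': 77, 'TV-PG': 47, 'TV-G': 5, 'Not Rated': 3,
    'TV-Y7-FV': 2, 'TV-Y': 1, 'PG-13': 1, 'TV-Y7': 1,
}, name='Type')

total_rows = 250                        # row count reported by describe()
rated_rows = int(type_counts.sum())     # rows that have a content type
missing = total_rows - rated_rows       # rows presumably missing a content type

# Share of each content type among the rows that have one, in percent
shares = (type_counts / rated_rows * 100).round(1)

print(shares)
print(f"Rows without a content type: {missing}")
```

On the DataFrame itself, `df['Type'].value_counts(normalize=True, dropna=False)` would give the proportions directly, with missing values counted as their own category.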

In [16]:

import matplotlib.pyplot as plt
import seaborn as sns

# Set the style of seaborn
sns.set(style="whitegrid")

# Create subplots
fig, ax = plt.subplots(2, 2, figsize=(18, 12))

# Plot distribution of ratings
sns.histplot(df['Rating'], kde=True, ax=ax[0, 0], color='skyblue', bins=20)
ax[0, 0].set_title('Distribution of Ratings')

# Plot number of TV series per content type
sns.countplot(y='Type', data=df, ax=ax[0, 1], order=df['Type'].value_counts().index, palette='viridis')
ax[0, 1].set_title('Number of TV Series per Content Type')

# Plot distribution of the number of episodes
sns.histplot(df['Episodes'], kde=False, ax=ax[1, 0], color='skyblue', bins=50)
ax[1, 0].set_title('Distribution of Number of Episodes')

# Plot distribution of TV series over years
sns.histplot(df['Start Year'], kde=False, ax=ax[1, 1], color='skyblue', bins=30)
ax[1, 1].set_title('Distribution of TV Series over Years')

# Adjust the layout
plt.tight_layout()
plt.show()

Visualizations Overview

Distribution of Ratings

This histogram shows that the majority of TV series have ratings between approximately 8.6 and 8.8. The distribution has a single peak and is skewed to the right: the mean (8.76) lies above the median (8.7), and the tail extends up to the maximum of 9.5.
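The direction of the skew can be checked roughly from the summary statistics alone. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, with the 'Rating' values from the `describe()` output. It is an approximation chosen for illustration, not a calculation from the original notebook.

```python
# Summary statistics for 'Rating', copied from the describe() output above
mean, median, std = 8.7624, 8.7, 0.230475

# Pearson's second skewness coefficient: positive means a longer right tail
skew_estimate = 3 * (mean - median) / std

direction = "right (positive)" if skew_estimate > 0 else "left (negative) or none"
print(f"Approximate skewness: {skew_estimate:.2f} -> skew to the {direction}")
```

With the full data, `df['Rating'].skew()` computes the sample skewness directly and is the better choice.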

Number of TV Series per Content Type

This bar plot shows the count of TV series for each content type. The most common content type is 'TV-MA', followed by 'TV-14' and 'TV-PG'. The least common types are 'TV-Y', 'PG-13', and 'TV-Y7'.

Distribution of Number of Episodes

This histogram shows that most TV series have a relatively small number of episodes, with a sharp drop-off as the number of episodes increases. There are a few series with a very large number of episodes, as evidenced by the long tail of the distribution. This is characteristic of a positively skewed distribution.
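With a tail this long, equal-width bins squeeze most series into the first few bars. One option, shown here as an assumption rather than something the tutorial does, is to use bins whose width grows by powers of ten. The sample below uses only episode counts that appear in the outputs above (the five displayed rows plus the reported minimum and maximum), so it illustrates the binning, not the full distribution.

```python
import numpy as np

# Episode counts seen in this tutorial's outputs: five sample rows, the min and the max
episodes = np.array([62, 6, 11, 10, 5, 2, 1076])

# Bin edges that grow by a factor of ten: [1, 10), [10, 100), [100, 1000), [1000, 10000]
edges = [1, 10, 100, 1000, 10000]
counts, _ = np.histogram(episodes, bins=edges)

for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low:>5} to {high:<6} {count}")
```

For the plot itself, seaborn's `histplot` accepts `log_scale=True`, which gives a similar effect on the x-axis.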

Distribution of TV Series over Years

This histogram shows how the series' start years are distributed, grouped into 30 bins. The number of series has generally increased over time, with a particularly noticeable rise starting around the year 2000.
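The same trend can be read as a table by grouping start years into decades. The sketch below uses a small hypothetical sample of start years (not the real column) to show the integer-division trick; with the cleaned data, `df['Start Year']` would take the place of `start_years`.

```python
import pandas as pd

# Hypothetical start years for illustration; use df['Start Year'] with the real data
start_years = pd.Series([1955, 1989, 1999, 2001, 2006, 2008, 2016, 2019, 2023])

# Integer-divide by 10 and multiply back to get the decade each year falls in
decade = (start_years // 10) * 10

# Count series per decade, ordered chronologically
per_decade = decade.value_counts().sort_index()
print(per_decade)
```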

Basic data analysis with Spotify song attributes dataset

Juan Andrés Cabral