Introduction

Pandas, a data manipulation library for Python, provides methods for detecting and handling missing data. In this tutorial, we will cover the isnull, notnull, dropna, and fillna methods.

Creating a DataFrame with Missing Data

First, let’s create a sample DataFrame that contains some missing data.

# Import Pandas library
import pandas as pd
import numpy as np

# Create a DataFrame with missing data
data = {
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, None, 35, 40, None],
    'City': ['New York', 'Los Angeles', 'Boston', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)
df
##     Name   Age         City
## 0  Alice  25.0     New York
## 1    Bob   NaN  Los Angeles
## 2   None  35.0       Boston
## 3  David  40.0      Houston
## 4    Eve   NaN      Phoenix

In Pandas, NaN (which stands for “Not a Number”) is the standard missing data marker used for floating-point numbers, while None is the Pythonic way to represent the absence of a value.

When you insert None into a column of data type float, Pandas will convert it to NaN. However, if you insert None into an object data type column (like strings), Pandas will leave it as None.
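
To see this conversion in isolation, here is a small sketch using two throwaway Series (s_float and s_obj are names made up for this illustration). The object dtype is requested explicitly, on the assumption that the default dtype for string data may differ between Pandas versions.

```python
import pandas as pd

# None in numeric data is converted to NaN, and the Series is stored as float64
s_float = pd.Series([1.0, None, 3.0])
print(s_float.dtype)
print(s_float[1])

# With an explicit object dtype, None is stored unchanged
s_obj = pd.Series(['a', None, 'c'], dtype=object)
print(s_obj[1] is None)

# Both markers are reported as missing
print(s_float.isnull().tolist())
print(s_obj.isnull().tolist())
```

Either way, the detection methods covered next treat NaN and None the same.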

Detecting Missing Data

Using isnull

The isnull method returns a DataFrame where each entry is a boolean value that indicates whether the corresponding data point is missing.

# Detect missing values using isnull
missing_data = df.isnull()
missing_data
##     Name    Age   City
## 0  False  False  False
## 1  False   True  False
## 2   True  False  False
## 3  False  False  False
## 4  False   True  False
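
A common follow-up is to count the missing values rather than inspect the full boolean table. The sketch below rebuilds the sample DataFrame so that it runs on its own; missing_per_column and total_missing are illustrative names, not part of the original example.

```python
import pandas as pd

# Rebuild the sample DataFrame from the start of the tutorial
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, None, 35, 40, None],
    'City': ['New York', 'Los Angeles', 'Boston', 'Houston', 'Phoenix']
})

# True counts as 1 when summed, so this gives the missing count per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Summing works because booleans are treated as 0 and 1, which makes this a quick first check on a new dataset.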

Using notnull

The notnull method works in the opposite way to isnull. It returns True where data is not missing.

# Detect non-missing values using notnull
non_missing_data = df.notnull()
non_missing_data
##     Name    Age  City
## 0   True   True  True
## 1   True  False  True
## 2  False   True  True
## 3   True   True  True
## 4   True  False  True
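
The boolean result of notnull can also be used as a row filter. As a minimal sketch, assuming we only care about rows where Age is present (rows_with_age is an illustrative name):

```python
import pandas as pd

# Rebuild the sample DataFrame from the start of the tutorial
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, None, 35, 40, None],
    'City': ['New York', 'Los Angeles', 'Boston', 'Houston', 'Phoenix']
})

# Keep only the rows where Age is not missing
rows_with_age = df[df['Age'].notnull()]
print(rows_with_age)
```

The row with a missing Name is kept here, because the filter only looks at the Age column.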

Handling Missing Data

Using dropna

The dropna method returns a new DataFrame with the rows (the default) or columns that contain missing data removed. The original DataFrame is left unchanged.

# Drop rows with missing data
dropped_rows = df.dropna()
dropped_rows
##     Name   Age      City
## 0  Alice  25.0  New York
## 3  David  40.0   Houston

# Drop columns with missing data
dropped_columns = df.dropna(axis=1)
dropped_columns
##           City
## 0     New York
## 1  Los Angeles
## 2       Boston
## 3      Houston
## 4      Phoenix
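
Dropping every row or column with any missing value can discard more data than intended. The sketch below shows three dropna parameters that narrow the rule: subset, how, and thresh. The variable names are made up for this illustration.

```python
import pandas as pd

# Rebuild the sample DataFrame from the start of the tutorial
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, None, 35, 40, None],
    'City': ['New York', 'Los Angeles', 'Boston', 'Houston', 'Phoenix']
})

# Only consider the Age column when deciding which rows to drop
age_known = df.dropna(subset=['Age'])

# Drop a row only if every value in it is missing (none are, in this data)
not_all_missing = df.dropna(how='all')

# Keep rows that have at least 3 non-missing values
mostly_complete = df.dropna(thresh=3)

print(age_known)
print(not_all_missing)
print(mostly_complete)
```

With three columns, thresh=3 keeps only fully populated rows, so here it matches the plain dropna() call above.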

Using fillna

The fillna method replaces missing data with a value you supply, either a fixed constant or a computed statistic such as the column mean.

# Fill missing data with a specific value
filled_data = df.fillna("Unknown")
filled_data
##       Name      Age         City
## 0    Alice     25.0     New York
## 1      Bob  Unknown  Los Angeles
## 2  Unknown     35.0       Boston
## 3    David     40.0      Houston
## 4      Eve  Unknown      Phoenix

# Fill missing ages with the mean age
mean_age = df['Age'].mean()
# Assign the result back; calling fillna with inplace=True on a selected column may not update df
df['Age'] = df['Age'].fillna(mean_age)
df
##     Name        Age         City
## 0  Alice  25.000000     New York
## 1    Bob  33.333333  Los Angeles
## 2   None  35.000000       Boston
## 3  David  40.000000      Houston
## 4    Eve  33.333333      Phoenix
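
Filling every column with the same value mixes strings into the numeric Age column. As a sketch of two alternatives, fillna also accepts a dictionary that maps column names to fill values, and the separate ffill method carries the last known value forward. The variable names are made up for this illustration, and the DataFrame is rebuilt so that the example does not depend on the changes made above.

```python
import pandas as pd

# Rebuild the sample DataFrame from the start of the tutorial
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age': [25, None, 35, 40, None],
    'City': ['New York', 'Los Angeles', 'Boston', 'Houston', 'Phoenix']
})

# Use a different fill value for each column; Age stays numeric
filled_per_column = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(filled_per_column)

# Forward fill: each missing age takes the previous non-missing age
ages_forward_filled = df['Age'].ffill()
print(ages_forward_filled)
```

Forward filling suits ordered data such as time series. For unordered records like these, a per-column constant or statistic is usually the safer choice.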

Summary