Testing a dataframe using random observations

Introduction

In the past, when I wanted to test whether a script had done the right thing, I would look at a few rows of the dataset I had created and see whether they matched what I expected. Besides being slow when repeated, this process is not rigorous: I can make a mistake while looking, and the new dataset may pass the check simply because I only looked at observations from the beginning or the end. One way to improve on this is to do the same thing, but automated. The idea is the following: take rows at random and check whether each one satisfies some condition, to find out whether the new dataset is what we expect.

This method is useful when the dataset is too large to test every observation, and when observations at the beginning of the dataset may differ from those in the middle or at the end, so that a random sample is needed.

Code

First we generate a new dataset and assume that this dataframe is the result of some script.

We are going to use a pandas dataframe, so we import pandas and then create the dataframe.

import pandas as pd

# Example data standing in for the output of some script
data = {'Numbers': [4, 20, 5, 2, 35],
        'Dummy': [0, 1, 0, 0, 0]}
df = pd.DataFrame(data, columns=['Numbers', 'Dummy'])

Suppose the rule behind this dataframe is that when the variable Numbers is greater than 10 the variable Dummy has to take the value 1, and otherwise 0. The rule holds for the second row, since 20 > 10 and Dummy is 1, but it does not hold for the last row, where Numbers is 35 and Dummy is 0. We are going to test our dataframe by accessing rows at random and checking this condition.

First we import the function that generates random integers.

from random import randint

Now we build a loop with the following structure: for a certain range, generate a random number, use that number to access a row, then check whether the condition is met in that row.

We choose n so that the loop draws as many rows as 20% of the dataframe (one row in this five-row example).

n = round(len(df) * 0.20)
for i in range(n):
    # Pick a random row position
    random_row = randint(0, len(df) - 1)
    row = df.iloc[random_row]
    # Dummy must be 1 when Numbers > 10, and 0 otherwise
    expected = 1 if row["Numbers"] > 10 else 0
    if row["Dummy"] != expected:
        print("Error in row: ", random_row)
        break
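
The same check can also be sketched with pandas' own DataFrame.sample, which draws rows without replacement, so no row is tested twice. This is only a minimal sketch and not the code used above: the names sample, expected and bad_rows, and the fixed random_state, are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Numbers": [4, 20, 5, 2, 35],
                   "Dummy": [0, 1, 0, 0, 0]})

# Draw 20% of the rows without replacement (at least one row).
n = max(1, round(len(df) * 0.20))
sample = df.sample(n=n, random_state=0)  # random_state only makes the draw repeatable

# Expected dummy for the sampled rows: 1 if Numbers > 10, else 0.
expected = (sample["Numbers"] > 10).astype(int)

# Sampled rows where the stored dummy disagrees with the rule.
bad_rows = sample[sample["Dummy"] != expected]

if not bad_rows.empty:
    print("Error in rows:", list(bad_rows.index))
```

Because the comparison is done on the whole sample at once, this version reports every failing row in the sample instead of stopping at the first one.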

The percentage will depend on how large the dataframe is. In some cases it will only be reasonable to test 5% or 1% of the observations (for example, with millions of observations). A loop may not be very efficient, but the point of this type of test is that it is quick to write and replaces the practice of opening the dataframe and checking some observations by eye.
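
One detail worth watching when choosing the percentage: for a very small dataframe, round(len(df) * 0.20) can come out as 0 (for two rows, round(0.4) is 0), and then the loop never runs and the test passes without checking anything. A possible guard is sketched below. The helper sample_size and its minimum parameter are hypothetical names introduced here, not part of the code above.

```python
import math

def sample_size(n_rows, fraction, minimum=1):
    """Rows to test: a fraction of the data, never fewer than `minimum`
    and never more than the number of rows available."""
    if n_rows == 0:
        return 0
    size = math.ceil(n_rows * fraction)
    return min(n_rows, max(minimum, size))

print(sample_size(5, 0.20))          # 1
print(sample_size(2, 0.20))          # 1, where round(2 * 0.20) would give 0
print(sample_size(2_000_000, 0.01))  # roughly 20 thousand rows
```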

Finally, another advantage of this type of test is that when the code changes there is no need to check by hand again: we can run the test immediately after creating the new dataframe.
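
To make rerunning the test after every change easier, the check could be wrapped in a small function. What follows is a sketch under assumptions, not a definitive implementation: check_random_rows and dummy_matches_numbers are hypothetical names, and the rule is passed in as a function so the same helper can be reused with other conditions.

```python
import pandas as pd

def check_random_rows(df, row_is_valid, fraction=0.20, seed=None):
    """Return the index labels of randomly sampled rows
    for which row_is_valid(row) is False."""
    n = min(len(df), max(1, round(len(df) * fraction)))
    sample = df.sample(n=n, random_state=seed)
    return [label for label, row in sample.iterrows() if not row_is_valid(row)]

def dummy_matches_numbers(row):
    # The rule from this post: Dummy is 1 when Numbers > 10, otherwise 0.
    return row["Dummy"] == int(row["Numbers"] > 10)

# Stand-in for the dataframe produced by the script.
df = pd.DataFrame({"Numbers": [4, 20, 5, 2, 35],
                   "Dummy": [0, 1, 0, 0, 0]})

# fraction=1.0 tests every row, which is only sensible for this tiny example.
bad = check_random_rows(df, dummy_matches_numbers, fraction=1.0)
print("Rows breaking the rule:", bad)
```

In a real script one would call the function with a small fraction right after the dataframe is built, and stop (for example with an assert) if the returned list is not empty.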