Happy moments dataset basic analysis

Juan Andrés Cabral

In this tutorial, we explore several techniques for analyzing textual data, focusing on sentiment analysis. We use a dataset of textual entries from a research study that contains more than 100,000 happy moments. The dataset can be found on the author's GitHub repository or downloaded here: Dataset. Here is the paper.

The dataset contains 100,535 entries. Every column has 100,535 non-null entries, except for the ground_truth_category column, which has only 14,125.
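As a rough sketch of how such non-null counts can be obtained, the snippet below builds a tiny stand-in table with pandas and counts the non-null values per column. The three rows are made up for illustration, and the num_sentence column name is an assumption; only cleaned_hm and ground_truth_category are named in this tutorial.

```python
import pandas as pd

# Tiny made-up stand-in for the real table, which has 100,535 rows.
# "num_sentence" is an assumed column name used only for illustration.
df = pd.DataFrame(
    {
        "cleaned_hm": [
            "I went for a walk.",
            "My son smiled at me.",
            "I got a raise.",
        ],
        "num_sentence": [1, 1, 1],
        "ground_truth_category": ["exercise", None, None],
    }
)

# DataFrame.count() returns the number of non-null values in each column.
non_null_counts = df.count()
print(non_null_counts)
```

On the real data, df.info() reports the same counts alongside each column's dtype.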

This table shows that most happy moments are described using just 1 or 2 sentences. However, there are some instances where up to 68 sentences are used.

I'll tokenize the text, remove common English stop words (such as 'the', 'is', and 'in'), and then count how often each remaining word occurs. Finally, I'll display the 20 most frequent words.
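The word-frequency step can be sketched with the standard library alone. This is a minimal illustration, not necessarily the code used to produce the results here: the stop word set is a short hand-picked assumption (real lists are much longer), the regular-expression tokenizer is a simplification, and the sample sentences are made up.

```python
import re
from collections import Counter

# Short illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {
    "the", "is", "in", "a", "an", "and", "i", "my",
    "to", "of", "was", "with", "at", "for",
}


def top_words(texts, n=20):
    """Return the n most frequent non-stop-word tokens across texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then treat each run of letters or apostrophes as a token.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sentences standing in for the cleaned_hm column.
sample = [
    "I had dinner with my family.",
    "My family went to the park.",
    "I finished work early and had dinner with a friend.",
]
print(top_words(sample, n=3))
```

Passing the full list of happy moments and leaving n at its default of 20 would give a top-20 table of the kind described above.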

These are the words that appear most frequently in the descriptions of happy moments. They may give us some initial insight into which aspects of life are commonly associated with happiness. Let's create a word cloud representation.

TextBlob is a Python library that can be used for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

To perform sentiment analysis, TextBlob uses an analyzer that returns a number called polarity:

Polarity: a float within the range [-1.0, 1.0] where -1 means negative sentiment, 0 means neutral sentiment, and 1 means positive sentiment. Let's apply TextBlob to the cleaned_hm column.

Now let's make a histogram.

The histogram above displays the distribution of the polarity scores in the cleaned_hm column. The mean polarity score is 0.23, represented by the red dashed line in the histogram. The standard deviation of the polarity scores is 0.29.

A mean polarity score of 0.23 indicates that sentiment in the dataset leans positive. The standard deviation of 0.29 indicates moderate variability in the sentiment scores. Next, we analyze how the length of the messages relates to the sentiment expressed in the texts.

The histogram above displays the distribution of the message lengths in the cleaned_hm column. The mean message length is 93.10 characters, represented by the red dashed line in the histogram. The standard deviation of the message lengths is 115.63 characters.

This suggests that while the average message length is about 93 characters, lengths vary widely. Because the standard deviation exceeds the mean and a length cannot be negative, the distribution is right-skewed, with a long tail of messages much longer than the average.
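The length statistics can be sketched as follows, using only the standard library. The messages are made up, and the sample standard deviation (statistics.stdev) is an assumption about which variant was reported above.

```python
import statistics

# Made-up messages standing in for the cleaned_hm column.
messages = ["I ran.", "I got a new job today.", "We won!"]

# Message length is measured in characters, including spaces and punctuation.
lengths = [len(m) for m in messages]

mean_len = statistics.mean(lengths)
# Sample standard deviation (divides by n - 1).
std_len = statistics.stdev(lengths)

print(lengths, mean_len, std_len)
```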

Next, we can check if there's any correlation between message length and sentiment. For this, we can use a scatter plot to visualize the relationship between message length and polarity scores.

The scatter plot above shows the relationship between message length and polarity scores in the cleaned_hm column. Each point represents a message, with the x-coordinate indicating the length of the message and the y-coordinate indicating the polarity score.

From the plot, it's hard to see a clear trend or correlation between message length and polarity. The sentiment seems to be evenly distributed across different message lengths, suggesting that in this dataset, the length of the message does not have a strong impact on its sentiment.
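A visual impression like this can be backed by a number. One common choice, not used in the analysis above, is the Pearson correlation coefficient, where values near 0 indicate little linear relationship. The sketch below implements it with the standard library; the lengths and polarity values are made up purely to show the call and are not results from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only, not taken from the dataset.
lengths = [10, 20, 30, 40]
polarities = [0.5, 0.1, 0.4, 0.2]
r = pearson_r(lengths, polarities)
print(r)
```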