How to do Sentiment Analysis with Deep Learning (LSTM Keras)
 Automatically Classify Reviews as Positive or Negative in Python

Lianne & Justin


In this tutorial, we build a deep learning neural network model to classify the sentiment of Yelp reviews.

Following the step-by-step procedures in Python, you’ll see a real-life example and learn:

  • How to prepare review text data for sentiment analysis, including NLP techniques.
  • How to tune the hyperparameters for the machine learning models.
  • How to predict sentiment by building an LSTM model in Tensorflow Keras.
  • How to evaluate model performance.
  • How sample sizes impact the results compared to a pre-trained tool.
  • And more.

If you want to use sentiment analysis to improve your marketing, you’ll enjoy this post.

Let’s get started!



The example dataset we are using is the Yelp Open Dataset. It contains several kinds of data, but we’ll focus on the reviews only.

Yelp open dataset reviews

To take a look at the data, let’s read it in chunks into Python. We only keep two features: stars ratings and text of the reviews.
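As a minimal sketch of this step, assuming the reviews come as a JSON-lines file (one review per line), pandas can read the file in chunks and keep only the two columns. The helper name `read_review_text`, the chunk size, and the tiny stand-in file are illustrative assumptions, not the original code.

```python
import json
import os
import tempfile

import pandas as pd


def read_review_text(path, chunksize=100_000):
    """Read a JSON-lines review file in chunks, keeping only stars and text."""
    reader = pd.read_json(path, lines=True, chunksize=chunksize)
    chunks = [chunk[["stars", "text"]] for chunk in reader]
    return pd.concat(chunks, ignore_index=True)


# Tiny stand-in file so the sketch runs without the real dataset.
sample_rows = [
    {"review_id": "a1", "stars": 5, "text": "Great food and friendly staff."},
    {"review_id": "b2", "stars": 1, "text": "Cold fries, slow service."},
    {"review_id": "c3", "stars": 4, "text": "Nice patio, would come back."},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")
    sample_path = f.name

df_review_text = read_review_text(sample_path, chunksize=2)
os.remove(sample_path)
print(df_review_text.shape)
```

Reading in chunks keeps memory use bounded, because the columns we don’t need are dropped from each chunk before the chunks are concatenated.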

We will build a model that can predict the sentiment of the reviews based on their text.

The number of records may differ from the current version of the dataset, but this doesn’t impact the analysis.

Step #1: Preprocessing the Data for Sentiment Analysis

Observing the Data

Before transforming the dataset df_review_text, let’s take a brief look at it.

We check for any missing values, which returns “num missing text: 0”.

We look at the distribution of the stars from the reviews.

We can see that reviewers are mostly positive, giving mainly 4 or 5 stars.

Yelp reviews stars distribution

And we also print out an example of the feature text.
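The three checks in this section could look roughly like the following. This is a sketch on a small made-up DataFrame standing in for df_review_text, so the numbers it prints are not the article’s results.

```python
import pandas as pd

# Made-up stand-in for df_review_text.
df_review_text = pd.DataFrame(
    {
        "stars": [5, 4, 1, 5, 3, 4],
        "text": ["Loved it", "Pretty good", "Awful", "Best tacos", "Okay", "Solid"],
    }
)

# 1. Count missing review texts.
num_missing_text = int(df_review_text["text"].isnull().sum())
print(f"num missing text: {num_missing_text}")

# 2. Share of reviews per star rating, ordered from 1 to 5 stars.
stars_share = df_review_text["stars"].value_counts(normalize=True).sort_index()
print(stars_share)

# 3. One example of the text feature.
print(df_review_text["text"].iloc[0])
```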

Defining the Sentiment

Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

Wikipedia

To start the analysis, we must define the classification of sentiment.

What is a positive review? What is a negative review?

This is simple with the stars feature. We create a new feature sentiment with values 0 and 1. The reviews with stars above 3 are “positive”, with a value of 1. Others are “negative”, with a value of 0.
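The rule above (stars above 3 are positive) can be sketched in one line of pandas. The DataFrame here is a made-up stand-in for illustration.

```python
import pandas as pd

# Made-up stand-in with one review per star rating.
df_review_text = pd.DataFrame(
    {
        "stars": [1, 2, 3, 4, 5],
        "text": ["Awful", "Meh", "Okay", "Good", "Great"],
    }
)

# 1 = positive (4 or 5 stars), 0 = negative (1 to 3 stars).
df_review_text["sentiment"] = (df_review_text["stars"] > 3).astype(int)

# The mean of a 0/1 column is the share of positive reviews.
positive_share = df_review_text["sentiment"].mean()
print(f"{positive_share:.2%} positive")
```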

We can see that 65.84% are positive reviews.

Yelp open dataset sentiment

Splitting the Dataset into Train and Test

Next, we split the dataset into training and testing sets df_train and df_test by random shuffling.

df_test contains 1% of the original dataset. And it has a similar percentage of positive reviews as df_train.
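One way to do such a split is sketched below with plain pandas. The function name `split_train_test` and the random seed are assumptions, and the data is synthetic.

```python
import pandas as pd


def split_train_test(df, test_frac=0.01, seed=42):
    """Shuffle the rows, then carve off the first test_frac of them as the test set."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = max(1, int(round(len(shuffled) * test_frac)))
    df_test = shuffled.iloc[:n_test].reset_index(drop=True)
    df_train = shuffled.iloc[n_test:].reset_index(drop=True)
    return df_train, df_test


# Synthetic data: roughly two thirds positive.
df = pd.DataFrame(
    {
        "text": [f"review {i}" for i in range(1000)],
        "sentiment": [int(i % 3 != 0) for i in range(1000)],
    }
)
df_train, df_test = split_train_test(df, test_frac=0.01)

print(len(df_train), len(df_test))
print(df_train["sentiment"].mean(), df_test["sentiment"].mean())
```

Because the rows are shuffled before slicing, any prefix of df_train is itself a random sample, which the next section relies on. With only 1% of the data in df_test, it is worth printing both positive shares to confirm they are close.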

Yelp open dataset sentiment

Experimenting with Sample Sizes

The Yelp dataset is easy to label with the feature stars. But in reality, we often don’t have such a dataset, which means manual labeling might be the only solution.

So we want to model with different sample sizes. In the end, we’ll compare the model performance with a pre-trained sentiment model.

We will use three different sample sizes of 200, 2,000, and 20,000.

We only demonstrate the 20,000 sample size here. A new dataset df_train0 is created by taking the first 20,000 rows from df_train. Since we shuffled the data when splitting the train and test datasets, df_train0 is a random sample of the original dataset.

Yelp open dataset sentiment

Further Splitting the Dataset into Train and Validation

We also split df_train0 further into training and validation datasets, df0_train and df0_val.

Setting up Target and Features

Then for both df0_train and df0_val, we set the sentiment as the target, and the text as the feature for the analysis.
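The sampling, the train/validation split, and the target/feature setup might be sketched as follows. The 20% validation share and the names `X_train`, `y_train`, `X_val`, and `y_val` are assumptions, since the original values are not shown, and df_train here is synthetic.

```python
import pandas as pd

SAMPLE_SIZE = 20_000
VAL_FRAC = 0.2  # assumed validation share; not stated in the article

# Synthetic stand-in for the already shuffled df_train.
df_train = pd.DataFrame(
    {
        "text": [f"review {i}" for i in range(25_000)],
        "sentiment": [i % 2 for i in range(25_000)],
    }
)

# The first rows of a shuffled dataset form a random sample.
df_train0 = df_train.head(SAMPLE_SIZE)

# Split the sample again into validation and training parts.
n_val = int(len(df_train0) * VAL_FRAC)
df0_val = df_train0.iloc[:n_val]
df0_train = df_train0.iloc[n_val:]

# Text is the feature, sentiment is the target.
X_train, y_train = df0_train["text"].tolist(), df0_train["sentiment"].to_numpy()
X_val, y_val = df0_val["text"].tolist(), df0_val["sentiment"].to_numpy()

print(len(X_train), len(X_val))
```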

Preprocessing the Text: Tokenization and Conversion to Sequences

In this procedure, we transform the text into a form the computer can work with.

We limit the vocabulary size and tokenize the text. For an explanation of tokenization, take a look at How to use NLP in Python: a Practical Step-by-Step Example.

Then we transform each review text into a sequence of integers.

We then print the distribution of the number of words per sequence in X_train_seq.

We also look at an example of the tokenized and converted review text.
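To make the idea concrete, here is a library-free sketch of the same steps: limit the vocabulary to the most frequent words, map each word to an integer, and pad the sequences to a fixed length. It is a plain-Python stand-in for illustration rather than the Keras preprocessing utilities, and `MAX_WORDS`, `MAX_LEN`, and the helper names are assumptions.

```python
import re
from collections import Counter

MAX_WORDS = 3  # assumed vocabulary limit, tiny for the demo
MAX_LEN = 4    # assumed fixed sequence length


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_vocab(texts, max_words):
    """Map the max_words most frequent words to 1..max_words (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words), start=1)}


def texts_to_sequences(texts, vocab):
    """Replace each known word by its integer; words outside the vocabulary are dropped."""
    return [[vocab[w] for w in tokenize(t) if w in vocab] for t in texts]


def pad_sequences(sequences, max_len):
    """Left-pad with 0 and keep only the last max_len tokens of longer sequences."""
    return [[0] * (max_len - len(s)) + s[-max_len:] for s in sequences]


texts = [
    "The food was great, the service was great",
    "The patio was nice",
    "The end",
]
vocab = fit_vocab(texts, MAX_WORDS)
X_train_seq = texts_to_sequences(texts, vocab)
X_train_pad = pad_sequences(X_train_seq, MAX_LEN)

print(vocab)
print([len(s) for s in X_train_seq])  # number of kept words per review
print(X_train_pad[0])
```

Fitting the vocabulary on the training texts only, and reusing it for the validation and test texts, keeps the integer mapping consistent across datasets.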

Yelp open dataset text to sequence

Now we have the data ready for modeling!

Related article: How to use NLP in Python: a Practical Step-by-Step Example

Step #2: Tuning the Hyperparameters

As mentioned earlier, we model the data with Long Short-Term Memory (LSTM) using the TensorFlow Keras neural network library.

Before fitting, we want to tune the hyperparameters of the model to achieve better performance. If you are not familiar with why and how to optimize the hyperparameters, please take a look at Hyperparameter Tuning with Python: Keras Step-by-Step Guide.

To set up the tuning, we define:

  • the LSTM model in Keras
  • the hyperparameters of the model
  • the objective function/score for the hyperparameters optimization
  • the training settings

Then we also set the limits for the values of hyperparameters that will be tuned.
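A model definition along these lines could look like the sketch below. The architecture (embedding, one LSTM layer, sigmoid output), the hyperparameter names, and the default values are assumptions for illustration, not the article’s exact model.

```python
import numpy as np
import tensorflow as tf


def build_lstm_model(vocab_size, embedding_dim=32, lstm_units=32,
                     dropout=0.2, learning_rate=1e-3):
    """Build a small binary sentiment classifier: Embedding -> LSTM -> sigmoid."""
    model = tf.keras.Sequential(
        [
            # Turns each integer token into a dense vector.
            tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
            # Reads the sequence and returns its final hidden state.
            tf.keras.layers.LSTM(lstm_units, dropout=dropout),
            # Outputs a score between 0 (negative) and 1 (positive).
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model


model = build_lstm_model(vocab_size=1000)

# Two dummy padded sequences of length 10, just to check the output shape.
dummy_batch = np.zeros((2, 10), dtype="int32")
probs = model.predict(dummy_batch, verbose=0)
print(probs.shape)
```

Wrapping the model in a function that takes the hyperparameters as arguments makes it easy for a tuning loop to rebuild it with a new combination on every trial.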

We use the same Ax package as in that guide to set up the experiment for hyperparameter tuning. Again, the details can be found in Hyperparameter Tuning with Python: Keras Step-by-Step Guide.

As you can see from the printed log, Ax_client chooses the Sobol+GPEI strategy for this exercise: a quasi-random Sobol initialization followed by Gaussian process-based Bayesian optimization.

Ax package hyperparameter tuning log note

Now we can tune these hyperparameters. We run just 20 trials and print the results.

The table below contains the score (keras_cv) and the combinations of hyperparameter values.

Hyperparameter tuning Ax package result

We can also print the best parameters.
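To show the shape of such a tuning loop without depending on a specific Ax version, here is a library-free sketch. It uses plain random search, not the Sobol+GPEI strategy and not the Ax API, and the bounds and the `toy_score` objective are hypothetical. In practice the objective would train the model and return its validation score.

```python
import math
import random

# Hypothetical bounds; the article's actual ranges are not shown.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-2),  # sampled on a log scale
    "dropout": (0.0, 0.5),
    "lstm_units": (8, 64),          # integer range, inclusive
}


def sample_params(rng):
    """Draw one hyperparameter combination from inside the bounds."""
    lr_lo, lr_hi = SEARCH_SPACE["learning_rate"]
    do_lo, do_hi = SEARCH_SPACE["dropout"]
    un_lo, un_hi = SEARCH_SPACE["lstm_units"]
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(lr_lo), math.log10(lr_hi)),
        "dropout": rng.uniform(do_lo, do_hi),
        "lstm_units": rng.randint(un_lo, un_hi),
    }


def toy_score(params):
    """Placeholder objective standing in for a real validation score."""
    lr_penalty = abs(math.log10(params["learning_rate"]) + 3) / 10
    return 1.0 - abs(params["dropout"] - 0.2) - lr_penalty


def random_search(score_fn, n_trials=20, seed=0):
    """Evaluate n_trials random combinations and return all trials plus the best one."""
    rng = random.Random(seed)
    trials = []
    for i in range(n_trials):
        params = sample_params(rng)
        trials.append({"trial": i, "score": score_fn(params), **params})
    best = max(trials, key=lambda t: t["score"])
    return trials, best


trials, best = random_search(toy_score, n_trials=20)
for t in trials:
    print(t)
print("best:", best)
```

A Bayesian optimizer differs from this sketch in how it picks the next combination: it uses the scores of earlier trials instead of sampling blindly.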

hyperparameter tuning Ax best parameters Keras LSTM

Let’s move on to fit the model using these hyperparameters.

Related article: Hyperparameter Tuning with Python: Keras Step-by-Step Guide

Step #3: Fitting the LSTM model using Keras

Training the Model

Using the above hyperparameters, we train the model.

Training LSTM with Keras

Evaluating the Performance: ROC/AUC

We can use the model to predict the classification of reviews in the test dataset.

And based on the above prediction, we can also look at the ROC/AUC of the model.

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate and False Positive Rate. An ROC curve plots TPR vs. FPR at different classification thresholds.

AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

Google Developers

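To evaluate the model, we calculate the AUC for the LSTM model.

The AUC is 0.9197, which is pretty good: an AUC of 1.0 would mean every positive review is ranked above every negative one, while 0.5 corresponds to random guessing.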
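With scikit-learn, the AUC calculation is a single call to `roc_auc_score`. The labels and scores below are a tiny made-up example, not the model’s predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted scores for four reviews.
y_test = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_test, y_pred)
print(f"AUC: {auc:.4f}")  # AUC: 0.7500
```

Of the four positive/negative pairs in this example, three have the positive review scored higher, which gives 3/4 = 0.75. AUC only depends on how the scores rank the reviews, not on their absolute values.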

Evaluating the Performance: Visualization

We can also visualize the classifications.

We can see that the majority of positive reviews (orange) have y_pred values closer to 1, and most of the negative reviews (blue) have y_pred values closer to 0.

Yelp reviews sentiment LSTM model classification

Evaluating the Performance: by Sample Sizes

As you might recall, we ran the same procedures for different sample sizes of 200, 2,000, and 20,000.

We also want to compare the performance with a pre-trained sentiment tool, TextBlob. Let’s use the TextBlob library to classify our test dataset as well.

We calculate its AUC, which is 0.8607.

We can also check the visualization of its classification results.

Yelp reviews sentiment Textblob model classification

Lastly, let’s look at the performance of the LSTM models and TextBlob together.

We first calculate the FPRs and TPRs.
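Computing the ROC points for several models at once might be sketched like this with scikit-learn’s `roc_curve`. The model names and scores are made up for illustration. In the article’s setting, the dictionary would hold the test-set scores of each LSTM model and of TextBlob.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and scores for two hypothetical models.
y_test = np.array([0, 0, 1, 1, 0, 1])
scores_by_model = {
    "model_a": np.array([0.1, 0.2, 0.9, 0.8, 0.3, 0.7]),  # separates the classes perfectly
    "model_b": np.array([0.6, 0.2, 0.9, 0.4, 0.3, 0.7]),  # makes one ranking mistake
}

roc_by_model = {}
for name, y_score in scores_by_model.items():
    # fpr and tpr hold one point of the ROC curve per threshold.
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_by_model[name] = {
        "fpr": fpr,
        "tpr": tpr,
        "auc": roc_auc_score(y_test, y_score),
    }

for name, roc in roc_by_model.items():
    print(f"{name}: AUC = {roc['auc']:.4f}")
```

Plotting each model’s tpr against its fpr on the same axes then gives the combined ROC chart.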

Then we put the results all together.

The LSTM model with the 20,000 sample size is the winner. But TextBlob beat the models trained on the smaller samples.

So when the labeled sample size is too small, save the effort and try the pre-trained classifier first!

Yelp reviews sentiment LSTM model by sample sizes vs Textblob

Thank you for reading! Leave a comment if you have any questions. We’ll try our best to answer.
