In this tutorial, we build a deep learning neural network model to classify the sentiment of Yelp reviews.
Following the step-by-step procedures in Python, you’ll see a real-life example and learn:
- How to prepare review text data for sentiment analysis, including NLP techniques.
- How to tune the hyperparameters for the machine learning models.
- How to predict sentiment by building an LSTM model in TensorFlow Keras.
- How to evaluate model performance.
- How sample sizes impact the results compared to a pre-trained tool.
- And more.
If you want to use sentiment analysis to benefit your marketing, you’ll enjoy this post.
Let’s get started!
The example dataset we are using is the Yelp Open Dataset. It contains several kinds of data, but we’ll be focusing on the reviews only.

To take a look at the data, let’s read it into Python in chunks. We keep only two features: the star ratings and the text of the reviews.
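Here’s a minimal sketch of the chunked read; the file name and chunk size are assumptions, so adjust them to your copy of the dataset:

```python
import pandas as pd

# Read the reviews in chunks so we never hold the whole file in memory.
# File name and chunk size are assumptions; adjust to your local copy.
chunks = pd.read_json(
    "yelp_academic_dataset_review.json",
    lines=True,
    chunksize=100_000,
)

# Keep only the two features we need: star rating and review text.
df_review_text = pd.concat(chunk[["stars", "text"]] for chunk in chunks)
```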
We will build a model that can predict the sentiment of the reviews based on their text.

Step #1: Preprocessing the Data for Sentiment Analysis
Observing the Data
Before transforming the dataset df_review_text, let’s take a brief look at it.
We check for any missing values, which returns “num missing text: 0”.
We look at the distribution of the stars from the reviews.
We can see that reviewers are mostly positive, mainly giving 4 or 5 stars.

And we also print out an example of the feature text.
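All three checks take just a few lines:

```python
# Missing values, star distribution, and one example review.
print("num missing text:", df_review_text["text"].isna().sum())
print(df_review_text["stars"].value_counts(normalize=True).sort_index())
print(df_review_text["text"].iloc[0])
```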

Defining the Sentiment
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
Wikipedia
To start the analysis, we must define the classification of sentiment.
What is a positive review? What is a negative review?
This is simple with the stars feature. We create a new feature sentiment with values 0 and 1. The reviews with stars above 3 are “positive”, with a value of 1. Others are “negative”, with a value of 0.
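A sketch of the labeling using that threshold:

```python
# Reviews with more than 3 stars are positive (1); the rest are negative (0).
df_review_text["sentiment"] = (df_review_text["stars"] > 3).astype(int)
print(f"{df_review_text['sentiment'].mean():.2%} positive reviews")
```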
We can see that 65.84% are positive reviews.

Splitting the Dataset into Train and Test
Next, we split the dataset into training and testing sets df_train and df_test by random shuffling.
df_test contains 1% of the original dataset. And it has a similar percentage of positive reviews as df_train.
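One way to produce this split; using scikit-learn’s train_test_split here is an assumption, with a fixed random_state to make the shuffle reproducible:

```python
from sklearn.model_selection import train_test_split

# Hold out 1% as the test set; random shuffling keeps the positive-review
# percentage similar in both sets.
df_train, df_test = train_test_split(
    df_review_text, test_size=0.01, random_state=42, shuffle=True
)
print(df_train["sentiment"].mean(), df_test["sentiment"].mean())
```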


Experimenting with Sample Sizes
The Yelp dataset is easy to label with the feature stars. But in reality, we often don’t have such a dataset, which means manual labeling might be the only solution.
So we want to model with different sample sizes. In the end, we’ll compare the model performance with a pre-trained sentiment model.
We will use three different sample sizes of 200, 2,000, and 20,000.
The code below demonstrates only the 20,000 sample size. A new dataset df_train0 is created by taking the first 20,000 rows from df_train. df_train0 is effectively a random sample of the original dataset, since we shuffled the data when splitting the train and test datasets.
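For example:

```python
# The first 20,000 rows of the already shuffled training set.
sample_size = 20_000
df_train0 = df_train.iloc[:sample_size].copy()
```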

Further Splitting the Dataset into Train and Validation
Also, we split df_train0 further into train and validation datasets, df0_train and df0_val.
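For instance, assuming a 20% validation share:

```python
from sklearn.model_selection import train_test_split

# The 20% validation share is an assumption.
df0_train, df0_val = train_test_split(df_train0, test_size=0.2, random_state=42)
```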
Setting up Target and Features
Then for both df0_train and df0_val, we set the sentiment as the target, and the text as the feature for the analysis.
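For example:

```python
# Target (sentiment) and feature (text) for both splits.
y_train = df0_train["sentiment"].values
y_val = df0_val["sentiment"].values
texts_train = df0_train["text"].values
texts_val = df0_val["text"].values
```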
Preprocessing the Text: Tokenization and Conversion to Sequences
In this procedure, we transform the texts into a numeric form that the model can work with.
We limit the vocabulary length of the text and tokenize them. For an explanation about tokenization, take a look at How to use NLP in Python: a Practical Step-by-Step Example.
Then we transform each text in texts to a sequence of integers.
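A sketch of the tokenization; the vocabulary size max_words is an assumption:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Keep only the most frequent words; the vocabulary size is an assumption.
max_words = 10_000
tokenizer = Tokenizer(num_words=max_words, oov_token="<unk>")
tokenizer.fit_on_texts(texts_train)

# Map each review to a sequence of integer word indices.
X_train_seq = tokenizer.texts_to_sequences(texts_train)
X_val_seq = tokenizer.texts_to_sequences(texts_val)
```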
To print the distribution of the number of words in the new sequences X_train_seq:
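```python
import pandas as pd

# Summary of sequence lengths; useful later for choosing a padding length.
print(pd.Series([len(s) for s in X_train_seq]).describe())
```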

To look at an example of the tokenized and converted review text:
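```python
# A raw review next to its integer-encoded form.
print(texts_train[0])
print(X_train_seq[0])
```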

Now we have the data ready for modeling!
Related article: How to use NLP in Python: a Practical Step-by-Step Example
Step #2: Tuning the Hyperparameters
As mentioned earlier, we are modeling the data with Long Short-Term Memory (LSTM) using TensorFlow Keras neural networks library.
Before fitting, we want to tune the hyperparameters of the model to achieve better performance. If you are not familiar with why and how to optimize the hyperparameters, please take a look at Hyperparameter Tuning with Python: Keras Step-by-Step Guide.
Within the below Python code, we define:
- the LSTM model in Keras
- the hyperparameters of the model
- the objective function/score for the hyperparameters optimization
- the training settings
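Here is a minimal sketch of what that code could look like; the architecture details and hyperparameter names (embedding_dim, lstm_units, dropout, learning_rate) are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(params):
    # Embedding -> LSTM -> sigmoid output for binary sentiment.
    model = tf.keras.Sequential([
        layers.Embedding(
            input_dim=max_words,  # vocabulary size from the tokenization step
            output_dim=params["embedding_dim"],
        ),
        layers.LSTM(params["lstm_units"]),
        layers.Dropout(params["dropout"]),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(params["learning_rate"]),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model
```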
Then we also set the limits for the values of hyperparameters that will be tuned.
We use the same package Ax to set up the experiment for hyperparameter tuning. Again, the details can be found in Hyperparameter Tuning with Python: Keras Step-by-Step Guide.
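A sketch using the Ax service API; the parameter bounds and this exact form of create_experiment are assumptions that depend on your Ax version:

```python
from ax.service.ax_client import AxClient

ax_client = AxClient()
# Limits for the hyperparameter values to be tuned (bounds are assumptions).
ax_client.create_experiment(
    name="lstm_tuning",
    parameters=[
        {"name": "learning_rate", "type": "range",
         "bounds": [1e-4, 1e-2], "log_scale": True},
        {"name": "lstm_units", "type": "range", "bounds": [16, 128]},
        {"name": "embedding_dim", "type": "range", "bounds": [16, 128]},
        {"name": "dropout", "type": "range", "bounds": [0.0, 0.5]},
    ],
    objective_name="keras_cv",
    minimize=False,
)
```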
As you can see from the printed log, ax_client chooses the Sobol+GPEI generation strategy (quasi-random Sobol initialization followed by Gaussian process expected improvement), a type of Bayesian optimization, for this exercise.

Now we can tune these hyperparameters. We run a small number of trials (20) and print the results.
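The tuning loop then looks roughly like this; cross_validate_model is a hypothetical helper that trains and scores the model for a given parameter set:

```python
# Run 20 trials; cross_validate_model() is a hypothetical helper that
# returns the keras_cv score for one set of hyperparameters.
for _ in range(20):
    parameters, trial_index = ax_client.get_next_trial()
    score = cross_validate_model(parameters)
    ax_client.complete_trial(trial_index=trial_index, raw_data=score)
```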
The below table contains the score (keras_cv) and the combinations of hyperparameter values.

The best parameters can then be printed:
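```python
best_parameters, values = ax_client.get_best_parameters()
print(best_parameters)
```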

Let’s move on to fit the model using these hyperparameters.
Related article: Hyperparameter Tuning with Python: Keras Step-by-Step Guide
Step #3: Fitting the LSTM model using Keras
Training the Model
Using the above hyperparameters, we train the model below.
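A sketch of the fit, assuming the tokenized sequences from Step #1; the padding length, epochs, and batch size are assumptions:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad/truncate every sequence to a fixed length so they can be batched.
max_len = 200
X_train = pad_sequences(X_train_seq, maxlen=max_len)
X_val = pad_sequences(X_val_seq, maxlen=max_len)

model = build_model(best_parameters)
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=64,
)
```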

Evaluating the Performance: ROC/AUC
We can use the model to predict the sentiment of the reviews in the test dataset.
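For example:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Encode and pad the test reviews exactly like the training data.
X_test = pad_sequences(
    tokenizer.texts_to_sequences(df_test["text"]), maxlen=max_len
)
y_test = df_test["sentiment"].values
y_pred = model.predict(X_test).ravel()
```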
And based on the above prediction, we can also look at the ROC/AUC of the model.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate and False Positive Rate. An ROC curve plots TPR vs. FPR at different classification thresholds.
AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
Google Developers
To evaluate the model, we calculate the AUC for the LSTM model below.
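```python
from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_test, y_pred))
```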
The AUC is 0.9197, which is pretty good.
Evaluating the Performance: Visualization
We can also visualize the classifications.
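One way to draw this visualization, assuming histograms of the predicted probabilities split by true class:

```python
import matplotlib.pyplot as plt

# Overlay predicted probabilities for negative vs. positive reviews.
plt.hist(y_pred[y_test == 0], bins=50, alpha=0.6, label="negative (0)")
plt.hist(y_pred[y_test == 1], bins=50, alpha=0.6, label="positive (1)")
plt.xlabel("y_pred")
plt.ylabel("count")
plt.legend()
plt.show()
```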
We can see that the majority of positive reviews (orange) have y_pred values closer to 1, and most of the negative reviews (blue) have y_pred values closer to 0.

Evaluating the Performance: by Sample Sizes
As you might recall, we ran the same procedures for different sample sizes of 200, 2,000, and 20,000.
We also want to compare the performance with TextBlob, a pre-trained sentiment analysis tool. Let’s use the TextBlob library to classify our test dataset as well.
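A sketch of the idea: TextBlob’s polarity score lies in [-1, 1], and since AUC is rank-based, we can use the raw polarity directly as the score:

```python
from textblob import TextBlob
from sklearn.metrics import roc_auc_score

# Polarity in [-1, 1]; AUC only cares about ranking, so no rescaling needed.
y_score_tb = df_test["text"].apply(lambda t: TextBlob(t).sentiment.polarity)
print(roc_auc_score(y_test, y_score_tb))
```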
We calculate its AUC, which is 0.8607.
We can also check the visualization of its classification results.

Lastly, let’s look at the performance of the LSTM models and Textblob together.
We use the below code to calculate the FPRs and TPRs.
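```python
from sklearn.metrics import roc_curve

# scores_by_model is a hypothetical dict collecting each model's test-set
# scores; the smaller-sample LSTM scores come from rerunning Steps #1-#3.
scores_by_model = {"LSTM 20,000": y_pred, "TextBlob": y_score_tb}
roc_by_model = {
    label: roc_curve(y_test, scores)[:2]
    for label, scores in scores_by_model.items()
}
```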
Then we put the results all together.
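```python
import matplotlib.pyplot as plt

# Plot every model's ROC curve on the same axes for comparison.
for label, (fpr, tpr) in roc_by_model.items():
    plt.plot(fpr, tpr, label=label)
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```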
The LSTM model with the 20,000 sample size is the winner, but TextBlob beat the models trained on the smaller samples.
So when the labeled sample size is too small, save the effort and try a pre-trained classifier first!

Thank you for reading! Leave a comment if you have any questions. We’ll try our best to answer.
Before you leave, don’t forget to sign up for the Just into Data newsletter below! Or connect with us on Twitter, Facebook.
So you won’t miss any new data science articles from us!
4 thoughts on “How to do Sentiment Analysis with Deep Learning (LSTM Keras): Automatically Classify Reviews as Positive or Negative in Python”
Hi Lianne and Justin,
It is a very nice story and I appreciate the instruction.
Just a quick question about the ax_platform: I tried to run your code in a Jupyter Notebook, but when using the service loop of ax_platform, Jupyter reported an error. I searched for the error on Google but found nothing. Do you know what might be causing it? Thank you so much.
TypeError: type of val must be one of (Dict[str, Union[float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]]]], float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]], List[Tuple[Dict[str, Union[str, bool, float, int, NoneType]], Dict[str, Union[float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]]]]]], List[Tuple[Dict[str, Hashable], Dict[str, Union[float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]]]]]]); got dict instead
During handling of the above exception, another exception occurred:
ValueError: Raw data must be data for a single arm for non batched trials.
Sorry but we’re not too sure what is causing that error. Based on the message, it looks like your dictionary isn’t in the right format that the function expects.
Getting the same error. Did you figure this out?
I’m getting the same error. Anyone find a solution yet?