In this tutorial, we build a deep learning neural network model to classify the sentiment of Yelp reviews.

Following the *step-by-step* procedures in **Python**, you’ll see a real life example and learn:

- How to **prepare** review **text data for sentiment analysis**, including NLP techniques.
- How to **tune the hyperparameters** for the machine learning models.
- How to predict sentiment by **building an LSTM model** in TensorFlow **Keras**.
- How to **evaluate** model **performance**.
- How **sample sizes** **impact** the results compared to a pre-trained tool.
- And more.

If you want to benefit your marketing using sentiment analysis, you’ll enjoy this post.

Let’s get started!

The example dataset we are using is the Yelp Open Dataset. It contains several kinds of data, but we’ll focus on the reviews only.

To take a look at the data, let’s read it in chunks into Python. We only keep two features: the *stars* ratings and the *text* of the reviews.
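
As a rough sketch of the chunked read, the snippet below writes a tiny stand-in file and loads it with pandas. The file name, the chunk size, and the extra `review_id` column are assumptions for illustration; the real Yelp review file is much larger, which is why reading it in chunks helps.

```python
import json
import os
import tempfile

import pandas as pd

# Tiny stand-in for the review file: one JSON object per line.
# File name and columns other than stars/text are illustrative assumptions.
rows = [
    {"review_id": "a1", "stars": 5, "text": "Great tacos and friendly staff."},
    {"review_id": "b2", "stars": 2, "text": "Slow service, cold food."},
    {"review_id": "c3", "stars": 4, "text": "Nice patio, good coffee."},
]
path = os.path.join(tempfile.mkdtemp(), "reviews_sample.json")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file chunk by chunk so the whole dataset never has to be parsed
# at once, keeping only the two columns we need from each chunk.
chunks = []
for chunk in pd.read_json(path, lines=True, chunksize=2):
    chunks.append(chunk[["stars", "text"]])

df_review_text = pd.concat(chunks, ignore_index=True)
print(df_review_text.shape)
```

On the full dataset, a larger `chunksize` (for example, tens of thousands of rows) would be a more practical choice.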

We will build a model that can predict the sentiment of the reviews based on their text.

## Step #1: Preprocessing the Data for Sentiment Analysis

### Observing the Data

Before transforming the dataset *df_review_text*, let’s take a brief look at it.

We check for any missing values, which returns “num missing text: 0”.

We look at the distribution of the stars from the reviews.

We can see that reviewers mostly give positive ratings of 4 or 5 stars.

And we also print out an example of the feature text.
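
The three checks above might look roughly like this in pandas. The toy DataFrame is made up so the snippet runs on its own; only the column names *stars* and *text* come from the tutorial.

```python
import pandas as pd

# Made-up stand-in for df_review_text.
df_review_text = pd.DataFrame({
    "stars": [5, 4, 1, 5, 3, 4],
    "text": ["Loved it", "Pretty good", "Terrible",
             "Best ever", "It was OK", "Would return"],
})

# 1. Count missing review texts.
num_missing = df_review_text["text"].isnull().sum()
print(f"num missing text: {num_missing}")

# 2. Share of reviews for each star rating, ordered from 1 to 5 stars.
star_share = df_review_text["stars"].value_counts(normalize=True).sort_index()
print(star_share)

# 3. Print one example review.
print(df_review_text["text"].iloc[0])
```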

### Defining the Sentiment

> Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
>
> Wikipedia

To start the analysis, we must define the classification of sentiment.

What is a positive review? What is a negative review?

This is simple with the stars feature. We create a new feature *sentiment* with values 0 and 1. Reviews with more than 3 stars are “positive”, with a value of 1. All others are “negative”, with a value of 0.

We can see that 65.84% are positive reviews.
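
A minimal sketch of this labeling rule, using a made-up five-row DataFrame in place of the real data:

```python
import pandas as pd

# Made-up stand-in for df_review_text.
df_review_text = pd.DataFrame({
    "stars": [1, 2, 3, 4, 5],
    "text": ["Awful", "Not great", "Average", "Good", "Excellent"],
})

# Reviews with more than 3 stars are labeled positive (1), the rest negative (0).
df_review_text["sentiment"] = (df_review_text["stars"] > 3).astype(int)

# The mean of a 0/1 column is the share of positive reviews.
positive_share = df_review_text["sentiment"].mean()
print(f"{positive_share:.2%} of reviews are positive")
```

Note that under this rule 3-star reviews count as negative, which is a modeling choice rather than a fact about the data.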

### Splitting the Dataset into Train and Test

Next, we split the dataset into training and testing sets, *df_train* and *df_test*, by random shuffling.

df_test contains 1% of the original dataset. And it has a similar percentage of positive reviews as df_train.
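
One way to sketch such a split with pandas alone is to shuffle the rows and slice off the first 1%. The random seed and the synthetic data are assumptions; a helper such as scikit-learn's `train_test_split` would do the same job.

```python
import pandas as pd

# Synthetic stand-in for the labeled reviews.
df = pd.DataFrame({
    "text": [f"review {i}" for i in range(1000)],
    "sentiment": [1 if i % 3 else 0 for i in range(1000)],
})

# Shuffle all rows once; the seed is arbitrary and only makes the run repeatable.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Hold out 1% of the rows for testing and keep the rest for training.
n_test = round(len(shuffled) * 0.01)
df_test = shuffled.iloc[:n_test]
df_train = shuffled.iloc[n_test:]

print(len(df_train), len(df_test))
print(df_train["sentiment"].mean(), df_test["sentiment"].mean())
```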

### Experimenting with Sample Sizes

The Yelp dataset is easy to label with the feature stars. But in reality, we often don’t have such a dataset, which means manual labeling might be the only solution.

So we want to model with different sample sizes. In the end, we’ll compare the model performance with a pre-trained sentiment model.

We will use three different sample sizes of 200, 2,000, and 20,000.

The code below only demonstrates the 20,000 sample size. A new dataset df_train0 is created by taking the first 20,000 rows from df_train. Because we shuffled the data when splitting the train and test datasets, df_train0 is still a random sample of the original dataset.

### Further Splitting the Dataset into Train and Validation

We also split df_train0 further into training and validation datasets, *df0_train* and *df0_val*.

### Setting up Target and Features

Then for both df0_train and df0_val, we set the sentiment as the target, and the text as the feature for the analysis.
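
The sampling, the train/validation split, and the target/feature setup could be sketched as follows. The 20% validation share and the small synthetic `df_train` are assumptions; the tutorial does not state the split ratio.

```python
import pandas as pd

# Synthetic stand-in for the already-shuffled training set df_train.
df_train = pd.DataFrame({
    "text": [f"review {i}" for i in range(500)],
    "sentiment": [1 if i % 3 else 0 for i in range(500)],
})

# Take the first N rows as the labeled sample (the article uses up to 20,000).
sample_size = 200
df_train0 = df_train.head(sample_size)

# Split the sample into validation and training parts (ratio is an assumption).
val_fraction = 0.2
n_val = round(len(df_train0) * val_fraction)
df0_val = df_train0.iloc[:n_val]
df0_train = df_train0.iloc[n_val:]

# Text is the feature, sentiment is the target.
X_train, y_train = df0_train["text"].tolist(), df0_train["sentiment"].to_numpy()
X_val, y_val = df0_val["text"].tolist(), df0_val["sentiment"].to_numpy()

print(len(X_train), len(X_val))
```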

### Preprocessing the Text: Tokenization and Conversion to Sequences

In this step, we transform the text into a form that the model can work with.

We limit the vocabulary size and tokenize the text. For an explanation of tokenization, take a look at How to use NLP in Python: a Practical Step-by-Step Example.

Then we transform each text in texts to a sequence of integers.

To print the distribution of the number of words in the new sequences X_train_seq:

To look at an example of the tokenized and converted review text:
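
To make the idea concrete, here is a plain-Python sketch of limiting the vocabulary and converting texts to integer sequences. It is a simplified stand-in for a library tokenizer, not the tutorial's actual code; the function names, the regular expression, and the vocabulary limit are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_vocab(texts, num_words):
    """Map the most frequent words to ids 1..num_words-1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in tokenize(text))
    most_common = counts.most_common(num_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def texts_to_sequences(texts, vocab):
    """Replace each known word by its id; words outside the vocabulary are dropped."""
    return [[vocab[w] for w in tokenize(t) if w in vocab] for t in texts]


X_train = [
    "The food was great",
    "The service was slow",
    "Great food, great service",
]
vocab = fit_vocab(X_train, num_words=6)
X_train_seq = texts_to_sequences(X_train, vocab)

# Sequence lengths show how many in-vocabulary words each review keeps.
print([len(seq) for seq in X_train_seq])
print(X_train_seq[0])
```

Here "slow" occurs only once, so it falls outside the five-word vocabulary and is dropped from the second review.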

Now we have the data ready for modeling!

**Related article:** How to use NLP in Python: a Practical Step-by-Step Example

## Step #2: Tuning the Hyperparameters

As mentioned earlier, we model the data with a Long Short-Term Memory (**LSTM**) network, using the **TensorFlow** **Keras** neural network library.

Before fitting, we want to tune the hyperparameters of the model to achieve better performance. If you are not familiar with why and how to optimize the hyperparameters, please take a look at Hyperparameter Tuning with Python: Keras Step-by-Step Guide.

In the Python code below, we define:

- the LSTM model in Keras
- the hyperparameters of the model
- the objective function/score for the hyperparameters optimization
- the training settings
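
A model of this kind might be defined roughly as below. The architecture (embedding, one LSTM layer, dropout, sigmoid output), the default values, and the function name `build_lstm_model` are assumptions for illustration, not the tutorial's exact setup.

```python
def build_lstm_model(vocab_size, max_len, embedding_dim=32, lstm_units=32,
                     dropout=0.2, learning_rate=1e-3):
    """Build and compile a small binary-classification LSTM (illustrative defaults)."""
    # Imported inside the function so the sketch can be loaded
    # even on a machine without TensorFlow installed.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),                 # padded integer sequences
        tf.keras.layers.Embedding(input_dim=vocab_size,   # word id -> dense vector
                                  output_dim=embedding_dim),
        tf.keras.layers.LSTM(lstm_units),                 # reads the sequence in order
        tf.keras.layers.Dropout(dropout),                 # regularization
        tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of "positive"
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model
```

Calling, say, `build_lstm_model(vocab_size=5000, max_len=200)` returns a compiled model that can then be trained with `model.fit` on padded sequences. The keyword arguments are the natural candidates for tuning.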

Then we also set the limits for the values of hyperparameters that will be tuned.

We use the same package Ax to set up the experiment for hyperparameter tuning. Again, the details can be found in Hyperparameter Tuning with Python: Keras Step-by-Step Guide.

As you can see from the printed log, Ax_client chooses the *Sobol+GPEI* strategy for this exercise: Sobol sampling for the initial trials, followed by a Gaussian process with expected improvement, which is a type of Bayesian optimization.

Now we can tune these hyperparameters. We run a small number of 20 trials and print the results.
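
Ax's own API is not reproduced here. As a library-free illustration of what a tuning loop does, the sketch below swaps in plain random search (not the Sobol+GPEI strategy) over a hypothetical search space. The parameter names, their bounds, and the `toy_score` function, which stands in for "train the model and return its validation score", are all assumptions.

```python
import math
import random

# Hypothetical search space; names and bounds are illustrative assumptions.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-2),  # sampled on a log scale
    "dropout": (0.0, 0.5),
    "lstm_units": (16, 128),        # integer range, inclusive
}


def sample_params(rng):
    """Draw one random combination of hyperparameters within the bounds."""
    lr_low, lr_high = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(lr_low), math.log10(lr_high)),
        "dropout": rng.uniform(*SEARCH_SPACE["dropout"]),
        "lstm_units": rng.randint(*SEARCH_SPACE["lstm_units"]),
    }


def toy_score(params):
    """Stand-in for training the model and returning a validation score."""
    lr_penalty = abs(math.log10(params["learning_rate"]) + 3) / 10
    dropout_penalty = abs(params["dropout"] - 0.2)
    return 1.0 - lr_penalty - dropout_penalty


rng = random.Random(0)
trials = []
for _ in range(20):  # same number of trials as in the article
    params = sample_params(rng)
    trials.append({"score": toy_score(params), **params})

best_trial = max(trials, key=lambda trial: trial["score"])
print(best_trial)
```

A Bayesian method differs in one key step: instead of sampling each trial independently, it uses the scores of earlier trials to decide where to sample next.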

The table below contains the score (keras_cv) and the combinations of hyperparameter values.

We can then print the best parameters.

Let’s move on to fit the model using these hyperparameters.

**Related article:** Hyperparameter Tuning with Python: Keras Step-by-Step Guide

## Step #3: Fitting the LSTM model using Keras

### Training the Model

Using the above hyperparameters, we train the model below.

### Evaluating the Performance: ROC/AUC

We can use the model to predict classification of reviews for the test dataset.

And based on the above prediction, we can also look at the ROC/AUC of the model.

> An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate and False Positive Rate. An ROC curve plots TPR vs. FPR at different classification thresholds.
>
> Google Developers

> AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

To evaluate the model, we calculate the AUC for the LSTM model below.

The AUC is 0.9197, which is pretty good.
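
For intuition, AUC equals the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four made-up predictions; the function name `auc_score` is hypothetical, and the numbers are not the tutorial's results.

```python
def auc_score(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_test = [0, 0, 1, 1]
y_pred = [0.1, 0.4, 0.35, 0.8]
print(auc_score(y_test, y_pred))  # 0.75
```

This pairwise version is easy to read but slow on large test sets, where a library routine such as scikit-learn's `roc_auc_score` is the practical choice.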

### Evaluating the Performance: Visualization

We can also visualize the classifications.

We can see that the majority of positive reviews (orange) have y_pred values closer to 1, and most of the negative reviews (blue) have y_pred values closer to 0.

### Evaluating the Performance: by Sample Sizes

As you might recall, we ran the same procedures for different sample sizes of 200, 2,000, and 20,000.

We also want to compare the performance with TextBlob, a pre-trained sentiment tool. Let’s use the TextBlob library to classify our test dataset as well.

We calculate its AUC, which is 0.8607.
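
A possible way to score reviews with TextBlob is to use its polarity value, which ranges from -1 (negative) to 1 (positive). Treating polarity as the score and 0 as the cut-off are assumptions here; the tutorial does not show how it derived its predictions.

```python
def textblob_scores(texts):
    """Return the TextBlob polarity (between -1 and 1) for each text."""
    # Imported inside the function so the sketch loads
    # even where TextBlob is not installed.
    from textblob import TextBlob

    return [TextBlob(text).sentiment.polarity for text in texts]


def to_labels(scores, threshold=0.0):
    """Turn polarity scores into 0/1 sentiment labels (threshold is an assumption)."""
    return [1 if score > threshold else 0 for score in scores]


# Hard-coded example scores, so this line runs without TextBlob.
print(to_labels([-0.5, 0.0, 0.7]))  # [0, 0, 1]
```

The raw polarity scores, rather than the 0/1 labels, are what an AUC calculation would take as input, since AUC depends on how the examples are ranked.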

We can also check the visualization of its classification results.

Lastly, let’s look at the performance of the LSTM models and TextBlob together.

We use the below code to calculate the FPRs and TPRs.
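
As an illustration of what such a calculation involves, the sketch below sweeps the decision threshold over every distinct score and records the false positive rate and true positive rate at each step. The function name `roc_points` and the four data points are made up.

```python
def roc_points(y_true, y_score):
    """Return lists of FPRs and TPRs, one point per distinct score threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    fprs, tprs = [0.0], [0.0]  # the curve starts at (0, 0)
    for threshold in sorted(set(y_score), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        tprs.append(tp / n_pos)
        fprs.append(fp / n_neg)
    return fprs, tprs


# Made-up labels and scores.
y_test = [0, 0, 1, 1]
y_pred = [0.1, 0.4, 0.35, 0.8]
fprs, tprs = roc_points(y_test, y_pred)
print(list(zip(fprs, tprs)))
```

Running the same function on each model's scores gives one curve per model, and plotting them on shared axes makes the comparison direct.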

Then we put the results all together.

The LSTM model trained on the 20,000 sample is the winner, but TextBlob beats the models trained on the smaller samples.

So when the labeled sample size is too small, save the effort and try the built-in classifier first!

Thank you for reading! Leave a comment if you have any questions. We’ll try our best to answer.


## 2 thoughts on “How to do Sentiment Analysis with Deep Learning (LSTM Keras): Automatically Classify Reviews as Positive or Negative in Python”

Kia411

Hi Lianne and Justin,

It is a very nice story and I appreciate the instruction.

Just a quick question about the ax_platform, I tried to run your code on Jupyter Notebook. However, for using the service loop of ax_platform, Jupyter reported an error. I searched the error on google but found nothing. Just wonder do you know what causes the error? Thank you so much.

TypeError: type of val must be one of (Dict[str, Union[float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]]]], float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]], List[Tuple[Dict[str, Union[str, bool, float, int, NoneType]], Dict[str, Union[float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]]]]]], List[Tuple[Dict[str, Hashable], Dict[str, Union[float, numpy.floating, numpy.integer, Tuple[Union[float, numpy.floating, numpy.integer], Union[float, numpy.floating, numpy.integer, NoneType]]]]]]); got dict instead

During handling of the above exception, another exception occurred:

ValueError: Raw data must be data for a single arm for non batched trials.

Lianne & Justin

Sorry but we’re not too sure what is causing that error. Based on the message, it looks like your dictionary isn’t in the right format that the function expects.