In this post, we apply machine learning algorithms to YouTube data to make recommendations on how to get more views.
We will include the end-to-end process of
- Scraping the YouTube data
- Using NLP on the video titles
- Feature engineering
- Building predictive decision trees
- And more
All in Python.
If you want to see how data science may help to get more views/revenues for YouTube channels, take a look.
Let’s get started.
The YouTube channel we are analyzing is Sydney Cummings — our favorite trainer. She has been growing fast and exceeded 200K subscribers recently.
Plus, Sydney has been posting various workout videos daily, which is a good amount of data to analyze.
Haven’t worked out today? Check out one of her recent videos below.
You might have noticed that the titles of her videos follow a standard format. They often include the length, the body parts, the calories burned, and other descriptive words about the workout. Before I click this video, I would know:
- 30 Minute: I will complete the whole workout in 30 minutes.
- SHREDDED Arms and STRONG Glutes: I will be working on arms and glutes and focusing on strength.
- Burn 310 Calories: I will be burning a decent amount of calories.
We will be analyzing this critical information.
Let’s see if we can recommend Sydney any new content creation strategies to improve even further!
Preparation: Scraping the Data
There are different methods of scraping YouTube data. Since it is only a one-time project, we are doing it in the simplest way, which requires manual work, but no extra tools.
Below are the step-by-step procedures:
1. Scroll all the way down the channel’s videos page until all the videos appear.
2. Right-click the newest video and select “Inspect”.
3. Hover the cursor over each line to find the lowest level of HTML code/element that highlights all the videos.
For example, we use the Chrome browser, and it looks like this:

4. Right-click the element and select “Copy” then “Copy element”.
5. Paste the copied element into a text file and save it. We use the JupyterLab text file and save it as sydney.txt.
6. Use Python to extract information and clean up the data.
We are not explaining the details since each case is different. The code is here for your convenience.
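As a rough illustration of step 6, the sketch below pulls titles and links out of the saved file using only Python's standard library. The markup it expects (an `a` tag with `id="video-title"` carrying `title` and `href` attributes) is an assumption about how the channel page looked, so inspect your own sydney.txt and adjust. The names `VideoLinkParser` and `parse_videos` are made up for this example.

```python
from html.parser import HTMLParser


class VideoLinkParser(HTMLParser):
    """Collect title/link pairs from anchors assumed to mark videos."""

    def __init__(self):
        super().__init__()
        self.videos = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: each video is an <a id="video-title" title=... href=...>.
        if tag == "a" and attrs.get("id") == "video-title":
            self.videos.append(
                {"title": attrs.get("title", ""), "link": attrs.get("href", "")}
            )


def parse_videos(html_text):
    parser = VideoLinkParser()
    parser.feed(html_text)
    return parser.videos


# Tiny made-up sample standing in for the contents of sydney.txt.
sample_html = """
<div id="items">
  <a id="video-title" title="30 Minute Full Body Workout - Burn 310 Calories"
     href="/watch?v=abc123">30 Minute Full Body Workout</a>
  <a id="thumbnail" href="/watch?v=abc123"></a>
</div>
"""
videos = parse_videos(sample_html)
print(videos)
```

With the real file, the same function could be called on the text read from sydney.txt. Fields such as views or length would need additional parsing that depends on the copied markup.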
Now we can start the fun part!
We will be extracting features from this dataset and studying the ones that impact the number of views.
Step #1: Observing the Data
We loaded the data into Python in the previous section; let’s take a look at our dataset df_videos.

df_videos has 8 features describing details about each video, including:
- title
- time posted_ago
- length
- views
- link
- calories
- date posted
- days_since_posted
And there are 837 videos posted.
Also, we find that there are duplicates in the data since Sydney uploaded the same videos multiple times. But we’ll ignore them since the duplicates are few.

Step #2: Categorizing the Videos with NLP techniques
In this step, we focus on categorizing the videos based on the keywords in the titles.
We would like to group videos according to:
- Which parts of the body does the video focus on?
- Does the video help us gain strength or lose fat?
- Other keywords?
We use NLP techniques with the Natural Language Toolkit (NLTK) package to process the titles.
Within this post, we are not explaining all the details of the techniques. Take a look at this post (How to use NLP in Python: a Practical Step-by-Step Example) if you are not familiar with them.
Forming the Lists of Keywords
First, we tokenize the titles of the videos.
This procedure explicitly splits the title text string into different tokens (words) with delimiters such as space (“ ”). In this way, the computer program can understand the text better.
There are 538 different words within these titles. The top of the list is below.

Many words are used frequently. This reaffirms that Sydney does use a standard format for her video titles.
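To make the tokenize-and-count step concrete, here is a minimal sketch. It uses a regular expression in place of NLTK's tokenizer so that it runs without extra downloads, and the sample titles are invented in the style described above.

```python
import re
from collections import Counter

# Invented sample titles; the real ones come from df_videos["title"].
titles = [
    "30 Minute Full Body Strength Workout - Burn 310 Calories",
    "40 Minute Abs and Legs Cardio Workout - Burn 450 Calories",
    "30 Minute Full Body Stretch",
]


def tokenize(title):
    # Lower-case the title, then keep runs of letters or digits as tokens.
    return re.findall(r"[a-z0-9]+", title.lower())


# Count how often each token appears across all titles.
word_counts = Counter(token for title in titles for token in tokenize(title))
print(len(word_counts), "different words")
print(word_counts.most_common(5))
```

Sorting the counts this way is what surfaces candidate keywords such as body parts and workout types.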
By looking at the above list, we create 3 lists of keywords that can be used to categorize the videos in the future steps.
- body_keywords — this identifies the body parts that the video is focusing on, such as “full” body, “abs”, “legs”.
- workout_type_keywords — this tells the workout types such as “cardio”, “stretch”, “strength”.
- other_keywords — this includes keywords frequently used but hard to categorize, such as “bootcamp”, “burnout”, “toning”.
Stemming the Lists of Keywords
After forming these lists of keywords, we also stem them. The stemming process makes sure the computer program can match the words with the same meaning.
For example, the words “abs” and “ab” have the same stem “ab”.
Tokenizing and Stemming the YouTube Titles
Besides keywords, we also need to tokenize and stem the titles.
These procedures prepare both the lists of keywords and titles for further matching.
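The stemming of both keywords and titles can be sketched as follows. The post uses NLTK but does not say which stemmer, so `PorterStemmer` is an assumption here, as are the regex tokenizer and the helper names. The keyword lists contain only the examples quoted above.

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few of the keywords mentioned above; the full lists are not shown here.
body_keywords = ["full", "abs", "legs"]
workout_type_keywords = ["cardio", "stretch", "strength"]


def stem_all(words):
    return [stemmer.stem(word) for word in words]


def stem_title(title):
    # Simple regex tokenizer (an assumption), followed by stemming.
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return stem_all(tokens)


body_stems = stem_all(body_keywords)
title_stems = stem_title("30 Minute Abs and Legs Cardio Workout")

# After stemming, "Abs" in the title and "abs" in the keyword list both become "ab".
matched = [stem for stem in title_stems if stem in body_stems]
print(body_stems, matched)
```

Because keywords and titles go through the same stemmer, a plain membership test is enough to match them.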
Now we are ready to build the features!
Step #3: Engineering the Features
After brainstorming, we came up with two main types of features related to Sydney’s YouTube views — keyword-based and time-based. Let’s look at them one by one.
Keyword-based Features
- Indicator Features
Thanks to the hard work from the previous step, we have 3 lists of keywords and the tokenized and stemmed titles. We can now match them to categorize the videos.
For body_keywords and workout_type_keywords, there might be many keywords within one video. So before matching, we also create 2 features area and workout_type. These features concatenate all the body parts and workout types of one video into one string.
For instance, a workout video could do both “ab” and “leg”, or both “cardio” and “strength”. This video would have features area being “ab+leg” and workout_type being “cardio+strength”.
At the same time, we also identify similar keywords such as “total” and “full”, “core” and “ab” and group them.
Finally, we create three different types of dummy features:
- is_{}_area to identify whether a video contains a specific body part.
- is_{}_workout to identify the workout types.
- title_contains_{} to see if the workout title contains other keywords.
To be clear, a video title of “legs strength burnout workout” would have is_leg_area = True, is_strength_workout = True, and title_contains_burnout = True, with all the other indicators being False.
Read the below Python code for details.
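The sketch below illustrates the matching logic under simplifying assumptions. The keyword lists are partial, the synonym mapping only covers the two pairs mentioned above, and the function name `indicator_features` is made up for this example.

```python
# Stemmed keywords (partial, taken from the examples in this post).
body_stems = ["full", "ab", "leg"]
workout_stems = ["cardio", "stretch", "strength"]
other_stems = ["bootcamp", "burnout"]

# Similar keywords mapped to one label, as described above.
synonyms = {"total": "full", "core": "ab"}


def indicator_features(title_stems):
    stems = [synonyms.get(s, s) for s in title_stems]
    areas = sorted({s for s in stems if s in body_stems})
    workouts = sorted({s for s in stems if s in workout_stems})
    # Concatenate all matches of one video into a single string.
    features = {"area": "+".join(areas), "workout_type": "+".join(workouts)}
    for keyword in body_stems:
        features[f"is_{keyword}_area"] = keyword in areas
    for keyword in workout_stems:
        features[f"is_{keyword}_workout"] = keyword in workouts
    for keyword in other_stems:
        features[f"title_contains_{keyword}"] = keyword in stems
    return features


row = indicator_features(["leg", "strength", "burnout", "workout"])
print(row)
```

Applying such a function to every stemmed title yields one row of indicator columns per video.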
- Frequency Features
Besides these indicators, we also create three features called num_body_areas, num_workout_types, and num_other_keywords. They count the number of keywords that are mentioned in one video’s title.
To give an example, the title “abs and legs cardio strength workout” has both num_body_areas and num_workout_types being 2.
These features help us identify the optimal numbers of body parts or workout types a video should include.
- Rate Features
Last but not least, we create a feature calories_per_min that captures the calorie-burning rate: the calories burned divided by the length of the video in minutes.
After all, we all want some clear (quantifiable) goals for our workout.
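The frequency and rate features can be sketched in a few lines. The keyword sets are partial, the function name is hypothetical, and the 310 calories over 30 minutes come from the example title at the top of the post; pairing them with this particular list of stems is only for illustration.

```python
def frequency_and_rate_features(title_stems, calories, length_minutes):
    body_stems = {"full", "ab", "leg"}  # partial example list
    workout_stems = {"cardio", "stretch", "strength"}  # partial example list
    stems = set(title_stems)
    features = {
        "num_body_areas": len(body_stems & stems),
        "num_workout_types": len(workout_stems & stems),
    }
    # Guard against missing calories or a zero/missing length.
    if calories is not None and length_minutes:
        features["calories_per_min"] = calories / length_minutes
    else:
        features["calories_per_min"] = None
    return features


example = frequency_and_rate_features(
    ["ab", "and", "leg", "cardio", "strength", "workout"],
    calories=310,
    length_minutes=30,
)
print(example)
```

A 30 minute video that burns 310 calories works out to roughly 10.3 calories per minute.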
Before moving on to the time-based features, we also fix a few videos that are miscategorized. The process is manual, so we do not include it here.
Time Series-based Features
With the above keyword-based features, we can already find specific popular types of videos. But does that mean Sydney should always post the same types of videos?
To answer this question, we also create some time series-based features:
- num_same_area: the number of videos (including the current one) posted in the past 30 days that focused on the same area.
For example, this feature = 6 when the current video focuses on the upper body, and there were also 5 other upper body workouts in the last 30 days.
- num_same_workout: this feature is like num_same_area, except that it counts the workout types.
For example, this feature = 3 when the current video is a HIIT workout, and there were also 2 other HIIT workouts in the last 30 days.
- last_same_area: the number of days since the last video that focused on the same body area as the current video.
For example, this feature = 10 when the current video focuses on abs, and the previous abs video was 10 days ago.
- last_same_workout: this feature is like last_same_area, except that it compares the workout types.
- num_unique_areas: the number of unique body areas that were worked on in the past 30 days.
- num_unique_workouts: the number of unique workout types that were posted in the past 30 days.
These features help us see whether the audience prefers similar or varied types of videos.
Take a look at the detailed process of feature engineering below. It involves some conversion to suit the rolling functions in Pandas.
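As a simplified sketch of these window features, the code below uses explicit date masks instead of Pandas rolling functions, which is slower but easier to follow. The column names `date_posted` and `area`, the helper `window_features`, and the five sample rows are assumptions made for illustration.

```python
import pandas as pd

# Made-up posting history, sorted from oldest to newest.
df = pd.DataFrame(
    {
        "date_posted": pd.to_datetime(
            ["2020-01-01", "2020-01-05", "2020-01-11", "2020-01-20", "2020-02-15"]
        ),
        "area": ["full", "ab", "full", "ab", "full"],
    }
).sort_values("date_posted", ignore_index=True)

WINDOW = pd.Timedelta(days=30)


def window_features(df, col):
    num_same, num_unique = [], []
    for date, value in zip(df["date_posted"], df[col]):
        # Videos posted in the 30 days up to and including the current one.
        in_window = (df["date_posted"] > date - WINDOW) & (df["date_posted"] <= date)
        num_same.append(int((in_window & (df[col] == value)).sum()))
        num_unique.append(int(df.loc[in_window, col].nunique()))
    return num_same, num_unique


df["num_same_area"], df["num_unique_areas"] = window_features(df, "area")
# Days since the previous video with the same area (NaN for the first one).
df["last_same_area"] = df.groupby("area")["date_posted"].diff().dt.days
print(df)
```

The same helper can be called with a workout type column to build num_same_workout and num_unique_workouts.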
We find out that Sydney posts non-workout-related videos occasionally. They get significantly fewer views and behave differently from the workout videos. So we remove them from the analysis.
We also filter out the first 30 days of videos since they lack enough historical data.
Test for Multicollinearity
Multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors.
Wikipedia
As Wikipedia explains, multicollinearity does not hurt the model as a whole, but it does blur how much each individual feature contributes to the outcome.
Why is this important?
Assume Sydney only posts strength workouts on Mondays, and her videos always get more views on Mondays. Do these videos get higher views because they’re posted on Mondays or because they’re strength workouts?
When making recommendations, we want to answer these types of questions. So we want to make sure there’s no strong collinearity among our features.
Now that we are clear about the reasons for testing for multicollinearity, let’s see which method we should use.
Often we test collinearity using pair-wise correlations, but that is sometimes not sufficient. There could be collinearity among multiple features (more than a pair) at the same time.
So instead, we use a more sophisticated method. At a high level, we use K-fold cross-validation to achieve this.
The detailed procedure is below:
- Select a group of critical features based on our judgment to test for collinearity.
We select the features below since they are essential for predicting the views of YouTube videos.
As you can see, we also add three features, rand0, rand1, and rand2, composed of random numbers. They serve as anchors when comparing the relationship between features. If a predictor is no more important than these random features, then it is not an important predictor of the target feature.
- Prepare these features for K-fold cross-validation.
In this process, we transform the categorical features area and workout_type. This transformation makes sure there is at least K number of values for each category level.
- Train predictive models using one of the features as the target and the rest as the predictors.
Next, we loop through each feature and fit a model to predict it using the other features. We use a simple Gradient Boosting Model (GBM) with K-fold validation.
Depending on whether the target feature is numeric or categorical, we apply different models and scores (model predictive power evaluation metrics).
When the target feature is numeric, we use the Gradient Boosting Regressor model and Root Mean Squared Error (RMSE); when the target feature is categorical, we use the Gradient Boosting Classifier model and Accuracy.
For each target, we print out the K-fold validation score (the average of the scores across folds) and the 5 most important predictors.
- Study the scores and the important predictors for each target feature.
We look into each of the target features and their relationship with the predictors. We won’t cover the entire process here and only explain two examples below.
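The procedure above can be sketched roughly as follows. The post names the models but not the library, so scikit-learn is an assumption, as are K = 5, the reduced `n_estimators`, the helper names, and the synthetic data in which calories is built from length. Note that scikit-learn reports RMSE as a negative score.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

K = 5


def collapse_rare_levels(series, k=K):
    # Ensure every remaining category level has at least k rows.
    counts = series.map(series.value_counts())
    return series.where(counts >= k, "other")


def collinearity_report(df, k=K, top_n=5):
    report = {}
    for target in df.columns:
        y = df[target]
        X = pd.get_dummies(df.drop(columns=[target]), dtype=float)
        if pd.api.types.is_numeric_dtype(y):
            model = GradientBoostingRegressor(n_estimators=50, random_state=0)
            scoring = "neg_root_mean_squared_error"  # i.e. -RMSE
        else:
            model = GradientBoostingClassifier(n_estimators=50, random_state=0)
            scoring = "accuracy"
        score = cross_val_score(model, X, y, cv=k, scoring=scoring).mean()
        importances = pd.Series(model.fit(X, y).feature_importances_, index=X.columns)
        report[target] = (score, importances.nlargest(top_n).index.tolist())
    return report


# Synthetic data: calories depends on length; rand0 is a random anchor.
rng = np.random.default_rng(0)
n = 200
length = rng.integers(20, 61, size=n).astype(float)
demo = pd.DataFrame(
    {
        "length": length,
        "calories": 10 * length + rng.normal(0, 5, size=n),
        "area": collapse_rare_levels(pd.Series(rng.choice(["full", "ab"], size=n))),
        "rand0": rng.random(n),
    }
)
report = collinearity_report(demo)
for target, (score, top) in report.items():
    print(target, round(score, 3), top)
```

In this toy data, length and calories should show up as each other's top predictor, well ahead of the random anchor, which is the pattern to look for in the real features.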
We find out that the length and the calories features are correlated. This finding is intuitive because the longer the workout, the more calories are likely to be burned.

We can also visualize the relationship.

As you can see, there is a positive correlation between length and calories. But it’s not strong enough for us to drop them. The calories burned in 40–45 minute videos overlap with 30–35 minute, 50–55 minute, and even 60+ minute videos. Hence, we are keeping both of them.
Also, we find out that num_same_area and area_full features are correlated. This finding is somewhat surprising. Let’s explore the reason.

The following chart shows the relationship between num_same_area and area.

The feature num_same_area counts the number of videos (including the current one) posted in the past 30 days that focused on the same area. The feature area_full represents a full-body workout, which is the most common type among Sydney’s videos.
As a result, when the num_same_area is large, the only possible area these videos focused on is the full body.
Assume we find out that the higher num_same_area (>=10) does cause higher YouTube views. We wouldn’t be able to know if it was because of the area_full or num_same_area. So we drop the num_same_area feature to prevent this situation.
Besides this, we drop num_same_workout as well using similar logic.
Step #4: Creating the Target
As you may recall, the goal of this project is to increase YouTube views. Should we just use the number of views as our target?
The distribution of views is highly skewed. The median video received 27,641 views, while the most received 1.3 million. This skewness could cause problems for the interpretation of the model.
So, instead of views, we create the feature views_quartile as the target.
We classify videos into two categories — highly viewed videos (“high”) and lesser viewed videos (“low”). “high” is defined as the 75th percentile of views (35,578) or above, and “low” is otherwise.
In this way, we use the predictive model to find the feature combinations that produced the top 25% viewed videos. This new target provides stable results and better insights.
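Creating such a target takes only a couple of lines in Pandas. The view counts below are made up; on the real data the threshold would be the 35,578 views mentioned above.

```python
import pandas as pd

# Made-up view counts; the real values live in df_videos["views"].
df = pd.DataFrame({"views": [10, 20, 30, 40, 50, 60, 70, 80]})

threshold = df["views"].quantile(0.75)  # 75th percentile of views
# "high" at or above the threshold, "low" otherwise.
df["views_quartile"] = (df["views"] >= threshold).map({True: "high", False: "low"})
print(threshold, df["views_quartile"].value_counts().to_dict())
```

By construction, about a quarter of the videos end up in the "high" class.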
Step #5: Building the Decision Tree
Finally, we have everything to build the model!
We train a decision tree model on the target views_quartile.
To avoid overfitting, we set the minimum number of samples in a leaf to 10. To make the interpretation easier for us, we set the maximum depth of the tree to 8 levels.
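A minimal sketch of such a tree is shown below, assuming scikit-learn's `DecisionTreeClassifier` (the post does not name the library). The two features and the rule that generates the labels are synthetic and invented for the demo; only the two hyperparameters come from the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 400
calories_per_min = rng.uniform(6, 16, size=n)
num_unique_areas = rng.integers(3, 15, size=n)
X = np.column_stack([calories_per_min, num_unique_areas])
# Synthetic labelling rule, invented for this demo.
y = np.where((calories_per_min >= 12) & (num_unique_areas < 10), "high", "low")

# Hyperparameters from the text: at least 10 samples per leaf, depth at most 8.
tree = DecisionTreeClassifier(min_samples_leaf=10, max_depth=8, random_state=0)
tree.fit(X, y)

# Print the learned splits as indented text rules.
print(export_text(tree, feature_names=["calories_per_min", "num_unique_areas"]))
```

The printed rules are what we read branch by branch in the next step.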

Related article: How to Visualize a Decision Tree in 5 Steps
Step #6: Reading the Decision Tree
In this last step, we look into and summarize “branches” that lead to either high or low views.
What are the main insights we find?
Insight #1: Calories Burned per Minute is the Most Important Feature
Yes, calories_per_min is the most important feature.
People don’t seem to care as much about the types of workout or the parts of the body.
Out of the workouts with higher calories per minute burned (≥ 12.025), 51/(34+51) = 60% of the videos had higher views.
Videos with fewer calories burned per minute (≤ 9.846) were much less popular than the others: only 12/(154+12) = 7.2% had a high number of views.
For the videos with medium calories burned per minute (between 9.846 and 12.025), other factors came into play.
Of course, everyone wants to be efficient at burning calories!
Insight #2: A Wide Variety of Body Parts Doesn’t Boost the Views
This insight is somewhat different than what we imagined. Shouldn’t the variety of workouts be better?
When the number of unique body areas worked in the past month (num_unique_areas) was high (≥ 10), the videos tended to have lower views. This was true even when the calories burned per minute were high.
Combining the previous two insights, 42/(12+42) = 78% of the videos received high views when
- the calories burned per minute were high (≥ 12.025) and
- the number of unique areas worked in the past month was lower (< 10).
Too many body parts mentioned in the recent month might have created confusion for the audience.
Insight #3: Butt Workout is Popular
When a video burns fewer calories (calories_per_min ≤ 9.846), 5/(10+5) = 33% of the butt workouts still received high views, while only 7/(144+7) = 4.6% of the non-butt workouts did.
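The shares quoted in these insights can be recomputed from the leaf counts. The sketch assumes each pair of numbers in the formulas above is (low-view videos, high-view videos) in that branch, which is how the fractions read; the helper name is made up.

```python
def high_share(low_count, high_count):
    # Fraction of videos in a branch that landed in the "high" class.
    return high_count / (low_count + high_count)


# (low, high) counts as they appear in the formulas in this post.
leaves = {
    "calories_per_min >= 12.025": (34, 51),
    "calories_per_min <= 9.846": (154, 12),
    "calories_per_min >= 12.025 and fewer than 10 unique areas": (12, 42),
    "calories_per_min <= 9.846, butt workout": (10, 5),
    "calories_per_min <= 9.846, not a butt workout": (144, 7),
}
for rule, (low, high) in leaves.items():
    print(f"{rule}: {high_share(low, high):.1%} high")
```

As a sanity check, the butt and non-butt branches together contain 10 + 5 + 144 + 7 = 166 videos, the same as the 154 + 12 videos in the low-calorie branch they split.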
While we can’t see other specific body parts stand out in the tree, Sydney’s audience wants to work out the “butt” area!
Possible Recommendations to Get MORE Views
So, what are some strategies that we could recommend to Sydney?
Strategy #1: Burning more Calories 🔥
As we can see, calories burned per minute is the most crucial feature. 12.025 calories per minute burned seems to be the magic number.
The following list is a good starting point for how many calories a popular video of each length should burn:
- 30 minute workout: 361 calories
- 40 minute workout: 481 calories
- 50 minute workout: 601 calories
- 60 minute workout: 722 calories
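These targets are simply the 12.025 calories per minute threshold multiplied by the video length and rounded to a whole number. The short sketch below reproduces them; it uses `Decimal` with half-up rounding (our choice, to avoid floating-point surprises), and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

THRESHOLD = Decimal("12.025")  # calories per minute, from the tree split


def target_calories(length_minutes):
    exact = THRESHOLD * length_minutes
    # Round to the nearest whole calorie, with halves rounded up.
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


for minutes in (30, 40, 50, 60):
    print(minutes, "minute workout:", target_calories(minutes), "calories")
```

The same function gives a target for any other length, for example a 45 minute workout.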
We suspect that the displaying of the numbers (of length and calories) is psychological. People may like seeing the first two digits of calories form a much larger number than the length.
Strategy #2: Using Fewer Different Body Part Keywords 🔻
Sometimes less is more.
People dislike too many unique body parts described in the workout titles. According to our model, focusing on fewer than 10 body part combinations within a month is better.
We have noticed Sydney using fewer body part keywords in her recent videos. The most obvious example is that she’s been using “arms” or “upper body” rather than words such as “biceps” or “back”.
Strategy #3: Create more Butt Workouts 🍑
Sydney’s subscribers are likely mostly women, who tend to focus on working the “butt” rather than gaining muscular arms. People are willing to burn fewer calories in exchange for a more toned butt. Perhaps Sydney should always include some butt work in the videos that burn fewer calories.
Extra Strategies 📅5️⃣
Besides the above strategies, there are also other ideas worth further investigation.
For example, Sydney can try to:
- Launch new campaigns at the beginning of the month. 📅
Videos posted at the beginning of the month are more likely to get higher views. Perhaps people like setting new goals to kick-start a new month.
- Avoid posting the same types of workout within 5 days. 5️⃣
This is an application we are trying to explore to boost YouTube views. There are limitations:
- These recommendations are based on insights from the past. YouTubers tend to try innovative ideas outside the norm of their past routine. That said, we could apply machine learning to their established competitors to get insights.
- We are focusing the analysis on titles only. There are also other data, such as the caption of the videos that could be scraped. They might contain valuable insights, as well.
- We have less data than the owner of the YouTube channel. There is other critical information, such as subscriber demographics. With it, there might be more features, more insights, and better explanations of the insights.
Thank you for reading. We hope you found this article an interesting read.
Leave a comment to let us know your thoughts!