How to use NLP in Python: a Practical Step-by-Step Example
To find out the In-Demand Skills for Data Scientists with NLTK

Lianne & Justin


In this article, we present a step-by-step NLP application on Indeed job postings.

This is the technical companion to our previous article, in which we summarized the in-demand skills for data scientists: the top tools, skills, and minimum education most often required by employers.

If you want to see a practical example using the Natural Language Toolkit (NLTK) package with Python code, this post is for you.

Let’s dive right in.



Preparation: Scraping the Data

We scrape the job postings for “data scientists” from Indeed for 8 different cities. Upon scraping, we download the data into separate files for each of the cities.

The 8 cities included in this analysis are Boston, Chicago, Los Angeles, Montreal, New York, San Francisco, Toronto, and Vancouver. The variables are job_title, company, location, and job_description.

We do not go into the details of the scraping process in this article. Take a look at the code here if you’re interested.

We are ready for the real analysis! We’ll summarize the popular tools, skills, and minimum education required by the employers from this data.

Step #1: Loading and Cleaning the Data

First, we load and combine the data files of the 8 cities into Python.

We remove duplicate rows (job postings) that have the same job_title, job_description, and city values.

Now we have a dataset of 5 features and 2,681 rows.
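
As a rough sketch of this step, assuming each city's postings were saved as a CSV file, the loading and deduplication could be done with pandas along these lines. The file-name pattern, the `load_city_files` helper, and the way the city column is attached are assumptions made for illustration; only the column names come from the description above.

```python
import pandas as pd

# Cities covered by the analysis (from the article).
CITIES = ["boston", "chicago", "los angeles", "montreal",
          "new york", "san francisco", "toronto", "vancouver"]


def load_city_files(path_template="indeed_{city}.csv"):
    """Read one CSV per city and tag each row with its city.

    The path template is a hypothetical naming scheme, not the authors'.
    """
    frames = []
    for city in CITIES:
        df = pd.read_csv(path_template.format(city=city.replace(" ", "_")))
        df["city"] = city  # assumed: city is added while combining the files
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


def drop_duplicate_postings(df):
    """Keep one row per (job_title, job_description, city) combination."""
    return df.drop_duplicates(
        subset=["job_title", "job_description", "city"]
    ).reset_index(drop=True)
```

Calling `drop_duplicate_postings(load_city_files())` would then give the combined, deduplicated dataset, provided the files exist under the assumed names.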

[Figure: the Indeed job postings dataset]

Related article: Data Cleaning in Python: the Ultimate Guide (2020)

Step #2: Forming the Lists of Keywords

Before searching in the job descriptions, we need lists of keywords that represent the tools/skills/degrees.

For this analysis, we use a simple approach to forming the lists. The lists are based on our judgment and the content of the job postings. You may use more advanced approaches if the task is more complicated than this.

For the tools, we start with a list based on our own knowledge of data science. We know that popular tools for data scientists include Python, R, Hadoop, Spark, and more. Because we know the field reasonably well, this initial list already covers many of the tools mentioned in the job postings.

Then we look at random job postings and add tools that are not on the list yet. Often these new keywords remind us to add other related tools as well.

After this process, we have a keyword list that covers most of the tools mentioned in the job postings.

Next, we separate the keywords into a single-word list and a multi-word list, because the two kinds must be matched to the job descriptions in different ways.

A multi-word keyword is usually distinctive, so a simple sub-string match is enough to identify it in a job description.

A single-word keyword is harder. For example, “c” refers to the C programming language in our analysis, but the letter “c” also appears inside many words, such as “can” and “clustering”. We need to process the job descriptions further (through tokenization) so that “c” matches only when it appears as a standalone word.

Below are our lists of keywords for tools coded in Python.
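
For illustration only, a shortened version of such lists might look like the sketch below. It is not the authors' actual list: python, r, hadoop, spark, c, and sql are mentioned in this article, while the two multi-word entries are assumptions added to show the split.

```python
# Illustrative subset; "power bi" and "google cloud" are assumed entries.
tool_keywords = ["python", "r", "hadoop", "spark", "c", "sql",
                 "power bi", "google cloud"]

# Single-word keywords are later matched against tokens, so a set is handy.
tool_keywords_single = {kw for kw in tool_keywords if " " not in kw}

# Multi-word keywords are later matched as sub-strings.
tool_keywords_multi = [kw for kw in tool_keywords if " " in kw]

print(tool_keywords_multi)  # ['power bi', 'google cloud']
```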

We get lists of keywords for skills by following a similar process as tools.

For education level, we use a different procedure.

Because we are looking for the minimum required education level, we need a numeric value to rank the education degree. For example, we use 1 to represent “bachelor” or “undergraduate”, 2 to represent “master” or “graduate”, and so on.

In this way, we have a ranking of degrees by numbers from 1 to 4. The higher the number, the higher the education level.
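
One simple way to hold such a ranking is a dictionary from degree keyword to rank, sketched below. Ranks 1 and 2 follow the text; the article does not spell out the higher levels here, so the "phd" entry is an assumption for illustration.

```python
# Degree keyword -> rank. Higher rank means higher education level.
degree_rank = {
    "bachelor": 1,
    "undergraduate": 1,
    "master": 2,
    "graduate": 2,
    "phd": 3,  # assumed entry, not taken from the article
}

# Numeric ranks make degrees directly comparable.
print(degree_rank["master"] > degree_rank["bachelor"])  # True
```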

Step #3: Streamlining the Job Descriptions using NLP Techniques

In this step, we streamline the job description text so that it is easier for programs to process and more efficient to match against the lists of keywords.

The job_description feature in our dataset looks like this.

[Figure: job description text]

Tokenizing the Job Descriptions

Tokenization is the process of splitting a text string into smaller units (tokens), here words. It is necessary because matching whole words requires knowing where each word begins and ends.

We split each job description into tokens (words) with the word_tokenize function from NLTK, which separates words at whitespace and also splits off punctuation.

After this process, the job description text string is partitioned into tokens (words) as below, which programs can read and process more easily.

For instance, the single-word keyword “c” can only match with tokens (words) “c”, rather than with other words “can” or “clustering”.
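
The difference between sub-string matching and token matching can be seen in a small sketch. The example sentences and the `tokenize` helper are made up for illustration, and `preserve_line=True` is used only so that the snippet runs without downloading NLTK's sentence-splitting model; the authors' own call may differ.

```python
from nltk.tokenize import word_tokenize


def tokenize(text):
    """Split text into lowercase word tokens.

    preserve_line=True skips sentence splitting, which is fine for
    these one-sentence examples.
    """
    return [token.lower() for token in word_tokenize(text, preserve_line=True)]


# Sub-string matching gives a false positive: "can" contains the letter "c".
print("c" in "we can do clustering")               # True

# Token matching succeeds only when "c" stands alone as a word.
print("c" in tokenize("We can use C and Python"))  # True
print("c" in tokenize("We can do clustering"))     # False
```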

[Figure: job description text tokens]

Please read on for the Python code. We combine tokenization with the next few procedures together.

Parts of Speech (POS) Tagging the Job Descriptions

The job descriptions are often long. We want to keep the words that are informative for our analysis while filtering out others. We use POS tagging to achieve this.

POS tagging is an NLP method of labeling each word as a noun, adjective, verb, etc. Wikipedia explains it well:

POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Thanks to NLTK, we can use such a tagger in Python.

By applying this technique to the lists of keywords, we can find the tags that are relevant to our analysis.

Below, we POS tag the list of keywords for tools as a demonstration.

Different combinations of letters represent the tags. For instance, NN stands for a singular noun such as “python”, and JJ stands for an adjective such as “big”. The full list of representations is here.

As we can see, the tagger is not perfect. For example, “sql” is tagged as “JJ” (adjective). But it is still good enough to help us filter for useful words.

[Figure: job description text tags]

We use the tags found on the keywords as a filter for the job descriptions: we keep only the words from the job descriptions whose tags also appear among the keyword tags. For example, we keep the words tagged “NN” or “JJ”. This filters out words such as “the” and “then”, which are not informative for our analysis.
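
The filtering idea can be sketched as follows. The `filter_by_tags` helper is hypothetical, and the (word, tag) pairs are written by hand in the format that `nltk.pos_tag` returns, so they illustrate the filter rather than reproduce real tagger output.

```python
def filter_by_tags(tagged_tokens, allowed_tags):
    """Keep the words whose POS tag is in allowed_tags."""
    return [word for word, tag in tagged_tokens if tag in allowed_tags]


# Hand-written (word, tag) pairs, not actual tagger output.
tagged_description = [
    ("the", "DT"), ("role", "NN"), ("then", "RB"),
    ("requires", "VBZ"), ("big", "JJ"), ("python", "NN"),
]

# Tags seen on the keywords; only NN and JJ are named in the text.
keyword_tags = {"NN", "JJ"}

print(filter_by_tags(tagged_description, keyword_tags))
# ['role', 'big', 'python']
```

In practice, the tagged pairs would come from calling `nltk.pos_tag` on the tokens of each job description, which requires NLTK's tagger model to be downloaded first.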

At this stage, we have streamlined job descriptions that are tokenized and shortened.

Stay patient! We only need to process them a little more.

Step #4: Final Processing of the Keywords and the Job Descriptions

In this step, we process both the lists of keywords and the job descriptions further.

Stemming the Words

Word stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form — generally a written word form.

The stemming process allows computer programs to recognize words that share a stem even when they look different, so we can match words as long as they have the same stem. For instance, the words “models” and “modeling” both have the stem “model”.

We stem both the lists of keywords and the streamlined job descriptions.

Lowercasing the Words

Lastly, we standardize all the words by lowercasing them. We only lowercase the job descriptions since the lists of keywords are built in lowercase.
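
A minimal sketch of stemming and lowercasing is shown below. The article does not say which stemmer is used, so NLTK's `PorterStemmer` is an assumption here, and the two helper functions are hypothetical names.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # assumed choice of stemmer


def stem_tokens(tokens):
    """Lowercase and stem each token of a job description."""
    return [stemmer.stem(token.lower()) for token in tokens]


def stem_keyword(keyword):
    """Stem every word of a (possibly multi-word) keyword."""
    return " ".join(stemmer.stem(word) for word in keyword.split())


print(stem_tokens(["Models", "modeling"]))  # ['model', 'model']
```

Stemming the keywords with the same stemmer as the job descriptions matters more than the particular stemmer chosen, since the two sides only need to agree with each other.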

As promised in the previous sections, the Python code for the procedures in Steps #3 and #4 is below.

Now only the words (tokens) in the job descriptions that are related to our analysis remain. An example of a final job description is below.

[Figure: final job description dataset]

Finally, we are ready for keyword matching!

Step #5: Matching the Keywords and the Job Descriptions

To see if a job description mentions specific keywords, we match the lists of keywords and the final streamlined job descriptions.

Tools/Skills

As you may recall, we built two types of keyword lists: the single-word list and the multi-word list. For the single-word keywords, we match them against the tokens of each job description using set intersection. For the multi-word keywords, we check whether they are sub-strings of the job descriptions.
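
The two matching strategies can be sketched in one small function. The `match_keywords` helper and the example tokens are hypothetical; the tokens are assumed to be already filtered, stemmed, and lowercased.

```python
def match_keywords(tokens, single_keywords, multi_keywords):
    """Return the set of keywords found in one tokenized job description."""
    # Single-word keywords: exact token match via set intersection.
    found = set(tokens) & set(single_keywords)
    # Multi-word keywords: sub-string check on the re-joined tokens.
    text = " ".join(tokens)
    found.update(kw for kw in multi_keywords if kw in text)
    return found


tokens = ["python", "c", "machin", "learn", "cluster"]
matched = match_keywords(tokens, {"python", "c", "r"}, ["machin learn"])
print(sorted(matched))  # ['c', 'machin learn', 'python']
```

Note that "r" is not matched even though the letter occurs inside "cluster", because single-word keywords are compared against whole tokens only.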

Education

For the education level, we use the same method as for tools/skills to match keywords, but we only keep track of the minimum level.

For example, when the keywords “bachelor” and “master” both exist in a job description, the bachelor’s degree is the minimum education required for this job.
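
Keeping only the minimum level could be sketched as below. The `minimum_education` helper is a hypothetical name, the mapping shows only the two levels spelled out in this article, and in a real pipeline the keys would need to be stemmed the same way as the job descriptions.

```python
def minimum_education(tokens, degree_rank):
    """Return the lowest degree rank mentioned, or None if no degree appears."""
    ranks = [degree_rank[token] for token in set(tokens) if token in degree_rank]
    return min(ranks) if ranks else None


# Ranks 1 and 2 follow the article; further levels would be added the same way.
degree_rank = {"bachelor": 1, "undergraduate": 1, "master": 2, "graduate": 2}

print(minimum_education(["bachelor", "or", "master", "degre"], degree_rank))  # 1
print(minimum_education(["python", "sql"], degree_rank))                      # None
```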

The Python code with more details is below.

Step #6: Visualizing the Results

We summarize the results with bar charts.

For each keyword of tools/skills/education levels, we count the number of job descriptions that match it, and we express that count as a percentage of all the job descriptions.

For the lists of tools and skills, we are only presenting the top 50 most popular ones. For the education level, we summarize them according to the minimum level required.
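
The counting and charting could be sketched as follows. This is not the authors' code: the function names are hypothetical, the input is assumed to be one set of matched keywords per job description, and matplotlib is an assumed choice of plotting library.

```python
from collections import Counter


def keyword_frequencies(matches_per_posting, top_n=50):
    """Count postings per keyword and express each count as a percentage.

    matches_per_posting: one set of matched keywords per job description.
    Returns (keyword, count, percentage) tuples, most frequent first.
    """
    total = len(matches_per_posting)
    counts = Counter(kw for matched in matches_per_posting for kw in matched)
    return [(kw, n, 100 * n / total) for kw, n in counts.most_common(top_n)]


def plot_frequencies(rows, title):
    """Draw a horizontal bar chart; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    keywords = [kw for kw, _, _ in rows][::-1]  # most frequent at the top
    percentages = [pct for _, _, pct in rows][::-1]
    plt.barh(keywords, percentages)
    plt.xlabel("% of job descriptions")
    plt.title(title)
    plt.tight_layout()
    plt.show()


matches = [{"python", "sql"}, {"python"}, {"r", "python"}, set()]
rows = keyword_frequencies(matches)
print(rows[0])  # ('python', 3, 75.0)
```

Because each posting contributes a set, a keyword is counted at most once per job description, which matches the idea of counting postings rather than mentions.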

The detailed Python code is below.

Top Tools In-Demand

Top 50 Tools for Data Scientists

Top Skills In-Demand

Top 50 Skills for Data Scientists

Minimum Education Required

Minimum Education Level for Data Scientists

We did it!

We hope you found this article helpful. Leave a comment to let us know your thoughts.

Again, if you want to see the detailed results, read What are the In-Demand Skills for Data Scientists in 2020.
