In this guide, we’ll discuss the process of building a data science pipeline in practice.
Data science extracts valuable insights and knowledge from data. As data analysts or data scientists, we use data science skills to deliver products or services that solve actual business problems. Thus, it’s critical to implement a well-planned data science pipeline to enhance the quality of the final product.
Following this tutorial, you’ll learn, step by step, the pipeline behind a successful data science project.
If you are looking to apply machine learning or data science in the industry, this guide will help you better understand what to expect.
Let’s jump in!
- What is a Data Science Pipeline?
- Pipeline prerequisite: Understand the Business Needs
- Step #1: Collect the Data
- Step #2: Clean and Explore the Data
- Step #3: Research Methodologies
- Step #4: Build and Evaluate the Model
- Step #5: Present the Results
- Step #6: Launch into Production
- Step #7: Maintain the Product
What is a Data Science Pipeline?
In this tutorial, we focus on data science tasks for data analysts or data scientists. The data science pipeline is a collection of connected tasks that aims to deliver an insightful data science product or service to the end-users. The responsibilities include collecting, cleaning, exploring, modeling, and interpreting the data, along with other processes involved in launching the product. The delivered end product could be:
- A machine learning model that recommends products to the customers.
- Reports on sales and expenses delivered periodically to decision-makers.
- Insights on customer behavior when using the products.
Although these products have different targets and final forms, the processes that generate them follow similar paths in the early stages.
Data science professionals need to understand and follow the data science pipeline. The pipeline involves both technical and non-technical issues that could arise when building the data science product. A well-planned pipeline will help set expectations and reduce the number of problems, hence enhancing the quality of the final products.
Below is a summary of the workflow of a data science pipeline.
Let’s go through them one-by-one.
Pipeline prerequisite: Understand the Business Needs
Before we start any projects, we should always ask:
What is the Question we are trying to answer?
The end product of a data science project should always aim to solve a business problem. So it’s essential to understand the business needs. Asking the right question sets up the rest of the path. Otherwise, you’ll be in the dark about what to do and how to do it.
In this initial stage, you’ll need to communicate with the end-users to understand their thoughts and needs.
Every company or team is different.
Your business partners may come to you with questions in mind, or you may need to discover the problems yourself.
Some companies have a flat organizational hierarchy, which makes it easier to communicate among different parties. Others are more complicated, and you might have to communicate indirectly through your supervisors or intermediate teams.
After these conversations, you may be able to convert the business problem into a data science project. You should have found answers to questions such as:
- What will the final product be used for?
- Which types of analytic methods could be used? Is this a problem that data science can help with?
- What are the KPIs that the new product can improve?
- Can this product help with making or saving money?
- What are the constraints of the production environment?
Although ‘understand the business needs’ is listed as the prerequisite, in practice, you’ll need to communicate with the end-users throughout the entire project. It’s not possible to capture all the requirements in one meeting, and things can change while you work on the product.
Commonly Required Skills: Communication, Curiosity
Step #1: Collect the Data
After the initial stage, you should know the data necessary to support the project.
It’s time to investigate and collect them.
Whether this step is easy or complicated depends on data availability. If you are lucky enough to have the data in an internal place with easy access, it could be a quick query. Yet many times, this step is time-consuming because the data is scattered across different sources such as:
- spreadsheets
- existing internal databases
- publicly available data
The size and culture of the company also matter. In a small company, you might need to handle the end-to-end process yourself, including this data collection step. In a large company, where the roles are more divided, you can rely more on the IT partners’ help.
At the end of this stage, you should have compiled the data into a central location.
It’s always important to keep the business needs in mind. If the product or service has to be delivered periodically, you should plan to automate this data collection process.
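As a minimal sketch of what this compilation could look like in Python with pandas, suppose the data lives in a spreadsheet, an internal database, and a public CSV. All file names, connection strings, tables, and columns below are hypothetical placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical sources -- swap in your own files, database, and URLs.
sales = pd.read_excel("sales_2023.xlsx")                       # a team spreadsheet
engine = create_engine("postgresql://user:pass@host/mydb")     # internal database
orders = pd.read_sql("SELECT * FROM orders", engine)           # query a table
regions = pd.read_csv("https://example.com/region_stats.csv")  # public data

# Compile everything into one central dataset.
df = orders.merge(sales, on="order_id", how="left")
df = df.merge(regions, on="region", how="left")
df.to_parquet("central_dataset.parquet")
```

Wrapping a script like this in a scheduler (e.g., cron or Airflow) is one way to automate the periodic refresh.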
Commonly Required Skills: Excel, relational databases like SQL, Python, Spark, Hadoop
Further Readings:
SQL Tutorial for Beginners: Learn SQL for Data Analysis
Quick SQL Database Tutorial for Beginners
Learn Python Pandas for Data Science: Quick Tutorial
Step #2: Clean and Explore the Data
In this step, you’ll need to transform the data into a clean format so that the machine learning algorithm can learn useful information from it. Data, in general, is messy, so expect to discover issues such as missing values, outliers, and inconsistencies. This step will often take a long time as well.
Exploratory data analysis (EDA) is also needed to get to know the characteristics of the data inside and out.
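As a rough sketch, a first cleaning and EDA pass with pandas and seaborn might look like the following (the file and column names are placeholders carried over from the hypothetical dataset above):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_parquet("central_dataset.parquet")

# Spot the usual issues: missing values, outliers, inconsistency.
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics hint at outliers
df["category"] = df["category"].str.strip().str.lower()  # inconsistent labels

# Simple cleaning choices -- the right ones depend on your data and methods.
df = df.dropna(subset=["target"])                          # drop rows missing the label
df["amount"] = df["amount"].fillna(df["amount"].median())  # impute a numeric feature

# Exploratory data analysis: distributions and relationships.
sns.histplot(df["amount"])
plt.show()
sns.pairplot(df[["amount", "age", "target"]])
plt.show()
```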
Again, it’s better to keep the business needs in mind; this process may need to be automated as well.
Although this is listed as Step #2, it’s tightly integrated with the next step: the data science methodologies we are going to use. Depending on the dataset collected and the methods chosen, the procedures could be different.
Commonly Required Skills: Python
Further Readings:
Data Cleaning in Python: the Ultimate Guide
How to use Python Seaborn for Exploratory Data Analysis
Python NumPy Tutorial: Practical Basics for Data Science
Learn Python Pandas for Data Science: Quick Tutorial
Introducing Statistics for Data Science: Tutorial with Python Examples
Step #3: Research Methodologies
Next, research the methodologies that suit the business problem and the datasets, and work them out in more detail. Within this step, try to find answers to the following questions (a short benchmarking sketch follows the list):
- What models have worked well for this type of problem?
- What features might be useful?
- How would we get this model into production?
- How would we evaluate the model? What metric(s) would we use?
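To make the first and last questions concrete, one common approach is to benchmark a few standard model families with the chosen metric before committing to one. Here is a minimal sketch with scikit-learn, assuming a tabular classification problem and the placeholder dataset and columns from the earlier steps:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_parquet("central_dataset.parquet")  # hypothetical cleaned dataset
X, y = df[["amount", "age"]], df["target"]       # placeholder features and label

# Quick benchmark of candidate model families using the chosen metric.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```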
Commonly Required Skills: Machine Learning / Statistics, Python, Research
Further Reading: Machine Learning for Beginners: Overview of Algorithm Types
Step #4: Build and Evaluate the Model
You can try different models and evaluate them based on the metrics you came up with before.
Each model trained should be accurate enough to meet the business needs, but also simple enough to be put into production. For example, the model that most accurately predicts the customers’ behavior might not be used, since its complexity might slow down the entire system and hence hurt the customers’ experience. It’s critical to find a balance between usability and accuracy.
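To illustrate that trade-off, the sketch below compares a simple and a more complex model on both the evaluation metric and prediction latency. It uses synthetic stand-in data so it runs on its own:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
    model.fit(X_train, y_train)
    start = time.perf_counter()
    proba = model.predict_proba(X_test)[:, 1]  # time only the prediction step
    latency = time.perf_counter() - start
    print(type(model).__name__,
          "AUC:", round(roc_auc_score(y_test, proba), 3),
          "latency:", round(latency, 4), "s")
```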
Commonly Required Skills: Python
Further Readings: Practical Guide to Cross-Validation in Machine Learning
Hyperparameter Tuning with Python: Complete Step-by-Step Guide
8 popular Evaluation Metrics for Machine Learning Models
Step #5: Present the Results
Most of the time, either your teammates or the business partners will need to understand your work.
So it’s common to prepare presentations that are customized to the audience. You should create effective visualizations to show the insights, and speak in a language that resonates with their business goals. Don’t forget that people are drawn to stories. If you can tell a good story, people will buy into your product more readily.
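For example, a single chart framed around the business goal usually lands better than a table of raw metrics. Here is a minimal matplotlib sketch, with entirely made-up numbers for illustration:

```python
import matplotlib.pyplot as plt

# Made-up numbers purely for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
baseline = [100, 102, 101, 103]
with_model = [100, 108, 115, 121]

plt.plot(months, baseline, label="Current process")
plt.plot(months, with_model, label="With recommendation model")
plt.ylabel("Sales index")
plt.title("Projected impact of the new model")  # tie the chart to the business goal
plt.legend()
plt.show()
```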
Commonly Required Skills: Python, Tableau, Communication
Further Reading: Elegant Pitch
Step #6: Launch into Production
This is the most exciting part of the pipeline. We are finally ready to launch the product!
Yet, the process could be complicated depending on the product.
If it’s an annual report, a few scripts with some documentation would often be enough. If it’s a model that needs to act in real time on a large volume of data, it’s a lot more complicated. For example, a recommendation engine for a large website or a fraud detection system for a commercial bank are both complicated systems.
When the product is complicated, we have to streamline all the previous steps supporting the product, and add measures to monitor the data quality and model performance.
The procedure could also involve software development. We need strong software engineering practices to make the product robust and adaptable. The code should be tested to make sure it can handle unexpected situations in real life.
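As one small example of that practice, unit tests can check that the scoring code handles bad input gracefully. Everything in this sketch (the `score` function, its fields, and its placeholder logic) is hypothetical:

```python
import pytest

def score(record: dict) -> float:
    """Hypothetical scoring function wrapping the trained model."""
    if "amount" not in record:
        raise ValueError("missing required field: amount")
    return min(1.0, max(0.0, record["amount"] / 1000))  # placeholder logic

def test_score_handles_missing_field():
    # Real-life inputs are messy; the code should fail loudly and clearly.
    with pytest.raises(ValueError):
        score({"customer_id": 42})

def test_score_stays_in_range():
    assert 0.0 <= score({"amount": 250}) <= 1.0
```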
Commonly Required Skills: Software Engineering, might also need Docker, Kubernetes, Cloud services, or Linux
Step #7: Maintain the Product
After the product is implemented, it’s also necessary to keep monitoring its performance.
As mentioned earlier, the product might need to be updated regularly with new feeds of data. Or, as time goes by, if the performance is not as expected, you may need to adjust, or even retire, the product.
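A bare-bones monitoring check might recompute the model’s metric on freshly labeled data and flag degradation. The threshold and names here are assumptions to adapt per project:

```python
from sklearn.metrics import roc_auc_score

ALERT_THRESHOLD = 0.75  # assumed acceptable level; set this per project

def check_model_health(y_true, y_scores) -> bool:
    """Return True if performance on recent data is still acceptable."""
    auc = roc_auc_score(y_true, y_scores)
    if auc < ALERT_THRESHOLD:
        print(f"ALERT: AUC dropped to {auc:.3f} -- consider retraining or retiring.")
        return False
    print(f"Model healthy: AUC = {auc:.3f}")
    return True

# Example: run on a recent batch of labeled outcomes from a scheduled job.
# check_model_health(recent_labels, recent_scores)
```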
Great stuff!
These are all the general steps of a data science or machine learning pipeline. As you can see, there are many things a data analyst or data scientist needs to handle besides machine learning and coding.
We are solving real business problems!
Hopefully, you now have a better idea of how data science projects are carried out in real life.
Leave a comment for any questions you may have or anything else!