This is an ultimate SQL tutorial to learn SQL for data analysis (from beginner to advanced), with examples for practice.
SQL is an essential skill that’s highly in-demand for data science/machine learning. The best way to learn SQL fast is by practicing in a real professional environment (relational database management system).
Following this tutorial, you’ll discover:
- How to learn SQL on a sample database on SQL server, for FREE.
- How to write SELECT statements (all you need to know), SQL joins.
- How to write advanced SQL subqueries, window functions.
- Lots More.
Whether you are a beginner of SQL, or you want a cheat sheet for interviews of SQL, you’ll love this practical guide.
Let’s get started!
Related article: What are the In-Demand Skills for Data Scientists in 2020
Prerequisite: FAQs and Install SQL Server with Sample Database
First of all, if you still have questions such as:
- What is SQL?
- Why learn SQL? Is SQL required for data science?
- Is SQL easy to learn? How long does it take to learn SQL?
Please find answers to these FAQs at the FAQ section of this article.
To practice SQL in a professional environment, you may install a Microsoft SQL server and a sample database (AdventureWorks) for FREE.
Please follow the guide below with step-by-step instructions.
If this is not an option for you, you can still follow along with this tutorial for practice, either by typing the queries on Notepad or through the W3Schools SQL simulator.
But you won’t get the benefits of querying on sample tables that are similar to real world practice.
We’ll start with the basics of SQL queries on one table. Then move on to SQL queries for multiple tables. And we’ll finish off with more complicated subqueries and window functions.
We’ll start by introducing the syntax for each SQL query statement/clause. And to make the learning more practical, there are also examples based on the sample database.
Let’s jump in.
SELECT Statement (*, DISTINCT, INTO)
Within the sample database AdventureWorks, there are many tables that are related to each other.
This is how a real world relational database looks like.
Every table is composed of columns and rows.
For example, the HumanResources.Employee table below has columns such as Business EntityID, NationalIDNumber, and LoginID. Every row/record represents a unique employee in the company.
All the SQL queries use the SELECT keyword to extract information from the database.
The syntax of the most basic SELECT statement is below.
-- "--" is used to put comments in Microsoft SQL server -- SQL keywords are NOT case sensitive, but we'll use uppercase -- select everything from a table SELECT * From table_name -- select specific columns SELECT column1, column2, ..., columnN FROM table_name -- select distinct combinations of values for specific columns SELECT DISTINCT column1, column2, ..., columnN FROM table_name -- select copies the results table from the queries into a new table -- this is useful when we want to breakdown the data extraction queries SELECT * INTO newtable_name FROM oldtable_name
Question: What are all the employees’ national IDs, job titles, and birth dates based on the Employee table below?
And what are the different job titles?
SELECT NationalIDNumber,JobTitle,BirthDate FROM AdventureWorks.HumanResources.Employee SELECT DISTINCT JobTitle FROM AdventureWorks.HumanResources.Employee
WHERE (LIKE, IN, BETWEEN; AND, OR, NOT)
What if we want to see records satisfying certain conditions?
We can filter using the WHERE clause.
SELECT column1, column2, ... , columnN FROM table_name WHERE <condition>
Within the WHERE clause, we can use operators:
- <>, which is not equal.
- LIKE, which can do a wildcard search for a specific pattern.
There are two wildcard holders:
– %: used to represent >= 0 characters.
– _ : represents a single character.
For example, column1 LIKE ‘%data%’ (search for records with column1’s value contains the word ‘data’).
- IN, which can specify a set of possible values.
For example, column1 IN (value1, value2, …)
- BETWEEN, which can specify a range of values.
For example, column1 BETWEEN value1 AND value2.
We can use logic operators AND, OR, NOT to combine statements as well.
Question: Find a list of employees that are salaried, with job title starts with ‘Sales’ and vacation hours over 15.
SELECT * FROM AdventureWorks.HumanResources.Employee WHERE SalariedFlag = 1 --the % is used to represent >= 0 characters; _ represents a single character; we can use it to form different patterns AND JobTitle LIKE 'Sales%' AND VacationHours > 15
What if we want to sort the result by specific columns?
We can use the ORDER BY keyword.
SELECT * FROM table_name ORDER BY column1, column2, ... , columnN ASC/DESC -- ASC for ascending order, DESC for descending order
Question: List all the sales orders that are online and sort them by sub total from largest to smallest.
SELECT * FROM AdventureWorks.Sales.SalesOrderHeader WHERE OnlineOrderFlag = 1 ORDER BY SubTotal DESC
What if we only want to look at a top list from the table?
We can use the TOP Clause.
This is also often used when we just want to take a look at the table.
SELECT TOP <N> column1, column2, ... , columnN FROM table_name
Question: List the top 10 salespeople by Sales YTD.
SELECT TOP 10 * FROM AdventureWorks.Sales.SalesPerson ORDER BY SalesYTD DESC
What if we want summary statistics for the tables?
We can use the aggregation functions.
SELECT AggFunction0(column1), AggFunction1(column2), ... FROM table_name WHERE <condition>
The popular functions are:
- MIN: minimum of the values of the column
- MAX: maximum of the values of the column
- AVG: average of the values of the column
- COUNT: number of records of the column
- SUM: sum of values of the column
Question: What are the minimum, maximum, and average ages of all employees.
SELECT -- DATEDIFF is also a function, which returns the difference between two date values -- we can use the AS to name variable name, which is not mandatory MIN(DATEDIFF(year, BirthDate, '2020-04-30')) AS min_age, MAX(DATEDIFF(year, BirthDate, '2020-04-30')) AS max_age, AVG(DATEDIFF(year, BirthDate, '2020-04-30')) AS avg_age FROM AdventureWorks.HumanResources.Employee
GROUP BY (HAVING)
How to get summary statistics for different groups in the data?
We often use the Aggregation functions together with GROUP BY.
Select <groupby columns>, AggFunction0(ValueColumn0), AggFunction1(ValueColumn0), ... FROM table_name GROUP BY <groupby columns> HAVING <conditions> -- we don't need to put <groupby columns> in the columns section, but it is almost always included.
We can also use HAVING with the GROUP BY queries, which is similar to the WHERE statement. Since WHERE can’t be used to filter conditions for aggregate functions.
Question: How much net profit (Sales – Cost) did the different Sales Territories (CountryRegionCode) make last year, from largest to smallest?
SELECT CountryRegionCode, SUM(SalesLastYear - CostLastYear) as ProfitLastYear FROM AdventureWorks.Sales.SalesTerritory GROUP BY CountryRegionCode --we can also use HAVING to specify a threshold of net profit, eg., over 5000000 --HAVING SUM(SalesLastYear - CostLastYear) > 5000000 ORDER BY SUM(SalesLastYear - CostLastYear) DESC
Related article: How to GroupBy with Python Pandas Like a Boss
We can also use the CASE statement, which is similar to IF-ELSE statements in other programming languages.
This is especially useful when we want to customize the groups and provide summary statistics. Check out the example for details.
SELECT CASE WHEN <condition0> THEN <value0> WHEN <condition1> THEN <value1> ... ELSE <value> END as <columnName> ... FROM table_name ...
The CASE statement goes through conditions and returns a value in order. WHEN the condition0 is met, it returns value0. Otherwise, it moves on to the next WHEN condition, and so on.
Once a condition is true, it will stop moving down and return the result. If none of the conditions are true, it returns the value under the ELSE. If there is no ELSE as well, it returns NULL.
Question: Count how many Employees are between 0 and 20 years old, 20 and 40 years old, 40 and 60, 60+.
SELECT CASE WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 20 THEN '[0 - 20) years old' WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 40 THEN '[20 - 40) years old' WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 60 THEN '[40 - 60) years old' ELSE '60+ years old' END as age_cat, COUNT(*) as cnt FROM AdventureWorks.HumanResources.Employee GROUP BY CASE WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 20 THEN '[0 - 20) years old' WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 40 THEN '[20 - 40) years old' WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 60 THEN '[40 - 60) years old' ELSE '60+ years old' END
So far, we’ve been extracting information from one table. While most relational databases in production have information separated into different tables.
Each table usually contains information for a specific purpose. For example, we have sales tables, customer tables, product tables in our sample database.
If we want to combine data from multiple tables easily, we must use JOIN.
Imagine we have table 1 and table 2. There are 4 main ways we want data from them (source: W3 school):
- (INNER) JOIN: Returns records that have matching values in both tables
- LEFT (OUTER) JOIN: Returns all records from the left table, and only the matched records from the right table
- RIGHT (OUTER) JOIN: Returns all records from the right table, and only the matched records from the left table
- FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
It is easier to understand with the Venn diagrams below:
When using JOINS, we must use the columns of the same definition from the two tables as identifiers. This is how we link the two tables together.
SELECT table1.column1, table2.column1, ... FROM table 1 AS A (INNER/LEFT/RIGHT/FULL) JOIN table 2 AS B ON A.Key = B.Key --key identical variables to join them together, they might have different column names
Question: We have two tables HumanResources.Employee and Person.Person. Both tables have key columns BusinessEntityID.
How can we get the loginID, job title, birth date, first name, last name of all the employees?
SELECT emp.LoginID, emp.JobTitle, emp.BirthDate, contact.FirstName, contact.LastName -- HumanResources.Employee emp is the same as HumanResources.Employee AS emp, we can ignore the AS FROM AdventureWorks.HumanResources.Employee emp -- we use left join since we want all the employees, which is from the left table LEFT JOIN AdventureWorks.Person.Person contact on (emp.BusinessEntityID = contact.BusinessEntityID)
The LEFT JOIN will return all the rows from the HumanResources.Employee table, no matter whether there’s a match in the Person.Person table.
We can also join multiple tables at once. Let’s see how it works.
Question: List the sales amount by products for shipment to city Bothell.
It sounds like a simple question. But the information is from multiple tables:
- The SalesOrderHeader has the orders basic information.
- The Address has the city information.
- The SalesOrderDetail has the LineTotal, which we assume is the amount of sales.
- The Product has the product information.
How do we piece them together?
Luckily, these tables have common keys we can join them together.
SELECT adds.City, prod.name, SUM(sod.LineTotal) as product_sales FROM -- to simplify, we are using all inner join to get the information that are available in all tables AdventureWorks.Sales.SalesOrderHeader soh INNER JOIN AdventureWorks.Person.Address adds on (soh.ShipToAddressID = adds.AddressID) INNER JOIN AdventureWorks.Sales.SalesOrderDetail sod on (soh.SalesOrderID = sod.SalesOrderID) INNER JOIN AdventureWorks.Production.Product prod on (sod.ProductID = prod.ProductID) WHERE adds.City = 'Bothell' GROUP BY adds.City, prod.name ORDER BY SUM(sod.LineTotal) DESC
SQL can execute nested queries. A subquery is one SELECT statement that is embedded within another SELECT statement. The innermost subquery gets executed first, then the next level.
This is when SQL starts getting more complicated.
A good way of solving a complex problem is to break down into smaller questions/queries and piece them together.
The syntax is the same as the basic SELECT statement.
Question: Find all the sales details of the customer that spent the most money.
As mentioned, it’s good to breakdown the problem.
First, we need to know which tables have the information we need:
- Is it all in one table?
- Or it’s in multiple tables that I’ll need to JOIN.
Then, we might need to write subqueries first. For example, I wrote the subquery to grab the TOP 1 customer that spent the most money.
Then, we can piece things together.
SELECT sod.* FROM AdventureWorks.Sales.SalesOrderDetail sod INNER JOIN AdventureWorks.Sales.SalesOrderHeader soh1 ON (sod.SalesOrderID = soh1.SalesOrderID) WHERE soh1.CustomerID IN -- the below subquery looks for the top 1 customer by totaldue amount (SELECT TOP 1 soh.CustomerID FROM AdventureWorks.Sales.SalesOrderHeader soh GROUP BY soh.CustomerID ORDER BY SUM(soh.TotalDue) DESC)
SQL Window Functions
Lastly, let’s talk about the SQL window functions.
We haven’t had to use window functions much during our data analysis with SQL. Most of the time, you can get away using other queries without it. But it is still handy.
Determines the partitioning and ordering of a rowset before the associated window function is applied. That is, the OVER clause defines a window or user-specified set of rows within a query result set.
A window function then computes a value for each row in the window.
You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.OVER clause documentation
OVER ( [ <PARTITION BY clause> ] [ <ORDER BY clause> ] [ <ROW or RANGE clause> ] )
For details about syntax, read the OVER clause documentation.
It is easier to explain with examples. You may read the examples first and use the documentation to understand the full picture.
Question: Find the latest sales order for each credit card ID within the table below.
SELECT * FROM ( --get a table with sales and the rank by orderdate for each creditcardID SELECT c.*, RANK () OVER ( PARTITION BY c.CreditCardID --similar to group by ORDER BY c.OrderDate DESC --order by orderdate to get the RANK() ) AS OrderDate_rank FROM AdventureWorks.Sales.SalesOrderHeader c ) windows_function WHERE windows_function.OrderDate_rank = 1 --get the number 1 rank based on the most recent orderdate
Question: Find the top 3 customers that spent the most for the product ID of 765.
Again, it’s better to breakdown the complex problem into smaller questions first:
- Step #1: How do we get the customers and their spending who have purchased productID = 765
- Step #2: How do we get the rank of customers by spending
- Step #3: How do we get the top rank customers
When reading and testing the code, you can also run the subqueries that solved the first two questions to check.
--Step #3: get the top rank customer using WHERE SELECT * FROM ( --Step #2: get the rank of top customers by spending using a window function SELECT cps.ProductID, cps.CustomerID, cps.product_spend, RANK () OVER ( PARTITION BY cps.ProductID ORDER BY cps.product_spend DESC ) AS customer_rank FROM --Step #1: get the customers and their spending who have purchased productID = 765 by joining multiple tables (SELECT prod.ProductID, soh.CustomerID, SUM(sod.LineTotal) as product_spend FROM AdventureWorks.Production.Product prod LEFT JOIN AdventureWorks.Sales.SalesOrderDetail sod ON (prod.ProductID = sod.ProductID) LEFT JOIN AdventureWorks.Sales.SalesOrderHeader soh ON (soh.SalesOrderID = sod.SalesOrderID) WHERE prod.ProductID = 765 GROUP BY prod.ProductID,soh.CustomerID) cps ) a WHERE a.customer_rank <= 3
Related article: How to GroupBy with Python Pandas Like a Boss
Window functions are similar to the transform functions in Python Pandas.
Always Breakdown the Complicated Problems
This tip is already mentioned in the article. But it is critical, so we want to reiterate.
It is easier to breakdown the problem into smaller questions. And write queries to solve the smaller questions before combining them.
SQL Queries Order
SQL evaluates the queries clauses in the order of:
- FROM (JOIN)
- GROUP BY
- ORDER BY
Not every query needs to have all of the above. When it does, be aware of the order!
That’s it for this SQL tutorial for beginners! You should have mastered all you need to learn about SQL queries for data analysis!
To learn about SQL database management commands, check out Quick SQL Database Tutorial for Beginners.
Leave a comment for any questions you may have or anything else.
Related “Break into Data Science” articles:
Check out this for a detailed review of resources online, including courses, books, free tutorials, portfolios building, and more.
What are the In-Demand Skills for Data Scientists in 2020
In this article, we extracted information from Indeed job postings. You’ll see why we recommend setting up the data science environment with SQL and Python.