SQL Tutorial for Beginners: Learn SQL for Data Analysis
 From zero to HERO

Lianne & Justin

Lianne & Justin

SQL Tutorial for Beginners Learn SQL for Data Analysis from zero to hero superman
Source: Unsplash

This is an ultimate SQL tutorial to learn SQL for data analysis (from beginner to advanced), with examples for practice.

SQL is an essential skill that’s highly in-demand for data science/machine learning. The best way to learn SQL fast is by practicing in a real professional environment (relational database management system).

Following this tutorial, you’ll discover:

  • How to learn SQL on a sample database on SQL server, for FREE.
  • How to write SELECT statements (all you need to know), SQL joins.
  • How to write advanced SQL subqueries, window functions.
  • Lots More.

Whether you are a beginner of SQL, or you want a cheat sheet for interviews of SQL, you’ll love this practical guide.

Let’s get started!

Related article: What are the In-Demand Skills for Data Scientists in 2020



Prerequisite: FAQs and Install SQL Server with Sample Database

First of all, if you still have questions such as:

  • What is SQL?
  • Why learn SQL? Is SQL required for data science?
  • Is SQL easy to learn? How long does it take to learn SQL?

Please find answers to these FAQs at the FAQ section of this article.

To practice SQL in a professional environment, you may install a Microsoft SQL server and a sample database (AdventureWorks) for FREE.

Please follow the guide below with step-by-step instructions.

If this is not an option for you, you can still follow along with this tutorial for practice, either by typing the queries on Notepad or through the W3Schools SQL simulator.
But you won’t get the benefits of querying on sample tables that are similar to real world practice.


We’ll start with the basics of SQL queries on one table. Then move on to SQL queries for multiple tables. And we’ll finish off with more complicated subqueries and window functions.

We’ll start by introducing the syntax for each SQL query statement/clause. And to make the learning more practical, there are also examples based on the sample database.

Let’s jump in.


SELECT Statement (*, DISTINCT, INTO)

Within the sample database AdventureWorks, there are many tables that are related to each other.

This is how a real world relational database looks like.

sql server sample database adventureworks diagrams
Source: AdventureWorks database diagram

Every table is composed of columns and rows.

For example, the HumanResources.Employee table below has columns such as Business EntityID, NationalIDNumber, and LoginID. Every row/record represents a unique employee in the company.

All the SQL queries use the SELECT keyword to extract information from the database.

The syntax of the most basic SELECT statement is below.

Syntax

-- "--" is used to put comments in Microsoft SQL server
-- SQL keywords are NOT case sensitive, but we'll use uppercase
-- select everything from a table
SELECT * From table_name
-- select specific columns
SELECT column1, column2, ..., columnN FROM table_name
-- select distinct combinations of values for specific columns
SELECT DISTINCT column1, column2, ..., columnN FROM table_name
-- select copies the results table from the queries into a new table
-- this is useful when we want to breakdown the data extraction queries
SELECT * INTO newtable_name FROM oldtable_name

Examples

Question: What are all the employees’ national IDs, job titles, and birth dates based on the Employee table below?
And what are the different job titles?

sql server sample database adventureworks diagrams employee table
Source: AdventureWorks database diagram

Answer:

SELECT NationalIDNumber,JobTitle,BirthDate 
FROM AdventureWorks.HumanResources.Employee
SELECT DISTINCT JobTitle
FROM AdventureWorks.HumanResources.Employee

WHERE (LIKE, IN, BETWEEN; AND, OR, NOT)

What if we want to see records satisfying certain conditions?

We can filter using the WHERE clause.

Syntax

SELECT column1, column2, ... , columnN FROM table_name
WHERE <condition>

Within the WHERE clause, we can use operators:

  • =
  • <
  • >
  • >=
  • <=
  • <>, which is not equal.
  • LIKE, which can do a wildcard search for a specific pattern.
    There are two wildcard holders:
    – %: used to represent >= 0 characters.
    – _ : represents a single character.
    For example, column1 LIKE ‘%data%’ (search for records with column1’s value contains the word ‘data’).
  • IN, which can specify a set of possible values.
    For example, column1 IN (value1, value2, …)
  • BETWEEN, which can specify a range of values.
    For example, column1 BETWEEN value1 AND value2.

We can use logic operators AND, OR, NOT to combine statements as well.

Example

Question: Find a list of employees that are salaried, with job title starts with ‘Sales’ and vacation hours over 15.

sql server sample database adventureworks diagrams employee table
Source: AdventureWorks database diagram

Answer:

SELECT * 
FROM AdventureWorks.HumanResources.Employee 
WHERE SalariedFlag = 1
--the % is used to represent >= 0 characters; _ represents a single character; we can use it to form different patterns 
AND JobTitle LIKE 'Sales%' 
AND VacationHours > 15

ORDER BY

What if we want to sort the result by specific columns?

We can use the ORDER BY keyword.

Syntax

SELECT * 
FROM table_name
ORDER BY column1, column2, ... , columnN  ASC/DESC
-- ASC for ascending order, DESC for descending order

Example

Question: List all the sales orders that are online and sort them by sub total from largest to smallest.

sql server sample database adventureworks diagrams salesorderheader
Source: AdventureWorks database diagram

Answer:

SELECT * 
FROM AdventureWorks.Sales.SalesOrderHeader 
WHERE OnlineOrderFlag = 1
ORDER BY SubTotal DESC

SELECT TOP

What if we only want to look at a top list from the table?

We can use the TOP Clause.

This is also often used when we just want to take a look at the table.

Syntax

SELECT TOP <N> column1, column2, ... , columnN 
FROM table_name

Example

Question: List the top 10 salespeople by Sales YTD.

sql server sample database adventureworks diagrams person table
Source: AdventureWorks database diagram

Answer:

SELECT TOP 10 * 
FROM AdventureWorks.Sales.SalesPerson 
ORDER BY SalesYTD DESC

Aggregate Functions

What if we want summary statistics for the tables?

We can use the aggregation functions.

Syntax

SELECT AggFunction0(column1), AggFunction1(column2), ... 
FROM table_name 
WHERE <condition>

The popular functions are:

  • MIN: minimum of the values of the column
  • MAX: maximum of the values of the column
  • AVG: average of the values of the column
  • COUNT: number of records of the column
  • SUM: sum of values of the column

Example

Question: What are the minimum, maximum, and average ages of all employees.

sql server sample database adventureworks diagrams employee table
Source: AdventureWorks database diagram

Answer:

SELECT
-- DATEDIFF is also a function, which returns the difference between two date values
-- we can use the AS to name variable name, which is not mandatory 
MIN(DATEDIFF(year, BirthDate, '2020-04-30')) AS min_age, 
MAX(DATEDIFF(year, BirthDate, '2020-04-30')) AS max_age, 
AVG(DATEDIFF(year, BirthDate, '2020-04-30')) AS avg_age 
FROM AdventureWorks.HumanResources.Employee

GROUP BY (HAVING)

How to get summary statistics for different groups in the data?

We often use the Aggregation functions together with GROUP BY.

Syntax

Select <groupby columns>, AggFunction0(ValueColumn0), AggFunction1(ValueColumn0), ... 
FROM table_name 
GROUP BY <groupby columns>
HAVING <conditions> 
-- we don't need to put <groupby columns> in the columns section, but it is almost always included.

We can also use HAVING with the GROUP BY queries, which is similar to the WHERE statement. Since WHERE can’t be used to filter conditions for aggregate functions.

Example

Question: How much net profit (Sales – Cost) did the different Sales Territories (CountryRegionCode) make last year, from largest to smallest?

sql server sample database adventureworks diagrams sales territory table
Source: AdventureWorks database diagram

Answer:

SELECT CountryRegionCode, 
SUM(SalesLastYear - CostLastYear) as ProfitLastYear 
FROM AdventureWorks.Sales.SalesTerritory 
GROUP BY CountryRegionCode
--we can also use HAVING to specify a threshold of net profit, eg., over 5000000
--HAVING SUM(SalesLastYear - CostLastYear) > 5000000
ORDER BY SUM(SalesLastYear - CostLastYear) DESC

Related article: How to GroupBy with Python Pandas Like a Boss

SELECT CASE

We can also use the CASE statement, which is similar to IF-ELSE statements in other programming languages.

This is especially useful when we want to customize the groups and provide summary statistics. Check out the example for details.

Syntax

SELECT CASE 
       WHEN <condition0> THEN <value0>
       WHEN <condition1> THEN <value1>
        ...
       ELSE <value>
       END as <columnName>
       ... 
FROM table_name
       ...

The CASE statement goes through conditions and returns a value in order. WHEN the condition0 is met, it returns value0. Otherwise, it moves on to the next WHEN condition, and so on.

Once a condition is true, it will stop moving down and return the result. If none of the conditions are true, it returns the value under the ELSE. If there is no ELSE as well, it returns NULL.

Example

Question: Count how many Employees are between 0 and 20 years old, 20 and 40 years old, 40 and 60, 60+.

sql server sample database adventureworks diagrams employee table
Source: AdventureWorks database diagram

Answer:

SELECT CASE WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 20 THEN '[0 - 20) years old'
WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 40 THEN '[20 - 40) years old'
WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 60 THEN '[40 - 60) years old'
ELSE '60+ years old'
END as age_cat,
COUNT(*) as cnt
FROM AdventureWorks.HumanResources.Employee
GROUP BY 
CASE WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 20 THEN '[0 - 20) years old'
WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 40 THEN '[20 - 40) years old'
WHEN DATEDIFF(year, BirthDate, '2020-04-30') < 60 THEN '[40 - 60) years old'
ELSE '60+ years old'
END

SQL JOINs

So far, we’ve been extracting information from one table. While most relational databases in production have information separated into different tables.

Each table usually contains information for a specific purpose. For example, we have sales tables, customer tables, product tables in our sample database.

If we want to combine data from multiple tables easily, we must use JOIN.

Imagine we have table 1 and table 2. There are 4 main ways we want data from them (source: W3 school):

  • (INNER) JOIN: Returns records that have matching values in both tables
  • LEFT (OUTER) JOIN: Returns all records from the left table, and only the matched records from the right table
  • RIGHT (OUTER) JOIN: Returns all records from the right table, and only the matched records from the left table
  • FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table

It is easier to understand with the Venn diagrams below:

sql joins explained venn diagram
Source: W3 schools

Syntax

When using JOINS, we must use the columns of the same definition from the two tables as identifiers. This is how we link the two tables together.

SELECT table1.column1, table2.column1, ...
FROM 
table 1 AS A
(INNER/LEFT/RIGHT/FULL) JOIN
table 2 AS B
ON
A.Key = B.Key --key identical variables to join them together, they might have different column names

Examples

Question: We have two tables HumanResources.Employee and Person.Person. Both tables have key columns BusinessEntityID.

How can we get the loginID, job title, birth date, first name, last name of all the employees?

Answer:

SELECT emp.LoginID, emp.JobTitle, emp.BirthDate, contact.FirstName, contact.LastName
-- HumanResources.Employee emp is the same as HumanResources.Employee AS emp, we can ignore the AS
FROM AdventureWorks.HumanResources.Employee emp
-- we use left join since we want all the employees, which is from the left table
LEFT JOIN AdventureWorks.Person.Person contact 
on (emp.BusinessEntityID = contact.BusinessEntityID)

The LEFT JOIN will return all the rows from the HumanResources.Employee table, no matter whether there’s a match in the Person.Person table.

We can also join multiple tables at once. Let’s see how it works.

Question: List the sales amount by products for shipment to city Bothell.

It sounds like a simple question. But the information is from multiple tables:

  • The SalesOrderHeader has the orders basic information.
  • The Address has the city information.
  • The SalesOrderDetail has the LineTotal, which we assume is the amount of sales.
  • The Product has the product information.

How do we piece them together?

Source: AdventureWorks database diagram

Answer:

Luckily, these tables have common keys we can join them together.

SELECT 
adds.City, prod.name, 
SUM(sod.LineTotal) as product_sales
FROM
-- to simplify, we are using all inner join to get the information that are available in all tables 
AdventureWorks.Sales.SalesOrderHeader soh 
INNER JOIN AdventureWorks.Person.Address adds 
on (soh.ShipToAddressID = adds.AddressID)
INNER JOIN AdventureWorks.Sales.SalesOrderDetail sod 
on (soh.SalesOrderID = sod.SalesOrderID)
INNER JOIN AdventureWorks.Production.Product prod 
on (sod.ProductID = prod.ProductID)
WHERE adds.City = 'Bothell'
GROUP BY adds.City, prod.name
ORDER BY SUM(sod.LineTotal) DESC

SQL Subqueries

SQL can execute nested queries. A subquery is one SELECT statement that is embedded within another SELECT statement. The innermost subquery gets executed first, then the next level.

This is when SQL starts getting more complicated.

A good way of solving a complex problem is to break down into smaller questions/queries and piece them together.

Syntax

The syntax is the same as the basic SELECT statement.

Example

Question: Find all the sales details of the customer that spent the most money.

Answer:

As mentioned, it’s good to breakdown the problem.

First, we need to know which tables have the information we need:

  • Is it all in one table?
  • Or it’s in multiple tables that I’ll need to JOIN.

Then, we might need to write subqueries first. For example, I wrote the subquery to grab the TOP 1 customer that spent the most money.

Then, we can piece things together.

SELECT sod.* 
FROM AdventureWorks.Sales.SalesOrderDetail sod 
INNER JOIN AdventureWorks.Sales.SalesOrderHeader soh1 
ON (sod.SalesOrderID = soh1.SalesOrderID)
WHERE soh1.CustomerID IN
-- the below subquery looks for the top 1 customer by totaldue amount
(SELECT TOP 1 soh.CustomerID
FROM AdventureWorks.Sales.SalesOrderHeader soh
GROUP BY soh.CustomerID
ORDER BY SUM(soh.TotalDue) DESC)

SQL Window Functions

Lastly, let’s talk about the SQL window functions.

We haven’t had to use window functions much during our data analysis with SQL. Most of the time, you can get away using other queries without it. But it is still handy.

Determines the partitioning and ordering of a rowset before the associated window function is applied. That is, the OVER clause defines a window or user-specified set of rows within a query result set.

A window function then computes a value for each row in the window.

You can use the OVER clause with functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or a top N per group results.

OVER clause documentation

Syntax

OVER (   
       [ <PARTITION BY clause> ]  
       [ <ORDER BY clause> ]   
       [ <ROW or RANGE clause> ]  
      ) 

For details about syntax, read the OVER clause documentation.

It is easier to explain with examples. You may read the examples first and use the documentation to understand the full picture.

Examples

Question: Find the latest sales order for each credit card ID within the table below.

sql server sample database adventureworks diagrams salesorderheader table
Source: AdventureWorks database diagram

Answer:

SELECT * FROM
(
--get a table with sales and the rank by orderdate for each creditcardID
SELECT c.*, RANK () OVER (
                          PARTITION BY c.CreditCardID --similar to group by
                          ORDER BY c.OrderDate DESC --order by orderdate to get the RANK()
                          ) AS OrderDate_rank
FROM AdventureWorks.Sales.SalesOrderHeader c
) windows_function
WHERE windows_function.OrderDate_rank = 1 
--get the number 1 rank based on the most recent orderdate

Question: Find the top 3 customers that spent the most for the product ID of 765.

Answer:

Again, it’s better to breakdown the complex problem into smaller questions first:

  • Step #1: How do we get the customers and their spending who have purchased productID = 765
  • Step #2: How do we get the rank of customers by spending
  • Step #3: How do we get the top rank customers

When reading and testing the code, you can also run the subqueries that solved the first two questions to check.

--Step #3: get the top rank customer using WHERE 
SELECT * FROM
(
--Step #2: get the rank of top customers by spending using a window function
SELECT 
cps.ProductID, cps.CustomerID, cps.product_spend, RANK () OVER ( 
        PARTITION BY cps.ProductID
        ORDER BY cps.product_spend DESC
        ) AS customer_rank
FROM
--Step #1: get the customers and their spending who have purchased productID = 765 by joining multiple tables
(SELECT 
prod.ProductID, soh.CustomerID, SUM(sod.LineTotal) as product_spend
FROM AdventureWorks.Production.Product prod
LEFT JOIN 
AdventureWorks.Sales.SalesOrderDetail sod
ON (prod.ProductID = sod.ProductID)
LEFT JOIN
AdventureWorks.Sales.SalesOrderHeader soh
ON (soh.SalesOrderID = sod.SalesOrderID)
WHERE prod.ProductID = 765
GROUP BY prod.ProductID,soh.CustomerID) cps
) a
WHERE a.customer_rank <= 3

Related article: How to GroupBy with Python Pandas Like a Boss
Window functions are similar to the transform functions in Python Pandas.


Bonus Tips

Always Breakdown the Complicated Problems

This tip is already mentioned in the article. But it is critical, so we want to reiterate.

It is easier to breakdown the problem into smaller questions. And write queries to solve the smaller questions before combining them.

SQL Queries Order

SQL evaluates the queries clauses in the order of:

  1. FROM (JOIN)
  2. WHERE
  3. GROUP BY
  4. HAVING
  5. ORDER BY

Not every query needs to have all of the above. When it does, be aware of the order!


That’s it for this SQL tutorial for beginners! You should have mastered all you need to learn about SQL queries for data analysis!

To learn about SQL database management commands, check out Quick SQL Database Tutorial for Beginners.

Leave a comment for any questions you may have or anything else.

Related “Break into Data Science” articles:

How to Learn Data Science Online: ALL You Need to Know

Check out this for a detailed review of resources online, including coursesbooksfree tutorialsportfolios building, and more.

Python crash course: breaking into Data Science

What are the In-Demand Skills for Data Scientists in 2020
In this article, we extracted information from Indeed job postings. You’ll see why we recommend setting up the data science environment with SQL and Python.

Twitter
LinkedIn
Facebook
Email
Lianne & Justin

Lianne & Justin

Leave a Comment

Your email address will not be published. Required fields are marked *

More recent articles

Scroll to Top

Learn Python for Data Analysis

with a practical online course

lectures + projects

based on real-world datasets

We use cookies to ensure you get the best experience on our website.  Learn more.