SQL Sales Data Analysis Project
Description
In this project, we aim to provide deeper insights into sales data by calculating the moving average over a specific time period for each product category. This requires mastery of SQL subqueries and window functions. We start by extracting relevant sales data and then move on to the calculation of moving averages using window functions. The output will be a refactored SQL code that optimizes the calculation and presentation of the moving averages by individual product categories.
The original prompt:
Can you refactor this code for me and advise why you did it
-- Calculate moving average of sales data over a specific time period for each product category using subqueries and window functions
WITH SalesData AS ( SELECT p.product_category, s.sale_date, s.sale_amount, AVG(s.sale_amount) OVER(PARTITION BY p.product_category ORDER BY s.sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM sales_table s JOIN products p ON s.product_id = p.product_id )
SELECT product_category, sale_date, sale_amount, moving_avg FROM SalesData ORDER BY product_category, sale_date;
SQL Subqueries
Introduction
A subquery is a query that is embedded in the WHERE or HAVING clause of another SQL query. Subqueries can return individual values or a list of records; Subqueries must be enclosed with parenthesis and are always executed first, and the result of the subquery is passed on to the main query.
Let's consider a general sales table named SalesData
with fields [Order_ID, Product, Category, Amount, Date, Country].
Simple Subquery Example:
A simple example where we want to find the orders where the amount is above the average order amount. Here, the subquery calculates the average amount first and then the main query uses this calculated average to filter out the orders.
SELECT Order_ID
FROM SalesData
WHERE Amount > (SELECT AVG(Amount) FROM SalesData)
Subqueries with IN
Operator:
Now, suppose we are analyzing sales data for multiple categories and we are interested in a particular subset of categories. We can use a subquery with the IN
operator
SELECT Order_ID, Product, Amount
FROM SalesData
WHERE Category IN (SELECT Category
FROM SalesData
WHERE Category IN ('Electronics', 'Clothing'));
Subqueries in the FROM
clause:
Also, you can use subqueries in the FROM
clause. This is often used when you want to perform an operation on a result set in its entirety.
SELECT AVG(Total_Amount), Category
FROM
(SELECT Category, SUM(Amount) as Total_Amount
FROM SalesData
GROUP BY Category) as Category_Sum
GROUP BY Category
Window Functions
Window functions perform a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
Commonly used window functions include RANK()
, DENSE_RANK()
, ROW_NUMBER()
, and aggregates like SUM()
, COUNT()
, AVG()
, MIN()
, MAX()
when used with the OVER
clause.
Let's consider an example where you want to rank orders within each category based on the amount.
RANK()
Usage:
SELECT Category, Order_ID, Amount,
RANK () OVER (
PARTITION BY Category
ORDER BY Amount DESC
) Order_Rank
FROM SalesData;
In this query, the RANK()
function is used to rank the orders within each Category
based on the Amount
. Note that if two orders have the same amount in a category, they will get the same rank, and the next amount will be assigned a rank assuming the tie position was occupied by two different ranks.
ROW_NUMBER()
Usage:
ROW_NUMBER()
function can be used to assign a unique sequential integer to rows within each PARTITION
of the result set. The unique number or row number is reset for each partition.
SELECT Category, Order_ID, Amount,
ROW_NUMBER () OVER (
PARTITION BY Category
ORDER BY Amount DESC
) Order_Sequence
FROM SalesData;
In this example, the row_number function has been used to assign a unique order sequence within each product category, ordered by the amount in descending order.
By using SQL subqueries and window functions, you can perform complex data analysis and data manipulation tasks.
Understanding Window Functions in SQL
Window functions perform calculations across a set of rows related to the current row. This is comparable to an aggregate function, which performs a calculation on a set of rows and returns a single row. But, a window function does not produce a group result like aggregate functions. It returns a result for each row, from the underlying query.
The window functions allow users to perform calculations on sets of rows with more flexibility than other aggregations. By defining rows within a certain "window" that the function will process.
Syntax Structure
The structure of a window function is as follows:
WINDOW_FUNCTION (expression) OVER (
[PARTITION BY column_list_1]
[ORDER BY column_list_2]
[ROW|RANGE frame_start_option frame_end_option]
)
- WINDOW_FUNCTION: Any compatible functions like count, sum, avg, row_number, rank, and dense rank.
- expression: This could be a column or a formula.
- PARTITION BY: This can divide up your data set into sets or partitions, on which, the window function runs separately for each partition.
- ORDER BY: It determines the order of rows within each partition to which the function is applied.
- ROW|RANGE: Determines the range of input rows to be used for each row's window frame.
Example Uses
Let's say we have the following sales_information
table:
sales_rep_id | product | quarter | sales |
---|---|---|---|
1 | Product A | Q1 | 10000 |
1 | Product B | Q1 | 15000 |
1 | Product A | Q2 | 20000 |
1 | Product B | Q2 | 10000 |
2 | Product A | Q1 | 15000 |
2 | Product B | Q1 | 15000 |
2 | Product A | Q2 | 15000 |
2 | Product B | Q2 | 15000 |
Cumulative Sum
This is a type of sum where you add the current value to all the previously computed values.
Let's illustrate this by investigating the cumulative sales for a sales representative over the quarters using the SUM
function.
SELECT
sales_rep_id,
product,
quarter,
sales,
SUM(sales) OVER(PARTITION BY sales_rep_id ORDER BY quarter) as cumulative_sales
FROM
sales_information
ORDER BY
sales_rep_id,
quarter;
This SQL query will create a running total of sales for each sales representative, partitioned by their id number and ordered by the quarter.
Rank
Ranking assigns a unique rank integer to each row within a partition of a result set.
Let's see how to obtain the rank of sales representatives based on their sales using the RANK
function.
SELECT
sales_rep_id,
product,
quarter,
sales,
RANK() OVER(PARTITION BY sales_rep_id ORDER BY sales DESC) as sales_rank
FROM
sales_information
ORDER BY
sales_rep_id,
sales_rank;
The above query outputs the rank of each sales representative according to the sales for each product they sold within each quarter.
Conclusion
SQL window functions are a powerful feature that allows us to perform complex computations across rows of a dataset without having to write complex subqueries or aggregate queries. They’re highly versatile and can be used in a variety of scenarios, from calculation of running totals to data smoothing and even machine learning prediction problems.
Implementation of SQL Subqueries and Window Functions for Sales Data Analysis
Section 1. Extraction of Sales Data
Before we can employ SQL subqueries and window functions, we first need to extract the required data. Usually, sales data is stored in a relatable manner across various tables in a database. So, to begin with, let's assume we have three primary tables - sales
, products
, and customers
.
sales
table includes columns likesales_id
,product_id
,customer_id
,sales_date
,quantity
, andamount
.products
table includes columns likeproduct_id
,product_name
,category
, andprice
.customers
table includes columns likecustomer_id
,first_name
,last_name
,email
, andcity
.
Let us start by extracting sales data and merging necessary details from products
and customers
tables.
SELECT s.sales_id, s.product_id, p.product_name, p.category, p.price,
s.customer_id, c.first_name, c.last_name, c.email, c.city,
s.sales_date, s.quantity, s.amount
FROM sales AS s
INNER JOIN products AS p ON s.product_id = p.product_id
INNER JOIN customers AS c ON s.customer_id = c.customer_id
ORDER BY s.sales_date;
This SQL query extracts data from the three tables and merges based on product_id
and customer_id
.
Section 2. Implementing SQL Subquery in Sales Data Analysis
For example, if we want to find which customers have purchased the most products or the product with the most sales, we can use a subquery in the WHERE clause.
Example 1: Find customers who have purchased the most products
SELECT c.first_name, c.last_name, c.email, COUNT(s.sales_id) AS total_purchases
FROM customers AS c
INNER JOIN sales AS s ON c.customer_id = s.customer_id
WHERE c.customer_id IN (
SELECT customer_id
FROM sales
GROUP BY customer_id
ORDER BY COUNT(sales_id) DESC
LIMIT 1)
GROUP BY c.customer_id;
Section 3. Using SQL Window Functions in Sales Data Analysis
Window functions allow us to perform calculations across a set of table rows, called a 'window', that are related to the current row.
Example 1: Find the running total of sales amount each day
SELECT sales_date, amount,
SUM(amount) OVER (ORDER BY sales_date) AS running_total
FROM sales;
Example 2: Find the highest sales amount each day
SELECT sales_date, product_id, amount,
MAX(amount) OVER (PARTITION BY sales_date) AS daily_max_sales
FROM sales;
Please note that these are generic examples and changes might be required based on the actual table structure and business logic.
Calculating Moving Averages in SQL
Moving average, also known as rolling or running average, is a kind of finite impulse response. It's used frequently in data analysis to smooth out short-term fluctuations and highlight longer-term trends or cycles.
This section will explain how to compute the moving averages of the sales data using SQL Window function.
Prepare the Sales Data
Assume that we have the following sales
table:
sales_date | sales_amount |
---|---|
2020-01-01 | 100 |
2020-01-02 | 120 |
2020-01-03 | 150 |
... | ... |
The structure of the sales
table can be depicted as the following:
CREATE TABLE sales
(
sales_date DATE,
sales_amount INT
);
The actual sales data would presumably contain many more records.
Calculating Moving Averages
To compute the moving average, SQL provides an out-of-the-box function AVG()
in conjunction with window function OVER()
. In the following example, we shall calculate a 7-day moving average.
SELECT sales_date,
sales_amount,
AVG(sales_amount) OVER (
ORDER BY sales_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_average
FROM sales
ORDER BY sales_date;
Let's break down this SQL query:
AVG(sales_amount) OVER (ORDER BY sales_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
: This expression calculates the averagesales_amount
over a 'window' of records. The 'window' of records is defined byORDER BY sales_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
which means the current record and the 6 preceding records (thereby creating a 7-day period for averaging).
After executing the query, the resulting output will look as follows:
sales_date | sales_amount | moving_average |
---|---|---|
2020-01-01 | 100 | 100.00 |
2020-01-02 | 120 | 110.00 |
2020-01-03 | 150 | 123.33 |
... | ... | ... |
This result set includes the sales_date
, the sales_amount
for that date, and a moving_average
field that shows the average sales amount over the preceding 7 days.
Keep in mind that the ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
clause controls the amount of records considered in the moving average calculation - you can adjust this to calculate a different day period moving average (e.g., ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
for a 30-day moving average).
Optimizing and Refactoring SQL Code
Section 1: Identifying areas for optimization
1.1 Avoidance of Redundant Subqueries
Subqueries can sometimes result in inefficient SQL queries. This is because the subquery is often executed separately for each row checked by the outer query. An example might be:
SELECT a.CustomerID, a.Sales, a.Date
FROM Sales a
WHERE a.Sales > (SELECT AVG(b.Sales) FROM Sales b WHERE a.date = b.date);
This subquery could be optimized by using a window function to pre-compute the average sales:
SELECT a.CustomerID, a.Sales, a.Date
FROM (
SELECT CustomerID, Sales, Date, AVG(Sales) OVER (PARTITION BY date) as AvgSales
FROM Sales) a
WHERE a.Sales > a.AvgSales;
The inner query creates a new column AvgSales
using the window function AVG(Sales) OVER (PARTITION BY date)
which computes the average sales per date. This reduces the computation cost compared to the previous method where for each row the average sales are computed separately.
1.2 Inefficient Joins
Inefficient use of JOIN operations can slow down joined query performance significantly:
SELECT a.CustomerID, a.Sales, b.Country
FROM Sales a LEFT JOIN Customers b ON a.CustomerID = b.CustomerID
WHERE b.Country = "Germany";
Depending on the data distribution and index availability, the "JOIN then filter" approach may lead to unnecessary computation. A better approach might be to filter first:
SELECT a.CustomerID, a.Sales, b.Country
FROM Sales a
LEFT JOIN (SELECT * FROM Customers WHERE Country = "Germany") b ON a.CustomerID = b.CustomerID;
Section 2: Refactoring SQL Code
2.1 Conversion of Row Subqueries To JOINs
Row subqueries returning multiple columns can be refactored to JOIN operations. For instance:
SELECT a.*
FROM Sales a
WHERE (a.Date, a.CustomerID)
IN (SELECT b.Date, b.CustomerID
FROM Sales b
WHERE b.Sales > 500);
This can be refactored as:
SELECT a.*
FROM Sales a
INNER JOIN (SELECT Date, CustomerID FROM Sales WHERE Sales > 500) b
ON (a.Date = b.Date AND a.CustomerID = b.CustomerID);
2.2 Use of Window Functions Instead of Self-Joins
Self-joins can be heavy and cause performance issues. Where possible, window functions should be used. For example:
SELECT a.CustomerID, a.Sales - b.Sales AS SalesDiff
FROM Sales a JOIN Sales b ON a.CustomerID = b.CustomerID AND a.Date = b.Date - INTERVAL '1 day';
This query is used for calculating the daily sales difference. This can be refactored using window functions:
SELECT CustomerID,
Sales - LAG(Sales) OVER (PARTITION BY CustomerID ORDER BY Date) AS SalesDiff
FROM Sales;
Conclusion
Optimizing and refactoring in SQL code requires careful thought and thorough re-evaluation of your approaches to database queries. It's important to mind the cost of each operation, such as subqueries and joins, and to refactor code in such a way that makes it more efficient and easier to understand. These two aspects should be a target during your SQL code writing for an efficient data analysis process.
An In-Depth Guide to Understanding and Implementing SQL Subqueries and Window Functions for Sales Data Analysis - Part 7
The SELECT Statement in SQL
The SELECT
statement in SQL is employed to select data from a database. The data extracted from a database using a SELECT
statement is stored in a result table, also known as a result-set.
Results are returned based on the criteria defined in the SELECT
statement for database queries.
Here is the basic syntax for a SELECT
statement:
SELECT column1, column2, columnN FROM table_name;
This will fetch all the records from the specified columns in the table.
If you want to select all the available columns in a table (i.e., retrieve all records), you can replace the column names with the star symbol (*
):
SELECT * FROM table_name;
SQL SELECT DISTINCT Statement
In a table, a column may hold many duplicate values; and sometimes you'll only want to list the different (distinct) values. The DISTINCT
keyword can be used in the SELECT
statement to return only distinct (unique) values.
Here's the syntax for using the DISTINCT
keyword:
SELECT DISTINCT column1, column2, ..., columnN FROM table_name;
SQL WHERE Clause
The WHERE
Clause is used to filter records. It is used to extract only those records that fulfill a specified condition.
Here's the general syntax of a WHERE
clause:
SELECT column1, column2, ..., columnN
FROM table_name
WHERE CONDITION;
You can use logical operators like =
, >=
, <=
, <>
and arithmetic operators like +
, -
, /
, *
, %
in the condition.
Using SELECT with Subqueries and Window Functions
When executing complex queries that involve calculations or depend on multiple data sets, SELECT
can be employed with subqueries and window functions.
Essentially, a subquery is a query within another query where a query is defined to retrieve data or information back from the database.
For instance, in a sales database, if we needed to obtain the total sales per region from a sales_table, a subquery would be useful:
SELECT region,
(SELECT SUM(sales_amount) FROM sales_table WHERE sales_table.region = outer.region) as region_total
FROM sales_table as outer;
Window functions are used for complex calculations that require iteration through rows. They do not cause rows to become grouped into a single output row like the aggregation functions do, so they are very helpful when dealing with data analysis.
For example, to compute running totals of sales_amount in the sales_table:
SELECT region, sales_amount, SUM(sales_amount) OVER(ORDER BY sales_date) as running_total
FROM sales_table;
In this example, SUM OVER(ORDER BY sales_date)
is the window function which gives out the running total of sales_amount sorted by sales_date, and this data will be displayed alongside the usual SELECT
output.
Remember, when using SELECT
, be sure to have the necessary permissions to access or modify the data in the database. Retrieving data must also follow the privacy and data handling policies within your organization or team.