SQL-Based Data Analysis for Retail Sales
Description
The goal of this project is to leverage SQL for data extraction, transformation, and analysis focused on retail sales data. This project will help identify sales patterns, the effect of promotions, and the impact of economic indices. The analysis will produce subsets of data suitable for visualization, enabling better decision-making.
The original prompt:
Based on this dataset can you work out various types of analysis that can be completed and then create specific SQL code to get subsets of data that could be turned into charts to highlight the analysis.
I'm looking for a range of SQL code that could be easily used for the analysis of this dataset
Date       Store    Category     Sales  Economic_Index  Promotion
1/1/2019   Store A  Electronics  412    99.44954273     0
1/2/2019   Store A  Electronics  202    98.85525499     0
1/3/2019   Store A  Electronics  299    101.1858996     0
1/4/2019   Store A  Electronics  181    98.69758959     1
1/5/2019   Store A  Electronics  490    97.86122082     1
1/6/2019   Store A  Electronics  306    103.6332865     1
1/7/2019   Store A  Electronics  327    99.49102562     1
1/8/2019   Store A  Electronics  197    102.1630683     0
1/9/2019   Store A  Electronics  252    92.6185201      0
1/10/2019  Store A  Electronics  413    98.16149309     0
1/11/2019  Store A  Electronics  219    98.27721518     1
1/12/2019  Store A  Electronics  188    104.2501821     0
1/13/2019  Store A  Electronics  455    101.9019185     0
1/14/2019  Store A  Electronics  169    103.4198824     0
1/15/2019  Store A  Electronics  472    99.72160068     0
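If you want to load this exact sample for experimentation, a minimal sketch of a matching table follows, assuming PostgreSQL-style syntax and a hypothetical table name retail_daily_sales (not used elsewhere in this project, which works with broader illustrative schemas such as Customers, Products, Orders, and sales_data):

-- Hypothetical table mirroring the columns in the sample dataset above
CREATE TABLE retail_daily_sales (
    sale_date       DATE,
    store           VARCHAR(50),
    category        VARCHAR(50),
    sales           DECIMAL(10, 2),
    economic_index  DECIMAL(10, 6),
    promotion       SMALLINT  -- 0 = no promotion, 1 = promotion
);

-- A few rows from the sample, inserted for quick testing
INSERT INTO retail_daily_sales (sale_date, store, category, sales, economic_index, promotion) VALUES
('2019-01-01', 'Store A', 'Electronics', 412, 99.44954273, 0),
('2019-01-04', 'Store A', 'Electronics', 181, 98.69758959, 1),
('2019-01-05', 'Store A', 'Electronics', 490, 97.86122082, 1);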
Setting Up the Database
1. Create Database
CREATE DATABASE RetailStoreDB;
2. Use the Database
USE RetailStoreDB;  -- MySQL/SQL Server syntax; in PostgreSQL, connect to the database with \c RetailStoreDB
3. Create Tables
Customers Table
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
FirstName VARCHAR(50),
LastName VARCHAR(50),
Email VARCHAR(100),
Phone VARCHAR(15),
Address VARCHAR(255),
City VARCHAR(50),
State VARCHAR(50),
ZipCode VARCHAR(10)
);
Products Table
CREATE TABLE Products (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
Category VARCHAR(50),
Price DECIMAL(10, 2),
StockQuantity INT
);
Orders Table
CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
CustomerID INT,
OrderDate DATE,
TotalAmount DECIMAL(10, 2),
FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
OrderDetails Table
CREATE TABLE OrderDetails (
OrderDetailID INT PRIMARY KEY,
OrderID INT,
ProductID INT,
Quantity INT,
Price DECIMAL(10, 2),
FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);
4. Insert Sample Data
Insert Customers
INSERT INTO Customers (CustomerID, FirstName, LastName, Email, Phone, Address, City, State, ZipCode)
VALUES
(1, 'John', 'Doe', 'john.doe@example.com', '555-1234', '123 Elm St', 'Anytown', 'CA', '12345'),
(2, 'Jane', 'Smith', 'jane.smith@example.com', '555-5678', '456 Oak St', 'Anycity', 'TX', '67890');
Insert Products
INSERT INTO Products (ProductID, ProductName, Category, Price, StockQuantity)
VALUES
(1, 'Laptop', 'Electronics', 999.99, 50),
(2, 'Headphones', 'Electronics', 199.99, 200);
Insert Orders
INSERT INTO Orders (OrderID, CustomerID, OrderDate, TotalAmount)
VALUES
(1, 1, '2023-10-01', 1199.98),
(2, 2, '2023-10-03', 199.99);
Insert OrderDetails
INSERT INTO OrderDetails (OrderDetailID, OrderID, ProductID, Quantity, Price)
VALUES
(1, 1, 1, 1, 999.99),
(2, 1, 2, 1, 199.99),
(3, 2, 2, 1, 199.99);
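To confirm how these tables relate before moving on, here is a quick check query (a sketch against the sample rows above) that joins Customers, Orders, and OrderDetails to compute total spend per customer:

-- Total spend per customer across the sample data
SELECT
    c.FirstName,
    c.LastName,
    SUM(od.Quantity * od.Price) AS total_spent
FROM Customers c
JOIN Orders o ON o.CustomerID = c.CustomerID
JOIN OrderDetails od ON od.OrderID = o.OrderID
GROUP BY c.CustomerID, c.FirstName, c.LastName;

With the sample rows, this returns 1199.98 for John Doe and 199.99 for Jane Smith, matching the TotalAmount values in Orders.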
Data Ingestion
-- Creating the table to ingest sales data
CREATE TABLE sales_data (
sale_id INT PRIMARY KEY,
date DATE,
store_id INT,
product_id INT,
quantity_sold INT,
sales_amount DECIMAL(10, 2)
);
-- Example of inserting some data entries
INSERT INTO sales_data (sale_id, date, store_id, product_id, quantity_sold, sales_amount) VALUES
(1, '2023-01-01', 101, 201, 10, 150.00),
(2, '2023-01-02', 102, 202, 15, 225.00),
(3, '2023-01-03', 103, 203, 8, 120.00),
(4, '2023-01-04', 104, 204, 20, 300.00),
(5, '2023-01-05', 105, 205, 5, 75.00);
-- Ingestion process could involve loading data from an external file like CSV
-- Using the COPY command in PostgreSQL as an example
-- Make sure the file path and column mapping correspond to your setup
COPY sales_data (sale_id, date, store_id, product_id, quantity_sold, sales_amount)
FROM '/path/to/your/sales_data.csv'
DELIMITER ','
CSV HEADER;
-- Verification
SELECT * FROM sales_data LIMIT 10;
-- You can also use an INSERT INTO ... SELECT to load data from another table
INSERT INTO sales_data (sale_id, date, store_id, product_id, quantity_sold, sales_amount)
SELECT sale_id, date, store_id, product_id, quantity_sold, sales_amount
FROM another_table;
Assumptions
- The data in the source file aligns with the table structure specified above.
- The file path and delimiter in the COPY command match your environment.
- The database user has the permissions needed to read the source file.
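The COPY command above is PostgreSQL-specific. If you are working in MySQL instead, a roughly equivalent sketch uses LOAD DATA INFILE (assuming the same column order and a header row in the CSV):

-- MySQL alternative to PostgreSQL's COPY for loading the CSV
LOAD DATA INFILE '/path/to/your/sales_data.csv'
INTO TABLE sales_data
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(sale_id, date, store_id, product_id, quantity_sold, sales_amount);

Depending on the server's secure_file_priv setting, you may need LOAD DATA LOCAL INFILE and permission to use it from the client.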
SQL Data Cleaning and Preparation
The cleaning steps below use generic column names (sale_date, sale_amount, category, customer_id) as templates; adapt them to your actual schema. Some statements use MySQL functions (for example STR_TO_DATE) while others assume PostgreSQL, so adjust the syntax for your database.
1. Remove Duplicate Records
-- Keep one row per (date, store_id, product_id) combination; delete later duplicates by sale_id
DELETE FROM sales_data
WHERE sale_id IN (
    SELECT sale_id
    FROM (
        SELECT
            sale_id,
            ROW_NUMBER() OVER (PARTITION BY date, store_id, product_id ORDER BY sale_id) AS row_num
        FROM sales_data
    ) t
    WHERE t.row_num > 1
);
2. Handle Missing Values
- For numeric columns
UPDATE sales_data
SET column_name = 0
WHERE column_name IS NULL;
- For string columns
UPDATE sales_data
SET column_name = 'Unknown'
WHERE column_name IS NULL;
3. Correct Inconsistent Data Formats
- Convert date formats
UPDATE sales_data
SET sale_date = STR_TO_DATE(sale_date, '%m/%d/%Y')
WHERE STR_TO_DATE(sale_date, '%m/%d/%Y') IS NOT NULL;
4. Standardize Categorical Values
UPDATE sales_data
SET category = 'Electronics'
WHERE category IN ('electronics', 'ELECTRONICS', 'elec');
5. Remove Outliers
DELETE FROM sales_data
WHERE sale_amount > (
SELECT AVG(sale_amount) + 3 * STDDEV(sale_amount)
FROM sales_data
);
6. Index Columns for Better Performance
CREATE INDEX idx_sale_date ON sales_data (sale_date);
CREATE INDEX idx_product_id ON sales_data (product_id);
CREATE INDEX idx_customer_id ON sales_data (customer_id);
7. Normalize Data
- For example, z-score normalization of the sale amount (note: this overwrites the raw values; consider storing the result in a separate column)
UPDATE sales_data
SET sale_amount = (sale_amount - (SELECT AVG(sale_amount) FROM sales_data)) /
(SELECT STDDEV(sale_amount) FROM sales_data);
8. Create Cleaned and Prepared Table
CREATE TABLE cleaned_sales_data AS
SELECT *
FROM sales_data
WHERE sale_date IS NOT NULL
AND product_id IS NOT NULL
AND customer_id IS NOT NULL;
9. Add Constraints
- Ensure no duplicate sales
ALTER TABLE cleaned_sales_data
ADD CONSTRAINT unique_sale UNIQUE (sale_id);
- Add foreign key constraints
ALTER TABLE cleaned_sales_data
ADD CONSTRAINT fk_product FOREIGN KEY (product_id) REFERENCES products(id);
ALTER TABLE cleaned_sales_data
ADD CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers(id);
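As a quick sanity check after cleaning (a sketch assuming the table names used above), compare row counts between the raw and cleaned tables:

-- Row counts before and after cleaning
SELECT
    (SELECT COUNT(*) FROM sales_data) AS raw_rows,
    (SELECT COUNT(*) FROM cleaned_sales_data) AS cleaned_rows;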
Basic Descriptive Analysis in SQL
Use the following SQL queries to perform basic descriptive analysis on sales data from a retail store. This analysis covers overall sales, the total number of transactions, average sales and average quantity per transaction, and basic descriptive statistics for sales amount and quantity.
1. Overall Sales
To calculate the overall sales:
SELECT
SUM(sales_amount) AS overall_sales
FROM
sales_data;
2. Total Number of Transactions
To calculate the total number of transactions:
SELECT
COUNT(transaction_id) AS total_transactions
FROM
sales_data;
3. Average Sales per Transaction
To calculate the average sales per transaction:
SELECT
AVG(sales_amount) AS average_sales_per_transaction
FROM
sales_data;
4. Average Quantity per Transaction
To calculate the average quantity of items sold per transaction:
SELECT
AVG(quantity) AS average_quantity_per_transaction
FROM
sales_data;
5. Descriptive Statistics for Sales Amount
To get descriptive statistics like minimum, maximum, and average sales amount:
SELECT
MIN(sales_amount) AS min_sales_amount,
MAX(sales_amount) AS max_sales_amount,
AVG(sales_amount) AS avg_sales_amount
FROM
sales_data;
6. Descriptive Statistics for Quantity
To get descriptive statistics like minimum, maximum, and average quantity sold:
SELECT
MIN(quantity) AS min_quantity,
MAX(quantity) AS max_quantity,
AVG(quantity) AS avg_quantity
FROM
sales_data;
7. Sales by Product Category
To calculate the total sales for each product category:
SELECT
category,
SUM(sales_amount) AS total_sales
FROM
sales_data
GROUP BY
category;
8. Sales by Month
To calculate the sales for each month:
SELECT
DATE_FORMAT(sale_date, '%Y-%m') AS month,
SUM(sales_amount) AS total_sales
FROM
sales_data
GROUP BY
DATE_FORMAT(sale_date, '%Y-%m');
Remember, these SQL queries assume your sales_data table includes fields like sales_amount, transaction_id, quantity, category, and sale_date. Adapt field names as necessary. Also note that DATE_FORMAT is MySQL syntax; in PostgreSQL, use TO_CHAR(sale_date, 'YYYY-MM') or DATE_TRUNC('month', sale_date) instead.
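If you want a single chart-ready summary row (for example, for a dashboard card), the individual measures above can be combined into one pass over the table; this sketch uses the same assumed field names:

-- One-row descriptive summary of the sales data
SELECT
    COUNT(transaction_id) AS total_transactions,
    SUM(sales_amount) AS overall_sales,
    AVG(sales_amount) AS average_sales_per_transaction,
    AVG(quantity) AS average_quantity_per_transaction,
    MIN(sales_amount) AS min_sales_amount,
    MAX(sales_amount) AS max_sales_amount
FROM sales_data;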
-- Impact of Promotions
-- Analyzing the impact of promotions on sales in a retail store.
-- 1. Calculate sales before, during, and after promotion periods
-- Assuming sales table has these columns: sale_id, product_id, sale_date, sale_amount, promotion_id
-- Promotions table has these columns: promotion_id, promo_start_date, promo_end_date
-- Step 1: Use a CTE (common table expression) to aggregate sales around each promotion period
-- Note: DATEADD is SQL Server syntax; in PostgreSQL use interval arithmetic, e.g. p.promo_start_date - INTERVAL '1 month'
WITH PromotionImpact AS (
SELECT
p.promotion_id,
p.promo_start_date,
p.promo_end_date,
-- Calculate sales during the promotion
SUM(CASE WHEN s.sale_date BETWEEN p.promo_start_date AND p.promo_end_date THEN s.sale_amount ELSE 0 END) AS sales_during_promo,
-- Calculate sales before the promotion (one month prior)
SUM(CASE WHEN s.sale_date BETWEEN DATEADD(month, -1, p.promo_start_date) AND DATEADD(day, -1, p.promo_start_date) THEN s.sale_amount ELSE 0 END) AS sales_before_promo,
-- Calculate sales after the promotion (one month after)
SUM(CASE WHEN s.sale_date BETWEEN DATEADD(day, 1, p.promo_end_date) AND DATEADD(month, 1, p.promo_end_date) THEN s.sale_amount ELSE 0 END) AS sales_after_promo
FROM
promotions p
LEFT JOIN
sales s ON p.promotion_id = s.promotion_id
GROUP BY
p.promotion_id, p.promo_start_date, p.promo_end_date
)
-- Step 2: Query to fetch and display the impact of promotions
SELECT
promotion_id,
promo_start_date,
promo_end_date,
sales_during_promo,
sales_before_promo,
sales_after_promo,
sales_during_promo - sales_before_promo AS increase_during_promo,
sales_after_promo - sales_before_promo AS post_promo_change
FROM
PromotionImpact;
This SQL script calculates the sales before, during, and after promotion periods to help assess the impact of promotions on sales.
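For a simpler chart (for example, a two-bar comparison), you can also compare transactions with and without a promotion directly. This sketch assumes the same sales table and treats a NULL promotion_id as "no promotion":

-- Promotion vs non-promotion comparison, ready for a bar chart
SELECT
    CASE WHEN promotion_id IS NOT NULL THEN 'Promotion' ELSE 'No promotion' END AS promo_flag,
    COUNT(*) AS num_transactions,
    SUM(sale_amount) AS total_sales,
    AVG(sale_amount) AS avg_sale_amount
FROM sales
GROUP BY CASE WHEN promotion_id IS NOT NULL THEN 'Promotion' ELSE 'No promotion' END;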
-- Economic Index Correlation
-- Assuming the existence of a table `sales_data` with columns: `date`, `store_id`, `sales_amount`
-- and a table `economic_data` with columns: `date`, `economic_index`
-- Step 1: Join `sales_data` with `economic_data` on the date
WITH sales_economic AS (
SELECT
s.date,
s.store_id,
s.sales_amount,
e.economic_index
FROM
sales_data s
JOIN
economic_data e ON s.date = e.date
)
-- Step 2: Calculate monthly aggregated sales and average economic index
, monthly_data AS (
SELECT
DATE_TRUNC('month', date) AS month,
SUM(sales_amount) AS total_sales,
AVG(economic_index) AS avg_economic_index
FROM
sales_economic
GROUP BY
DATE_TRUNC('month', date)
)
-- Step 3: Calculate correlations between total sales and economic index
SELECT
CORR(total_sales, avg_economic_index) AS correlation
FROM
monthly_data;
The above SQL script computes the correlation between monthly sales totals and the average economic index. Your actual table and column names might differ, so replace them as needed. The script assumes the data is already cleaned and prepared for analysis.
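A single correlation coefficient is useful, but for a scatter plot you will want the underlying monthly pairs. This sketch reuses the same joins and table names assumed above:

-- One row per month: total sales and average economic index, ready for a scatter plot
SELECT
    DATE_TRUNC('month', s.date) AS month,
    SUM(s.sales_amount) AS total_sales,
    AVG(e.economic_index) AS avg_economic_index
FROM sales_data s
JOIN economic_data e ON s.date = e.date
GROUP BY DATE_TRUNC('month', s.date)
ORDER BY month;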
Time Series Analysis using SQL
Create Time Series View for Daily Sales
CREATE VIEW daily_sales AS
SELECT
DATE(sale_date) AS sale_day,
SUM(sale_amount) AS total_sales
FROM
sales_data
GROUP BY
DATE(sale_date);
Monthly Aggregation
CREATE VIEW monthly_sales AS
SELECT
DATE_TRUNC('month', sale_day) AS sale_month,
SUM(total_sales) AS total_sales
FROM
daily_sales
GROUP BY
DATE_TRUNC('month', sale_day);
Yearly Aggregation
CREATE VIEW yearly_sales AS
SELECT
DATE_TRUNC('year', sale_day) AS sale_year,
SUM(total_sales) AS total_sales
FROM
daily_sales
GROUP BY
DATE_TRUNC('year', sale_day);
Year-over-Year Growth
CREATE VIEW yearly_growth AS
SELECT
sale_year,
total_sales,
LAG(total_sales) OVER (ORDER BY sale_year) AS previous_year_sales,
((total_sales - LAG(total_sales) OVER (ORDER BY sale_year)) / LAG(total_sales) OVER (ORDER BY sale_year)) * 100 AS yoy_growth
FROM
yearly_sales;
Moving Average (7-day)
CREATE VIEW moving_avg_7_days AS
SELECT
sale_day,
total_sales,
AVG(total_sales) OVER (ORDER BY sale_day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_sales_7_days
FROM
daily_sales;
Day-of-Week Seasonality
-- Average sales by day of week, based on the daily_sales view, to reveal the weekly seasonality pattern
CREATE VIEW daily_seasonality AS
SELECT
EXTRACT(DOW FROM sale_day) AS day_of_week,
AVG(total_sales) AS avg_sales
FROM
daily_sales
GROUP BY
EXTRACT(DOW FROM sale_day);
Identify Trends
-- Identify trends by comparing moving averages with actual sales
CREATE VIEW trend_identification AS
SELECT
ds.sale_day,
ds.total_sales,
ma.avg_sales_7_days,
CASE
WHEN ds.total_sales > ma.avg_sales_7_days THEN 'Up'
WHEN ds.total_sales < ma.avg_sales_7_days THEN 'Down'
ELSE 'Stable'
END AS trend
FROM
daily_sales ds
INNER JOIN
moving_avg_7_days ma ON ds.sale_day = ma.sale_day;
Outlier Detection
-- Detect outliers using a simple standard deviation method
WITH sales_stats AS (
SELECT
AVG(total_sales) AS avg_sales,
STDDEV(total_sales) AS stddev_sales
FROM
daily_sales
)
SELECT
ds.sale_day,
ds.total_sales,
CASE
WHEN ds.total_sales > ss.avg_sales + 2 * ss.stddev_sales THEN 'High Outlier'
WHEN ds.total_sales < ss.avg_sales - 2 * ss.stddev_sales THEN 'Low Outlier'
ELSE 'Normal'
END AS outlier_status
FROM
daily_sales ds, sales_stats ss;
Forecasting (Simple Linear Regression)
-- Example of a simple forecasting model using linear regression
-- Intercept and slope calculations for total_sales vs sale_day
WITH regression AS (
SELECT
REGR_INTERCEPT(total_sales, EXTRACT(EPOCH FROM sale_day)) AS intercept,
REGR_SLOPE(total_sales, EXTRACT(EPOCH FROM sale_day)) AS slope
FROM
daily_sales
)
SELECT
ds.sale_day,
ds.total_sales,
r.intercept + r.slope * EXTRACT(EPOCH FROM ds.sale_day) AS forecasted_sales
FROM
daily_sales ds, regression r;
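The query above only scores the historical dates. To extend the fitted line into the future, you can generate future dates and apply the same intercept and slope. This is a PostgreSQL-flavored sketch (generate_series and EXTRACT(EPOCH ...) are PostgreSQL features) that forecasts 30 days past the current date:

-- Project the fitted regression line 30 days beyond today
WITH regression AS (
    SELECT
        REGR_INTERCEPT(total_sales, EXTRACT(EPOCH FROM sale_day)) AS intercept,
        REGR_SLOPE(total_sales, EXTRACT(EPOCH FROM sale_day)) AS slope
    FROM daily_sales
),
future_days AS (
    SELECT generate_series(CURRENT_DATE + 1, CURRENT_DATE + 30, INTERVAL '1 day')::date AS sale_day
)
SELECT
    f.sale_day,
    r.intercept + r.slope * EXTRACT(EPOCH FROM f.sale_day) AS forecasted_sales
FROM future_days f
CROSS JOIN regression r;

A straight trend line ignores seasonality and promotions, so treat this as a rough baseline rather than a production forecasting model.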
Visualization Preparation
Here are SQL queries that generate the datasets needed for different types of visualizations. Run them in your SQL environment to produce chart-ready data:
1. Total Sales Per Month
SELECT
DATE_TRUNC('month', sale_date) AS month,
SUM(sale_amount) AS total_sales
FROM
sales
GROUP BY
DATE_TRUNC('month', sale_date)
ORDER BY
month;
2. Sales by Product Category
SELECT
category,
SUM(sale_amount) AS total_sales
FROM
sales
GROUP BY
category
ORDER BY
total_sales DESC;
3. Sales by Region
SELECT
region,
SUM(sale_amount) AS total_sales
FROM
sales
GROUP BY
region
ORDER BY
total_sales DESC;
4. Sales Performance Over Promotions
SELECT
DATE_TRUNC('week', sale_date) AS week,
promotion_id,
SUM(sale_amount) AS total_sales
FROM
sales
WHERE
promotion_id IS NOT NULL
GROUP BY
DATE_TRUNC('week', sale_date), promotion_id
ORDER BY
week, promotion_id;
5. Sales Trends (Time Series)
SELECT
sale_date,
SUM(sale_amount) AS daily_sales
FROM
sales
GROUP BY
sale_date
ORDER BY
sale_date;
6. Sales Correlation with Economic Index
SELECT
s.sale_date,
SUM(s.sale_amount) AS total_sales,
e.economic_index
FROM
sales s
JOIN
economic_indicators e
ON
s.sale_date = e.date
GROUP BY
s.sale_date, e.economic_index
ORDER BY
s.sale_date;
7. Customer Demographics Analysis
SELECT
age_group,
gender,
SUM(sale_amount) AS total_sales
FROM
sales s
JOIN
customers c
ON
s.customer_id = c.customer_id
GROUP BY
age_group, gender
ORDER BY
total_sales DESC;
8. Top 10 Products by Sales
SELECT
product_id,
product_name,
SUM(sale_amount) AS total_sales
FROM
sales
GROUP BY
product_id, product_name
ORDER BY
total_sales DESC
LIMIT 10;
Execute these SQL queries to prepare the data for the next step, where you can visualize the insights with your preferred charting or BI tool.
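Most charting tools can also ingest CSV files directly. As one example (a sketch assuming PostgreSQL and the psql client), the monthly sales query from step 1 can be exported with the \copy meta-command; note that \copy must be written on a single line:

-- Export the monthly sales dataset to CSV for use in a charting tool
\copy (SELECT DATE_TRUNC('month', sale_date) AS month, SUM(sale_amount) AS total_sales FROM sales GROUP BY DATE_TRUNC('month', sale_date) ORDER BY month) TO 'monthly_sales.csv' WITH CSV HEADER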