Project

SQL-Based Data Analysis for Retail Sales

This project involves using SQL to analyze sales data from a retail store and generating insights that can be visualized in various types of charts.

Empty image or helper icon

SQL-Based Data Analysis for Retail Sales

Description

The goal of this project is to leverage SQL for data extraction, transformation, and analysis focused on retail sales data. This project will help identify sales patterns, the effect of promotions, and the impact of economic indices. The analysis will produce subsets of data suitable for visualization, enabling better decision-making.

The original prompt:

Based on this dataset can you work out various types of analysis that can be completed and then create specific SQL code to get subsets of data that could be turned into charts to highlight the analysis.

I'm looking for a range of SQL code that could be easily used for the analysis of this dataset

Date Store Category Sales Economic_Index Promotion 1/1/2019 Store A Electronics 412 99.44954273 0 1/2/2019 Store A Electronics 202 98.85525499 0 1/3/2019 Store A Electronics 299 101.1858996 0 1/4/2019 Store A Electronics 181 98.69758959 1 1/5/2019 Store A Electronics 490 97.86122082 1 1/6/2019 Store A Electronics 306 103.6332865 1 1/7/2019 Store A Electronics 327 99.49102562 1 1/8/2019 Store A Electronics 197 102.1630683 0 1/9/2019 Store A Electronics 252 92.6185201 0 1/10/2019 Store A Electronics 413 98.16149309 0 1/11/2019 Store A Electronics 219 98.27721518 1 1/12/2019 Store A Electronics 188 104.2501821 0 1/13/2019 Store A Electronics 455 101.9019185 0 1/14/2019 Store A Electronics 169 103.4198824 0 1/15/2019 Store A Electronics 472 99.72160068 0

Setting Up the Database

1. Create Database

CREATE DATABASE RetailStoreDB;

2. Use the Database

USE RetailStoreDB;

3. Create Tables

Customers Table

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100),
    Phone VARCHAR(15),
    Address VARCHAR(255),
    City VARCHAR(50),
    State VARCHAR(50),
    ZipCode VARCHAR(10)
);

Products Table

CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Category VARCHAR(50),
    Price DECIMAL(10, 2),
    StockQuantity INT
);

Orders Table

CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    TotalAmount DECIMAL(10, 2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

OrderDetails Table

CREATE TABLE OrderDetails (
    OrderDetailID INT PRIMARY KEY,
    OrderID INT,
    ProductID INT,
    Quantity INT,
    Price DECIMAL(10, 2),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);

4. Insert Sample Data

Insert Customers

INSERT INTO Customers (CustomerID, FirstName, LastName, Email, Phone, Address, City, State, ZipCode)
VALUES 
(1, 'John', 'Doe', 'john.doe@example.com', '555-1234', '123 Elm St', 'Anytown', 'CA', '12345'),
(2, 'Jane', 'Smith', 'jane.smith@example.com', '555-5678', '456 Oak St', 'Anycity', 'TX', '67890');

Insert Products

INSERT INTO Products (ProductID, ProductName, Category, Price, StockQuantity)
VALUES 
(1, 'Laptop', 'Electronics', 999.99, 50),
(2, 'Headphones', 'Electronics', 199.99, 200);

Insert Orders

INSERT INTO Orders (OrderID, CustomerID, OrderDate, TotalAmount)
VALUES 
(1, 1, '2023-10-01', 1199.98),
(2, 2, '2023-10-03', 199.99);

Insert OrderDetails

INSERT INTO OrderDetails (OrderDetailID, OrderID, ProductID, Quantity, Price)
VALUES 
(1, 1, 1, 1, 999.99),
(2, 1, 2, 1, 199.99),
(3, 2, 2, 1, 199.99);

Data Ingestion

-- Creating the table to ingest sales data 
CREATE TABLE sales_data (
    sale_id INT PRIMARY KEY,
    date DATE,
    store_id INT,
    product_id INT,
    quantity_sold INT,
    sales_amount DECIMAL(10, 2)
);

-- Example of inserting some data entries
INSERT INTO sales_data (sale_id, date, store_id, product_id, quantity_sold, sales_amount) VALUES
(1, '2023-01-01', 101, 201, 10, 150.00),
(2, '2023-01-02', 102, 202, 15, 225.00),
(3, '2023-01-03', 103, 203, 8, 120.00),
(4, '2023-01-04', 104, 204, 20, 300.00),
(5, '2023-01-05', 105, 205, 5, 75.00);

-- Ingestion process could involve loading data from an external file like CSV
-- Using the COPY command in PostgreSQL as an example
-- Make sure the file path and column mapping correspond to your setup

COPY sales_data (sale_id, date, store_id, product_id, quantity_sold, sales_amount)
FROM '/path/to/your/sales_data.csv'
DELIMITER ','
CSV HEADER;

-- Verification
SELECT * FROM sales_data LIMIT 10;

-- You can also use an INSERT INTO ... SELECT to load data from another table
INSERT INTO sales_data (sale_id, date, store_id, product_id, quantity_sold, sales_amount)
SELECT sale_id, date, store_id, product_id, quantity_sold, sales_amount
FROM another_table;

Assumptions

  • Data formats align with the structure specified.
  • Paths to files and file formats are set correctly.
  • Necessary permissions and correct paths are in place for file ingestion.

SQL Data Cleaning and Preparation

1. Remove Duplicate Records

DELETE FROM sales_data
WHERE id IN (
    SELECT id
    FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY sale_id ORDER BY id) AS row_num
        FROM sales_data
    ) t
    WHERE t.row_num > 1
);

2. Handle Missing Values

  • For numeric columns
UPDATE sales_data
SET column_name = 0
WHERE column_name IS NULL;
  • For string columns
UPDATE sales_data
SET column_name = 'Unknown'
WHERE column_name IS NULL;

3. Correct Inconsistent Data Formats

  • Convert date formats
UPDATE sales_data
SET sale_date = STR_TO_DATE(sale_date, '%m/%d/%Y')
WHERE STR_TO_DATE(sale_date, '%m/%d/%Y') IS NOT NULL;

4. Standardize Categorical Values

UPDATE sales_data
SET category = 'Electronics'
WHERE category IN ('electronics', 'ELECTRONICS', 'elec');

5. Remove Outliers

DELETE FROM sales_data
WHERE sale_amount > (
    SELECT AVG(sale_amount) + 3 * STDDEV(sale_amount)
    FROM sales_data
);

6. Index Columns for Better Performance

CREATE INDEX idx_sale_date ON sales_data (sale_date);
CREATE INDEX idx_product_id ON sales_data (product_id);
CREATE INDEX idx_customer_id ON sales_data (customer_id);

7. Normalize Data

  • For example, normalizing sale amount
UPDATE sales_data
SET sale_amount = (sale_amount - (SELECT AVG(sale_amount) FROM sales_data)) / 
                  (SELECT STDDEV(sale_amount) FROM sales_data);

8. Create Cleaned and Prepared Table

CREATE TABLE cleaned_sales_data AS
SELECT *
FROM sales_data
WHERE sale_date IS NOT NULL
  AND product_id IS NOT NULL
  AND customer_id IS NOT NULL;

9. Add Constraints

  • Ensure no duplicate sales
ALTER TABLE cleaned_sales_data
ADD CONSTRAINT unique_sale UNIQUE (sale_id);
  • Add foreign key constraints
ALTER TABLE cleaned_sales_data
ADD CONSTRAINT fk_product FOREIGN KEY (product_id) REFERENCES products(id);

ALTER TABLE cleaned_sales_data
ADD CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers(id);

Basic Descriptive Analysis in SQL

Use the following SQL queries to perform basic descriptive analysis on sales data from a retail store. This analysis includes computing overall sales, average sales, total number of transactions, average quantity per transaction.

1. Overall Sales

To calculate the overall sales:

SELECT 
    SUM(sales_amount) AS overall_sales
FROM 
    sales_data;

2. Total Number of Transactions

To calculate the total number of transactions:

SELECT 
    COUNT(transaction_id) AS total_transactions
FROM 
    sales_data;

3. Average Sales per Transaction

To calculate the average sales per transaction:

SELECT 
    AVG(sales_amount) AS average_sales_per_transaction
FROM 
    sales_data;

4. Average Quantity per Transaction

To calculate the average quantity of items sold per transaction:

SELECT 
    AVG(quantity) AS average_quantity_per_transaction
FROM 
    sales_data;

5. Descriptive Statistics for Sales Amount

To get descriptive statistics like minimum, maximum, and average sales amount:

SELECT 
    MIN(sales_amount) AS min_sales_amount,
    MAX(sales_amount) AS max_sales_amount,
    AVG(sales_amount) AS avg_sales_amount
FROM 
    sales_data;

6. Descriptive Statistics for Quantity

To get descriptive statistics like minimum, maximum, and average quantity sold:

SELECT 
    MIN(quantity) AS min_quantity,
    MAX(quantity) AS max_quantity,
    AVG(quantity) AS avg_quantity
FROM 
    sales_data;

7. Sales by Product Category

To calculate the total sales for each product category:

SELECT 
    category,
    SUM(sales_amount) AS total_sales
FROM 
    sales_data
GROUP BY 
    category;

8. Sales by Month

To calculate the sales for each month:

SELECT 
    DATE_FORMAT(sale_date, '%Y-%m') AS month,
    SUM(sales_amount) AS total_sales
FROM 
    sales_data
GROUP BY 
    DATE_FORMAT(sale_date, '%Y-%m');

Remember, these SQL queries assume your table sales_data includes fields like sales_amount, transaction_id, quantity, category, and sale_date. Adapt field names as necessary.

-- Impact of Promotions
-- Analyzing the impact of promotions on sales in a retail store.

-- 1. Calculate sales before, during, and after promotion periods
-- Assuming sales table has these columns: sale_id, product_id, sale_date, sale_amount, promotion_id
-- Promotions table has these columns: promotion_id, promo_start_date, promo_end_date

-- Step 1: Create a temporary table to get results for each promotion period

WITH PromotionImpact AS (
    SELECT 
        p.promotion_id,
        p.promo_start_date,
        p.promo_end_date,
        -- Calculate sales during the promotion
        SUM(CASE WHEN s.sale_date BETWEEN p.promo_start_date AND p.promo_end_date THEN s.sale_amount ELSE 0 END) AS sales_during_promo,
        -- Calculate sales before the promotion (one month prior)
        SUM(CASE WHEN s.sale_date BETWEEN DATEADD(month, -1, p.promo_start_date) AND DATEADD(day, -1, p.promo_start_date) THEN s.sale_amount ELSE 0 END) AS sales_before_promo,
        -- Calculate sales after the promotion (one month after)
        SUM(CASE WHEN s.sale_date BETWEEN DATEADD(day, 1, p.promo_end_date) AND DATEADD(month, 1, p.promo_end_date) THEN s.sale_amount ELSE 0 END) AS sales_after_promo
    FROM 
        promotions p
    LEFT JOIN 
        sales s ON p.promotion_id = s.promotion_id
    GROUP BY 
        p.promotion_id, p.promo_start_date, p.promo_end_date
)

-- Step 2: Query to fetch and display the impact of promotions
SELECT
    promotion_id,
    promo_start_date,
    promo_end_date,
    sales_during_promo,
    sales_before_promo,
    sales_after_promo,
    sales_during_promo - sales_before_promo AS increase_during_promo,
    sales_after_promo - sales_before_promo AS post_promo_change
FROM 
    PromotionImpact;

This SQL script calculates the sales before, during, and after promotion periods to help assess the impact of promotions on sales.

-- Economic Index Correlation

-- Assuming the existence of a table `sales_data` with columns: `date`, `store_id`, `sales_amount`
-- and a table `economic_data` with columns: `date`, `economic_index`

-- Step 1: Join `sales_data` with `economic_data` on the date

WITH sales_economic AS (
    SELECT
        s.date,
        s.store_id,
        s.sales_amount,
        e.economic_index
    FROM
        sales_data s
    JOIN
        economic_data e ON s.date = e.date
)

-- Step 2: Calculate monthly aggregated sales and average economic index

, monthly_data AS (
    SELECT
        DATE_TRUNC('month', date) AS month,
        SUM(sales_amount) AS total_sales,
        AVG(economic_index) AS avg_economic_index
    FROM
        sales_economic
    GROUP BY
        DATE_TRUNC('month', date)
)

-- Step 3: Calculate correlations between total sales and economic index

SELECT
    CORR(total_sales, avg_economic_index) AS correlation
FROM
    monthly_data;

The above SQL script computes the correlation between sales data and an economic index. Make sure your actual table and column names might differ, so replace them as needed. This script assumes the data is already cleaned and prepared for analysis.

Time Series Analysis using SQL

Create Time Series View for Daily Sales

CREATE VIEW daily_sales AS
SELECT 
    DATE(sale_date) AS sale_day, 
    SUM(sale_amount) AS total_sales
FROM 
    sales_data
GROUP BY 
    DATE(sale_date);

Monthly Aggregation

CREATE VIEW monthly_sales AS
SELECT 
    DATE_TRUNC('month', sale_day) AS sale_month, 
    SUM(total_sales) AS total_sales
FROM 
    daily_sales
GROUP BY 
    DATE_TRUNC('month', sale_day);

Yearly Aggregation

CREATE VIEW yearly_sales AS
SELECT 
    DATE_TRUNC('year', sale_day) AS sale_year, 
    SUM(total_sales) AS total_sales
FROM 
    daily_sales
GROUP BY 
    DATE_TRUNC('year', sale_day);

Year-over-Year Growth

CREATE VIEW yearly_growth AS
SELECT 
    sale_year, 
    total_sales, 
    LAG(total_sales) OVER (ORDER BY sale_year) AS previous_year_sales,
    ((total_sales - LAG(total_sales) OVER (ORDER BY sale_year)) / LAG(total_sales) OVER (ORDER BY sale_year)) * 100 AS yoy_growth
FROM 
    yearly_sales;

Moving Average (7-day)

CREATE VIEW moving_avg_7_days AS
SELECT 
    sale_day,
    total_sales,
    AVG(total_sales) OVER (ORDER BY sale_day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_sales_7_days
FROM 
    daily_sales;

Seasonal Decomposition

-- Assuming sales_data table has columns sale_date and sale_amount

-- Decompose into daily seasonality pattern
CREATE VIEW daily_seasonality AS
SELECT 
    EXTRACT(DOW FROM sale_day) AS day_of_week,
    AVG(total_sales) AS avg_sales
FROM 
    daily_sales
GROUP BY 
    EXTRACT(DOW FROM sale_day);

Identify Trends

-- Identify trends by comparing moving averages with actual sales
CREATE VIEW trend_identification AS
SELECT
    ds.sale_day,
    ds.total_sales,
    ma.avg_sales_7_days,
    CASE 
        WHEN ds.total_sales > ma.avg_sales_7_days THEN 'Up'
        WHEN ds.total_sales < ma.avg_sales_7_days THEN 'Down'
        ELSE 'Stable'
    END AS trend
FROM
    daily_sales ds
INNER JOIN 
    moving_avg_7_days ma ON ds.sale_day = ma.sale_day;

Outlier Detection

-- Detect outliers using a simple standard deviation method
WITH sales_stats AS (
    SELECT 
        AVG(total_sales) AS avg_sales,
        STDDEV(total_sales) AS stddev_sales
    FROM 
        daily_sales
)
SELECT 
    ds.sale_day, 
    ds.total_sales,
    CASE 
        WHEN ds.total_sales > ss.avg_sales + 2 * ss.stddev_sales THEN 'High Outlier'
        WHEN ds.total_sales < ss.avg_sales - 2 * ss.stddev_sales THEN 'Low Outlier'
        ELSE 'Normal'
    END AS outlier_status
FROM 
    daily_sales ds, sales_stats ss;

Forecasting (Simple Linear Regression)

-- Example of a simple forecasting model using linear regression
-- Intercept and slope calculations for total_sales vs sale_day
WITH regression AS (
    SELECT 
        REGR_INTERCEPT(total_sales, EXTRACT(EPOCH FROM sale_day)) AS intercept,
        REGR_SLOPE(total_sales, EXTRACT(EPOCH FROM sale_day)) AS slope
    FROM  
        daily_sales
)
SELECT 
    ds.sale_day,
    ds.total_sales,
    r.intercept + r.slope * EXTRACT(EPOCH FROM ds.sale_day) AS forecasted_sales
FROM 
    daily_sales ds, regression r;

Visualization Preparation

Here are SQL queries to generate the datasets required for different types of visualizations. Execute these queries in your SQL environment to prepare the data for visualization:

1. Total Sales Per Month

SELECT
    DATE_TRUNC('month', sale_date) AS month,
    SUM(sale_amount) AS total_sales
FROM
    sales
GROUP BY
    DATE_TRUNC('month', sale_date)
ORDER BY
    month;

2. Sales by Product Category

SELECT
    category,
    SUM(sale_amount) AS total_sales
FROM
    sales
GROUP BY
    category
ORDER BY
    total_sales DESC;

3. Sales by Region

SELECT
    region,
    SUM(sale_amount) AS total_sales
FROM
    sales
GROUP BY
    region
ORDER BY
    total_sales DESC;

4. Sales Performance Over Promotions

SELECT
    DATE_TRUNC('week', sale_date) AS week,
    promotion_id,
    SUM(sale_amount) AS total_sales
FROM
    sales
WHERE
    promotion_id IS NOT NULL
GROUP BY
    DATE_TRUNC('week', sale_date), promotion_id
ORDER BY
    week, promotion_id;

5. Sales Trends (Time Series)

SELECT
    sale_date,
    SUM(sale_amount) AS daily_sales
FROM
    sales
GROUP BY
    sale_date
ORDER BY
    sale_date;

6. Sales Correlation with Economic Index

SELECT
    s.sale_date,
    SUM(s.sale_amount) AS total_sales,
    e.economic_index
FROM
    sales s
JOIN
    economic_indicators e
ON
    s.sale_date = e.date
GROUP BY
    s.sale_date, e.economic_index
ORDER BY
    s.sale_date;

7. Customer Demographics Analysis

SELECT
    age_group,
    gender,
    SUM(sale_amount) AS total_sales
FROM
    sales s
JOIN
    customers c
ON
    s.customer_id = c.customer_id
GROUP BY
    age_group, gender
ORDER BY
    total_sales DESC;

8. Top 10 Products by Sales

SELECT
    product_id,
    product_name,
    SUM(sale_amount) AS total_sales
FROM
    sales
GROUP BY
    product_id, product_name
ORDER BY
    total_sales DESC
LIMIT 10;

Execute these SQL queries to ready the data for the next step, where you can visualize the insights using appropriate tools.