Project

Retail Sales Forecasting with Python

A hands-on project to utilize data science techniques to analyze and forecast retail sales data using Python.

Empty image or helper icon

Retail Sales Forecasting with Python

Description

This project aims to use statistical analysis and basic machine learning models to predict future retail sales. Participants will learn to handle data, perform exploratory data analysis, engineer relevant features, and build predictive models. They will also evaluate their models using key metrics and visualize the results to provide insights into the sales forecasting process.

The original prompt:

Project Description: Use data science techniques to analyze and forecast retail sales data. This project will utilize statistical analysis and simple machine learning models to predict future sales based on historical data, promotional information, and economic factors.

Tasks: Load the provided retail sales dataset. Conduct exploratory data analysis to understand key trends and patterns. Create new features that might influence sales, such as day of the week, month, or promotional flags. Split the data into training and test sets. Use a simple linear regression model to predict sales. Evaluate the model’s performance using metrics like MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). Visualize the actual vs. predicted sales to assess the model’s effectiveness. Expected Outcome: A Jupyter notebook containing the exploratory data analysis, feature engineering, model training, evaluation, and visualizations that provide insights into the sales forecasting process.

This project emphasizes data handling, exploratory analysis, and basic predictive modeling, which are core skills in data science. It’s designed to be approachable for participants with a basic to intermediate understanding of data science principles.

Retail Sales Data Analysis and Forecasting

1. Introduction

In this section, we will load and understand a retail sales dataset. This involves setting up the Python environment, loading the dataset, and performing initial exploratory data analysis (EDA).

2. Setup Instructions

Before we start loading the dataset, make sure you have the necessary Python libraries installed. You can install them using pip if they are not already available.

pip install pandas matplotlib seaborn

3. Loading the Dataset

We'll use the pandas library to load the dataset. Assume the dataset is a CSV file named retail_sales.csv.

import pandas as pd

# Loading the dataset
file_path = 'retail_sales.csv'
sales_data = pd.read_csv(file_path)

# Display the first few rows of the dataset
print(sales_data.head())

4. Understanding the Dataset

4.1 Basic Information

Let's understand the structure of the dataset by checking its basic information such as data types, missing values, and basic statistics.

# Checking the structure of the dataset
print(sales_data.info())

# Summary statistics
print(sales_data.describe())

4.2 Checking for Missing Values

Identify columns with missing values to understand the extent of missing data.

# Checking for missing values
missing_values = sales_data.isnull().sum()
print(missing_values)

4.3 Exploratory Data Analysis (EDA)

4.3.1 Distribution of Sales

Visualize the distribution of sales using histograms and box plots.

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of sales data
plt.figure(figsize=(10, 6))
sns.histplot(sales_data['sales_amount'], kde=True)
plt.title('distribution of sales')
plt.xlabel('Sales Amount')
plt.ylabel('Frequency')
plt.show()

# Box plot of sales data
plt.figure(figsize=(10, 6))
sns.boxplot(x=sales_data['sales_amount'])
plt.title('Box plot of Sales Amount')
plt.xlabel('Sales Amount')
plt.show()

4.3.2 Time Series Analysis

If the dataset includes a time component, we can analyze sales over time.

# Convert the date column to datetime if it's not already
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Set the date column as the index
sales_data.set_index('date', inplace=True)

# Plotting sales over time
plt.figure(figsize=(14, 8))
sales_data['sales_amount'].plot()
plt.title('Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales Amount')
plt.show()

Conclusion

This concludes the first part of our project - loading and understanding the retail sales data. In the next steps, we will proceed with data cleaning, feature engineering, and forecasting sales using advanced data science techniques.

Exploratory Data Analysis for Key Trends and Patterns

Here is a practical implementation of an exploratory data analysis (EDA) to identify key trends and patterns using Python. We will use pandas, numpy, matplotlib, and seaborn libraries.

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Load the Data (Assuming this is done in the previous steps)

# data = pd.read_csv('retail_sales_data.csv')
# Ensure the above line is already handled in the earlier steps
# Here, I'll just proceed assuming `data` is your DataFrame

3. Basic Data Cleaning

# Handling missing values
data.dropna(inplace=True)

# Convert date column to datetime if not already done
data['Date'] = pd.to_datetime(data['Date'])

4. Descriptive Statistics

# General statistics
print(data.describe())

# Checking the data types
print(data.info())

5. Univariate Analysis

Histograms

# Plot histograms for numerical variables
data.hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()

Box Plots

# Plot boxplots to detect outliers in numerical columns
plt.figure(figsize=(10, 6))
sns.boxplot(data=data.select_dtypes(include=np.number))
plt.xticks(rotation=90)
plt.show()

6. Bivariate Analysis

Scatter Plots

# Scatter plot to see the relationship between two numerical variables
plt.figure(figsize=(10, 6))
plt.scatter(data['Sales'], data['Profit'])
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Sales vs Profit')
plt.show()

Correlation Matrix

# Correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

7. Time Series Analysis

# Setting the date column as the index
data.set_index('Date', inplace=True)

# Resampling the data to monthly frequency
monthly_data = data.resample('M').sum()

# Plotting time series
plt.figure(figsize=(12, 6))
plt.plot(monthly_data['Sales'], label='Monthly Sales')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Monthly Sales Over Time')
plt.legend()
plt.show()

8. Seasonality and Trend

Decomposition

from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(monthly_data['Sales'], model='additive')
result.plot()
plt.show()

9. Segment Analysis

Sales by Category

# Assuming there is a 'Category' column
plt.figure(figsize=(12, 6))
category_sales = data.groupby('Category')['Sales'].sum().sort_values()
category_sales.plot(kind='bar')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.title('Total Sales by Category')
plt.show()

Sales by Region

# Assuming there is a 'Region' column
plt.figure(figsize=(12, 6))
region_sales = data.groupby('Region')['Sales'].sum().sort_values()
region_sales.plot(kind='bar')
plt.xlabel('Region')
plt.ylabel('Sales')
plt.title('Total Sales by Region')
plt.show()

10. Key Insights Summary

At the conclusion of your EDA, print or log a summary of key insights:

print("Key Insights:")
print(f"Total Sales: {data['Sales'].sum()}")
print(f"Total Profit: {data['Profit'].sum()}")
print(f"Top Category by Sales: {category_sales.idxmax()} with {category_sales.max()} sales")
print(f"Top Region by Sales: {region_sales.idxmax()} with {region_sales.max()} sales")

This code summarizes and visualizes the key trends and patterns of your retail sales data, arming you with critical insights for further analysis or forecasting tasks.

Feature Engineering for Sales Forecasting

In this part of the project, we will perform feature engineering to create meaningful features that can improve the accuracy of our sales forecasting model.

Step 1: Date Features

Extract useful features from the date column in the dataset such as year, month, day of the week, and whether it is a weekend or holiday.

# Assuming the DataFrame is named df and the date column is 'date'

# Import necessary libraries
import pandas as pd
import numpy as np

# Ensure the 'date' column is in datetime format
df['date'] = pd.to_datetime(df['date'])

# Create new date-related features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['is_weekend'] = df['dayofweek'].apply(lambda x: 1 if x >= 5 else 0)

# Function to check if a date is a holiday 
# (You may need to define your own holidays as per your country's calendar)
def is_holiday(date):
    holidays = [
        "2022-01-01", "2022-12-25", "2022-07-04",  # add more holidays
    ]
    return 1 if date.strftime('%Y-%m-%d') in holidays else 0

df['is_holiday'] = df['date'].apply(is_holiday)

Step 2: Lag Features

Create lag features to capture the temporal patterns in sales.

# Create lag features
# Assuming the sales data is recorded daily and the 'sales' column contains the sales figures
for lag in [1, 7, 14, 30]:  # For daily data, common lags are 1 day, 7 days (weekly), 14 days, and 30 days (monthly)
    df[f'sales_lag_{lag}'] = df['sales'].shift(lag)

# Fill NaN values with 0 (or any other appropriate value or imputation method)
df.fillna(0, inplace=True)

Step 3: Rolling Window Features

Calculate rolling statistics such as mean and standard deviation to capture trends and seasonality.

# Create rolling statistics features
for window in [7, 14, 30]:  # Windows of 1 week, 2 weeks, and 1 month
    df[f'sales_roll_mean_{window}'] = df['sales'].rolling(window=window).mean()
    df[f'sales_roll_std_{window}'] = df['sales'].rolling(window=window).std()

# Again, fill NaN values with 0
df.fillna(0, inplace=True)

Step 4: Dealing with Categorical Variables

Convert categorical variables into numeric form (e.g. using one-hot encoding).

# Example categorical variable: 'store_type'
# Assuming the column 'store_type' exists in the DataFrame

# Convert categorical feature 'store_type' into numeric form using one-hot encoding
df = pd.get_dummies(df, columns=['store_type'])

# Check the DataFrame to ensure all changes are applied
print(df.head())

This completes the feature engineering step for your sales forecasting project. You should now have a DataFrame with enriched features that can be used to train a predictive model.

Data Splitting: Preparing Training and Test Sets

To prepare training and test sets in Python, you can use the train_test_split function from the sklearn.model_selection module. This function splits the data into training and test subsets, which will allow you to evaluate the performance of your predictive models.

Here's a practical implementation:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load your dataset
data = pd.read_csv('retail_sales_data.csv')

# Assume 'sales' is the target variable and the rest are features
X = data.drop(columns='sales')
y = data['sales']

# Split the data into 80% training and 20% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display shapes to verify splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Explanation

  1. Importing Libraries: The necessary libraries are imported first. pandas is used for data manipulation and train_test_split for splitting the data.

  2. Loading Dataset: The dataset is loaded using pd.read_csv. Ensure that retail_sales_data.csv is in your working directory or provide the correct path to the file.

  3. Preparing Features and Target Variable: The feature set X is prepared by dropping the 'sales' column from the dataset. The target variable y is set as the 'sales' column.

  4. Splitting the Data: The train_test_split function is used to split X and y into training and test sets. Here, 80% of the data is used for training (X_train, y_train) and 20% for testing (X_test, y_test). The random_state parameter is set to ensure reproducibility.

  5. Verification: The shapes of the resulting splits are printed to verify that the splitting was successful. The training set should be 80% and the test set should be 20% of the original dataset.

You can now proceed to build and evaluate your models using X_train, y_train for training, and X_test, y_test for testing.

Part 5: Introduction to Linear Regression for Sales Prediction

Objective

In this part, we will use linear regression to predict future sales based on the historical retail sales data you've already prepared. Linear regression models the relationship between a dependent variable (sales) and one or more independent variables (features).

Steps

  1. Import Necessary Libraries
  2. Load the Preprocessed Data
  3. Instantiate and Train the Linear Regression Model
  4. Evaluate the Model
  5. Predict Future Sales
# Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 2: Load the Preprocessed Data
# Assuming the preprocessed data is stored in 'train_data.csv' and 'test_data.csv'
train_data = pd.read_csv('train_data.csv')
test_data = pd.read_csv('test_data.csv')

# Separate features (X) and target variable (y)
X_train = train_data.drop(columns='sales')
y_train = train_data['sales']
X_test = test_data.drop(columns='sales')
y_test = test_data['sales']

# Step 3: Instantiate and Train the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 4: Evaluate the Model
# Predict on training data
train_predictions = model.predict(X_train)
# Predict on test data
test_predictions = model.predict(X_test)

# Calculate Mean Squared Error and R^2 Score for training data
train_mse = mean_squared_error(y_train, train_predictions)
train_r2 = r2_score(y_train, train_predictions)

# Calculate Mean Squared Error and R^2 Score for test data
test_mse = mean_squared_error(y_test, test_predictions)
test_r2 = r2_score(y_test, test_predictions)

print(f'Training Data: MSE = {train_mse}, R^2 = {train_r2}')
print(f'Test Data: MSE = {test_mse}, R^2 = {test_r2}')

# Step 5: Predict Future Sales
# Assuming 'future_data.csv' contains the future dates and their corresponding features
future_data = pd.read_csv('future_data.csv')
future_sales_predictions = model.predict(future_data)

# Save the predictions into a CSV file
future_data['Predicted_Sales'] = future_sales_predictions
future_data.to_csv('future_sales_predictions.csv', index=False)

print("Future sales predictions have been saved to 'future_sales_predictions.csv'")

Explanation

  1. Importing Libraries: The necessary libraries for data handling, implementation of linear regression, and error metrics are imported.
  2. Loading Data: Preprocessed training and test data are loaded. These data should already have features engineered and are split into training and testing sets.
  3. Model Training: A linear regression model is instantiated and fitted using the training data.
  4. Evaluation: The model's performance is evaluated on both the training and test data using Mean Squared Error (MSE) and R² Score.
  5. Prediction: The model predicts future sales based on new input data, and the predictions are saved to a CSV file.

This implementation allows you to apply linear regression to predict sales based on historical data, evaluate the performance of the model, and use it to make future sales predictions suitable for real-life applications.

Evaluating Model Performance with MAE and RMSE

In this section, we'll compute two key metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), to evaluate the performance of the linear regression model you trained for retail sales forecasting.

Mean Absolute Error (MAE)

MAE represents the average of absolute differences between predictions and actual observations. It gives an idea of how wrong the predictions were, without considering their direction.

Root Mean Squared Error (RMSE)

RMSE is the square root of the average of squared differences between predictions and actual observations. It penalizes larger errors more significantly due to the squaring part.

Below is the complete implementation to calculate MAE and RMSE using Python:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Assuming y_test is the actual test set values and y_pred is the predicted values from the model
y_test = np.array([actual values])  # replace with actual test set values
y_pred = np.array([predicted values])  # replace with predicted values from your model

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse}")

Explanation

  1. Import Required Libraries: We need numpy for numerical operations and mean_absolute_error, mean_squared_error from sklearn.metrics to compute MAE and RMSE.

  2. Prepare Data: y_test holds the actual values of the test set, and y_pred holds the predicted values from your linear regression model.

  3. Calculate MAE: mean_absolute_error(y_test, y_pred) computes the MAE by averaging the absolute differences between actual values and predicted values.

  4. Calculate RMSE: mean_squared_error(y_test, y_pred) first computes the MSE by averaging the squared differences, then np.sqrt computes the square root of MSE to obtain RMSE.

Apply in Real Life

Replace the placeholder arrays for y_test and y_pred with your actual test dataset and predicted values from your sales forecasting model. The printed MAE and RMSE will give you concrete numbers to evaluate how well your model performs.

# Example with real data
y_test = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 145, 210, 240, 310])

mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse}")

By running this fragment with your specific data, you can effectively gauge the prediction accuracy and performance of your retail sales forecasting model.

Visualizing Actual vs. Predicted Sales

To visualize the actual vs. predicted sales, we will use the matplotlib library to produce the comparison plot. This visualization helps in understanding how well our model is performing in predicting sales compared to actual sales data.

Here is the complete implementation:

  1. Import Necessary Libraries: Ensure you have the required libraries installed and imported.

  2. Load the Data: Make sure your dataset including actual and predicted sales is ready.

  3. Create Visualization: Use matplotlib to plot the actual vs. predicted sales.

Implementation

import matplotlib.pyplot as plt
import pandas as pd

# Assuming you have your sales predictions and actual sales in a DataFrame:
# sales_data should contain columns like 'date', 'actual_sales', 'predicted_sales'

# Example DataFrame creation
data = {
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
    'actual_sales': [100, 150, 200, 250, 300],
    'predicted_sales': [110, 160, 195, 240, 310]
}
sales_data = pd.DataFrame(data)

# Convert 'date' column to datetime if not already in that format
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Plotting the Data
plt.figure(figsize=(10, 6))
plt.plot(sales_data['date'], sales_data['actual_sales'], marker='o', linestyle='-', color='b', label='Actual Sales')
plt.plot(sales_data['date'], sales_data['predicted_sales'], marker='x', linestyle='--', color='r', label='Predicted Sales')

# Add titles and labels
plt.title('Actual vs. Predicted Sales')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)

# Show the plot
plt.show()

Explanation

  • Loading the Libraries: Matplotlib is used for plotting, and Pandas is used to manage the data.
  • Creating Dummy Data: For demonstration purposes, a sample DataFrame sales_data is created. In practice, you should replace this with your actual dataset.
  • Data Preparation: Ensure the date column is in datetime format.
  • Plotting Actual vs. Predicted Sales:
    • The actual sales are plotted with a solid line and round markers.
    • The predicted sales are plotted with a dashed line and 'x' markers.
  • Labels and Grid: Titles, labels, and grid are added for better readability.

This code will generate a plot comparing the actual and predicted sales values over time, allowing for a straightforward visual analysis of your model’s performance.

Insights and Reporting

Once we have built our sales prediction model and evaluated its performance, it is crucial to generate insights and report them effectively. Below is a step-by-step guide to create insights and reporting using Python.

Step 1: Import Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

Step 2: Load Data

Assume sales_test.csv contains the test set with actual sales and predicted sales columns.

df = pd.read_csv("sales_test.csv")

Step 3: Calculate Performance Metrics

# Calculating Mean Absolute Error (MAE)
mae = mean_absolute_error(df['actual_sales'], df['predicted_sales'])

# Calculating Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(df['actual_sales'], df['predicted_sales']))

Step 4: Generate Summary Statistics

summary_statistics = df.describe()

# Calculating additional statistics if necessary
mean_actual_sales = df['actual_sales'].mean()
mean_predicted_sales = df['predicted_sales'].mean()

# Computing summary insights
summary_insights = {
    "MAE": mae,
    "RMSE": rmse,
    "Mean Actual Sales": mean_actual_sales,
    "Mean Predicted Sales": mean_predicted_sales
}

for key, value in summary_insights.items():
    print(f"{key}: {value:.2f}")

Step 5: Visualize the Results

Plot 1: Actual vs. Predicted Sales

plt.figure(figsize=(10, 6))
sns.lineplot(x=df.index, y=df['actual_sales'], label='Actual Sales')
sns.lineplot(x=df.index, y=df['predicted_sales'], label='Predicted Sales')
plt.title('Actual vs Predicted Sales')
plt.xlabel('Time')
plt.ylabel('Sales')
plt.legend()
plt.show()

Plot 2: Error Distribution

df['error'] = df['actual_sales'] - df['predicted_sales']
plt.figure(figsize=(10, 6))
sns.histplot(df['error'], kde=True)
plt.title('Error Distribution (Actual - Predicted)')
plt.xlabel('Error')
plt.ylabel('Frequency')
plt.show()

Step 6: Generate a Report

Below is the code to generate a simple text-based report about the metrics and key insights.

report = f"""
Sales Prediction Model Report

Performance Metrics:
---------------------
Mean Absolute Error (MAE): {mae:.2f}
Root Mean Squared Error (RMSE): {rmse:.2f}

Summary Statistics:
--------------------
Mean Actual Sales: {mean_actual_sales:.2f}
Mean Predicted Sales: {mean_predicted_sales:.2f}

Error Analysis:
---------------
The error distribution suggests that the model is [describe the bias or any anomalies if applicable].

Visualization:
--------------
1. The line plot of 'Actual vs Predicted Sales' demonstrates the model's effectiveness in following the sales trend.
2. The histogram of 'Error Distribution' helps in understanding the discrepancy between actual and predicted values.

"""
print(report)

Conclusion

Once you have this code integrated, you will not only be able to gather critical insights but also provide a comprehensive report that can be used for further analysis and decision-making. This implementation should fit seamlessly into your project, providing the necessary insights and reporting capabilities for your retail sales forecast.