Project

Analyzing Sales Data for Retail Business

A comprehensive business intelligence project in Python using Google Colab to analyze and visualize retail sales data.

Empty image or helper icon

Analyzing Sales Data for Retail Business

Description

This project dives into a real-world business problem faced by a retail company trying to understand its sales performance. By analyzing a dataset containing sales transactions, we will uncover insights and trends to inform better decision-making. We will use Python libraries like Pandas, NumPy, Matplotlib, and Seaborn, and implement data preprocessing, exploratory data analysis, and visualization techniques.

The original prompt:

I want to create a detailed BI notebook using Google Collab. Can you come up with the scenario and then walk through examples in order to build out the notebook. Please make the scenario based on a real business problem that we are tring to analyze. Focus more on data analytics scenarios

Project: Comprehensive Business Intelligence - Unit 1: Data Import and Cleaning

Introduction

This section will cover the first part of our project - importing and cleaning the retail sales data. We will use Python and Google Colab. By the end, our data should be ready for analysis and visualization.

Setup Instructions

  1. Sign in to your Google account and open Google Colab.
  2. Create a new notebook.

Data Import and Cleaning

Step 1: Import Libraries

import pandas as pd
import numpy as np

Step 2: Mount Google Drive

Make sure your data file is uploaded to your Google Drive.

from google.colab import drive
drive.mount('/content/drive')

Step 3: Load the Dataset

Replace 'path_to_your_file.csv' with the actual path to your CSV file in your Google Drive.

file_path = '/content/drive/My Drive/path_to_your_file.csv'
data = pd.read_csv(file_path)

Step 4: Inspect the Dataset

# Display the first few rows
data.head()

# Display data types of each column
data.info()

# Get summary statistics
data.describe()

Step 5: Handle Missing Values

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

# Drop rows with missing values (if necessary)
data_cleaned = data.dropna()

# Alternatively, fill missing values
# data_cleaned = data.fillna({
#     'column1': 'default_value1',
#     'column2': 'default_value2'
# })

Step 6: Remove Duplicates

# Check for duplicates
duplicate_rows = data_cleaned[data_cleaned.duplicated()]
print(duplicate_rows)

# Drop duplicates
data_cleaned = data_cleaned.drop_duplicates()

Step 7: Data Type Conversion

Convert columns to appropriate data types if necessary.

data_cleaned['date_column'] = pd.to_datetime(data_cleaned['date_column'])
data_cleaned['numeric_column'] = pd.to_numeric(data_cleaned['numeric_column'])

Step 8: Feature Engineering

Create additional features if needed.

# Example: Extract year and month from date
data_cleaned['year'] = data_cleaned['date_column'].dt.year
data_cleaned['month'] = data_cleaned['date_column'].dt.month

Conclusion

By this point, you should have a cleaned dataset ready for analysis and visualization. This sets the foundation for subsequent steps in our comprehensive business intelligence project.

Data Exploration and Preliminary Insights

Once data import and cleaning are completed, the next step is to explore the dataset and derive some preliminary insights. Below is a practical implementation in Python:

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assume df is your cleaned DataFrame

# 1. Basic DataFrame Information
print("DataFrame Information:")
print(df.info())

print("\nSummary Statistics:")
print(df.describe())

# 2. Missing Values
print("\nMissing values per column:")
print(df.isnull().sum())

# 3. Distribution of Numerical Features
numerical_features = df.select_dtypes(include=['float64', 'int64']).columns
print("\nNumerical Features Distribution:")
df[numerical_features].hist(figsize=(14, 10))
plt.show()

# 4. Distribution of Categorical Features
categorical_features = df.select_dtypes(include=['object']).columns
print("\nCategorical Features Distribution:")
for feature in categorical_features:
    plt.figure(figsize=(10, 5))
    sns.countplot(data=df, x=feature)
    plt.title(f'Distribution of {feature}')
    plt.show()

# 5. Correlation Analysis
print("\nCorrelation Matrix:")
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# 6. Sales Trend Analysis
# Example: Assuming you have 'Date' and 'Sales' columns
df['Date'] = pd.to_datetime(df['Date'])
sales_trend = df.groupby(df['Date'].dt.to_period('M')).sum()['Sales']
print("\nSales Trend Over Time:")
sales_trend.plot(figsize=(14, 7))
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()

# 7. Top Products
# Example: Assuming you have 'Product' and 'Sales' columns
top_products = df.groupby('Product').sum()['Sales'].sort_values(ascending=False).head(10)
print("\nTop 10 Products by Sales:")
print(top_products)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.index, y=top_products.values)
plt.title('Top 10 Products by Sales')
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()

# 8. Customer Segmentation
# Example: Assuming you have 'CustomerID' and 'Sales' columns
customer_sales = df.groupby('CustomerID').sum()['Sales']
print("\nCustomer Sales Distribution:")
plt.figure(figsize=(12, 6))
sns.histplot(customer_sales, bins=30, kde=True)
plt.title('Customer Sales Distribution')
plt.xlabel('Total Sales')
plt.ylabel('Number of Customers')
plt.show()

Ensure the DataFrame df is pre-loaded in your Google Colab environment, based on your previous steps (Data Import and Cleaning). The code will provide:

  1. Overview of the DataFrame.
  2. Detailed summary statistics.
  3. Visualization of missing data.
  4. Histogram distributions for numerical features.
  5. Count plots for categorical features.
  6. Correlation heatmap.
  7. Time series analysis of sales.
  8. Visualization of top products by sales.
  9. Distribution of customer sales.

Customize column names as per your dataset if they differ. This implementation can be applied directly to explore and derive initial insights from the cleaned retail sales data.

Analyzing Sales Trends Over Time

Loading Libraries and Data

We are focusing on time-series analysis to evaluate sales trends.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming the data has been cleaned and imported already, stored in df
# Make sure the 'date' column is in datetime format:
df['date'] = pd.to_datetime(df['date'])

Resampling Data for Time Series Analysis

Aggregating Monthly Sales

# Assuming 'sales' is the column name for sales amounts
monthly_sales = df.resample('M', on='date').sum()

Visualizing Sales Trends

Line Plot of Monthly Sales

plt.figure(figsize=(12, 6))
plt.plot(monthly_sales.index, monthly_sales['sales'], marker='o', linestyle='-')
plt.title('Monthly Sales Trend')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

Visualizing Yearly Sales Trends

# Extracting year and month for grouping
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Grouping by year and month for heatmap
year_month_sales = df.pivot_table(values='sales', index='year', columns='month', aggfunc='sum')

plt.figure(figsize=(12, 8))
sns.heatmap(year_month_sales, annot=True, fmt=".1f", linewidths=.5, cmap="YlGnBu")
plt.title('Monthly Sales Heatmap Over Years')
plt.xlabel('Month')
plt.ylabel('Year')
plt.show()

Rolling Mean for Trend Analysis

Calculating and Plotting Rolling Mean

# Set rolling window by months, e.g., 3-month window
df.set_index('date', inplace=True)
rolling_sales = df['sales'].rolling(window=3).mean()

plt.figure(figsize=(12, 6))
plt.plot(df.index, df['sales'], marker='.', linestyle='-', label='Original Sales')
plt.plot(rolling_sales.index, rolling_sales, marker='o', color='red', linestyle='-', label='3-Month Rolling Mean')
plt.title('Sales Trend with 3-Month Rolling Mean')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()

Seasonal Decomposition

Using Statsmodels for Seasonal Decomposition

import statsmodels.api as sm

# Decompose the time series using seasonal decomposition
decomposition = sm.tsa.seasonal_decompose(df['sales'], model='additive')

# Plotting the decomposition
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(15, 12))
decomposition.observed.plot(ax=ax1, title='Observed')
decomposition.trend.plot(ax=ax2, title='Trend')
decomposition.seasonal.plot(ax=ax3, title='Seasonal')
decomposition.resid.plot(ax=ax4, title='Residual')

plt.tight_layout()
plt.show()

This implementation covers the analysis of sales trends over time and provides visual insights using line plots, heatmaps, rolling means, and seasonal decomposition. The code should be directly executable in Google Colab, given that the data is pre-cleaned and properly formatted.

Category-Level Sales Analysis in Python

Here's a step-by-step practical implementation of category-level sales analysis using Python, assuming you’re using a Jupyter environment such as Google Colab.

Import Necessary Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load Cleaned Data

Here, it is assumed that you have a cleaned DataFrame named sales_data loaded from your previous steps.

Group Sales Data by Category

category_sales = sales_data.groupby('Category')['Sales'].sum().reset_index()

Plot Sales by Category

plt.figure(figsize=(12, 6))
sns.barplot(x='Category', y='Sales', data=category_sales, palette='viridis')
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()

Analyze Category with Highest and Lowest Sales

highest_sales_category = category_sales.loc[category_sales['Sales'].idxmax()]
lowest_sales_category = category_sales.loc[category_sales['Sales'].idxmin()]

print(f"The category with the highest sales is: {highest_sales_category['Category']} with total sales of {highest_sales_category['Sales']:.2f}")
print(f"The category with the lowest sales is: {lowest_sales_category['Category']} with total sales of {lowest_sales_category['Sales']:.2f}")

Sales Distribution by Category Over Time

plt.figure(figsize=(14, 7))
category_sales_over_time = sales_data.groupby(['Category', 'Order Date'])['Sales'].sum().unstack('Category').fillna(0)
category_sales_over_time.plot(ax=plt.gca(), linewidth=2)

plt.title('Sales Distribution by Category Over Time')
plt.xlabel('Order Date')
plt.ylabel('Sales')
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

Analysis of Average Sales per Transaction by Category

average_sales_by_category = sales_data.groupby('Category')['Sales'].mean().reset_index()

plt.figure(figsize=(12, 6))
sns.barplot(x='Category', y='Sales', data=average_sales_by_category, palette='viridis')
plt.title('Average Sales per Transaction by Category')
plt.xlabel('Category')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.show()

print("Average Sales per Transaction by Category:")
print(average_sales_by_category)

Summary

With this implementation, you’ve performed an analysis on the sales data at the category level, visualized the data, and identified key insights such as highest and lowest sales categories, sales distribution over time, and average sales per transaction for each category. This provides a comprehensive understanding of how different product categories are performing in terms of sales.

Apply this implementation in your comprehensive BI project to enhance your insights on retail sales data.

Regional Sales Analysis

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming `df_sales` is the DataFrame containing cleaned sales data
# and the DataFrame has columns: 'Region', 'Sales', 'Date', 'Category', etc.

# Aggregate sales data by region
region_sales = df_sales.groupby('Region')['Sales'].sum().reset_index()

# Sorting the sales data by region for better visualization
region_sales = region_sales.sort_values(by='Sales', ascending=False)

# Visualization of sales data by region using a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='Region', y='Sales', data=region_sales, palette='viridis')
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()

# Further analysis: Sales Trends by Region Over Time
# Aggregating sales data by region and date
region_date_sales = df_sales.groupby(['Region', 'Date'])['Sales'].sum().reset_index()

# Creating a line plot to visualize sales trends for each region over time
plt.figure(figsize=(14, 8))
sns.lineplot(x='Date', y='Sales', hue='Region', data=region_date_sales, palette='tab10')
plt.title('Sales Trends by Region Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.show()

# More granular insights: Regional Sales by Category
region_category_sales = df_sales.groupby(['Region', 'Category'])['Sales'].sum().reset_index()

# Pivoting data to have a better view for plotting
pivot_data = region_category_sales.pivot(index='Region', columns='Category', values='Sales')

# Heatmap for visualizing sales distribution across categories in different regions
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_data, annot=True, fmt=".1f", cmap='YlGnBu', linewidths=.5)
plt.title('Sales by Category and Region')
plt.xlabel('Category')
plt.ylabel('Region')
plt.show()

This implementation covers:

  1. Aggregating and visualizing total sales by region.
  2. Analyzing sales trends over time per region.
  3. Comparing regional sales by category using a heatmap for more nuanced insights.

This approach provides clear, actionable insights into regional performance and potential areas for business improvement.

6. Customer Segmentation

Introduction

Customer segmentation involves dividing customers into distinct groups based on specific characteristics. For this section, we'll use clustering algorithms from the sklearn library to segment customers based on their purchasing behavior. Specifically, we will use the K-Means clustering algorithm.

Implementation

Import Necessary Libraries

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

Load Data

Assuming the cleaned data is stored in a DataFrame called df. The DataFrame should have columns: CustomerID, InvoiceNo, InvoiceDate, Quantity, and UnitPrice.

# Sample structure of DataFrame, adjust column names if different
df = pd.read_csv('cleaned_retail_sales_data.csv')

# Creating monetary value column
df['TotalSpent'] = df['Quantity'] * df['UnitPrice']

# Remove any negative values or returns (if not already done)
df = df[df['TotalSpent'] > 0]

RFM (Recency, Frequency, Monetary)

Calculate Recency
# Assuming the most recent transaction date is the cutoff date
import datetime as dt
cutoff_date = df['InvoiceDate'].max() + dt.timedelta(1)

recency_df = df.groupby('CustomerID').InvoiceDate.max().reset_index()
recency_df.columns = ['CustomerID', 'LastPurchaseDate']
recency_df['Recency'] = (cutoff_date - recency_df['LastPurchaseDate']).dt.days
Calculate Frequency
frequency_df = df.groupby('CustomerID').InvoiceNo.nunique().reset_index()
frequency_df.columns = ['CustomerID', 'Frequency']
Calculate Monetary
monetary_df = df.groupby('CustomerID').TotalSpent.sum().reset_index()
monetary_df.columns = ['CustomerID', 'Monetary']
Combine RFM
rfm_df = recency_df.merge(frequency_df, on='CustomerID').merge(monetary_df, on='CustomerID')
rfm_df.set_index('CustomerID', inplace=True)

Standardize the Values

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_df)

K-Means Clustering

# Determine optimal number of clusters using Elbow Method
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(rfm_scaled)
    sse.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()

# Apply KMeans with chosen number of clusters (example uses k=4)
kmeans = KMeans(n_clusters=4)
rfm_df['Cluster'] = kmeans.fit_predict(rfm_scaled)

Analyze Cluster Profiles

cluster_profiles = rfm_df.groupby('Cluster').mean()
print(cluster_profiles)

# Visualize the clusters
sns.pairplot(rfm_df, hue='Cluster', palette='tab10')
plt.show()

Conclusion

With the above steps, we successfully segmented customers based on their purchasing behavior. These segments can further be analyzed for targeted marketing strategies or personalized customer service.

Sales Forecasting with Time Series

Below is the practical implementation of Sales Forecasting with Time Series in Python. We use the statsmodels library to build an ARIMA model for the sales forecasting.

Step 1: Import Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

Step 2: Prepare Time Series Data

Assumption: Your cleaned sales data is contained in a DataFrame named sales_data with columns date and sales.

# Ensure the date column is in datetime format
sales_data['date'] = pd.to_datetime(sales_data['date'])

# Set the date column as the index
sales_data.set_index('date', inplace=True)

# Resample the data by day, month, etc. Modify as per your requirement. Example: Monthly resampling
monthly_sales = sales_data.resample('M').sum()

Step 3: Split Data into Training and Testing Set

# Define the split point
train_size = int(len(monthly_sales) * 0.8)

# Split data
train, test = monthly_sales[:train_size], monthly_sales[train_size:]

Step 4: Build and Train the ARIMA Model

# Define the ARIMA model
model = SARIMAX(train['sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))

# Fit the model
model_fit = model.fit(disp=False)

Step 5: Make Forecast

# Forecast the sales
forecast = model_fit.forecast(steps=len(test))

# Convert forecast to DataFrame for easy comparison
forecast = pd.DataFrame(forecast, index=test.index, columns=['forecast'])

Step 6: Evaluate the Model

# Calculate Mean Absolute Error
mae = mean_absolute_error(test['sales'], forecast['forecast'])
print(f'Mean Absolute Error: {mae}')

Step 7: Visualize the Forecast

# Plot the results
plt.figure(figsize=(12, 6))

plt.plot(train.index, train['sales'], label='Train')
plt.plot(test.index, test['sales'], label='Test')
plt.plot(forecast.index, forecast['forecast'], label='Forecast', linestyle='dashed')

plt.title('Sales Forecast')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()

plt.show()

Conclusion

Now you have implemented the sales forecasting using a time series approach in Python. This can be integrated into your existing project in Google Colab as part of the comprehensive business intelligence project.

Visualizing Insights Using Dashboards

In this section, we'll create interactive dashboards to visualize our insights. We'll utilize the Plotly and Dash libraries in Python, which are well-suited for creating interactive visualizations. The following code provides a practical implementation for setting up a dashboard to visualize retail sales data.

Step 1: Install Required Libraries

Ensure you have Plotly and Dash installed in your Google Colab environment:

!pip install plotly dash

Step 2: Import Necessary Libraries

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd

# Assuming the cleaned and analyzed data is already stored in a DataFrame named 'df'
# df = ... (Your processed DataFrame)

Step 3: Setup the Dash Application

app = dash.Dash(__name__)

Step 4: Define Layout of the Dashboard

app.layout = html.Div([
    html.H1("Retail Sales Dashboard"),
    
    # Dropdown for selecting the type of analysis
    html.Label("Choose Analysis"),
    dcc.Dropdown(
        id='analysis-type',
        options=[
            {'label': 'Sales Trends Over Time', 'value': 'time'},
            {'label': 'Category-Level Sales', 'value': 'category'},
            {'label': 'Regional Sales', 'value': 'region'},
            {'label': 'Customer Segmentation', 'value': 'customer_seg'}
        ],
        value='time' # Default value
    ),

    # Graph to display the selected analysis
    dcc.Graph(id='main-graph')
])

Step 5: Define Callback to Update Graph Based on Dropdown Selection

@app.callback(
    Output('main-graph', 'figure'),
    Input('analysis-type', 'value')
)
def update_graph(analysis_type):
    if analysis_type == 'time':
        fig = px.line(df, x='date', y='sales', title='Sales Trends Over Time')
    elif analysis_type == 'category':
        fig = px.bar(df, x='category', y='sales', title='Category-Level Sales')
    elif analysis_type == 'region':
        fig = px.choropleth(df, locations='region', locationmode='country names',
                            color='sales', title='Regional Sales',
                            hover_name='region', color_continuous_scale=px.colors.sequential.Plasma)
    elif analysis_type == 'customer_seg':
        fig = px.pie(df, names='customer_segment', values='sales', title='Customer Segmentation')
    
    return fig

Step 6: Run the Dash Application

if __name__ == '__main__':
    app.run_server(debug=True, use_reloader=False)

Summary

Copy and paste the code segments into your Google Colab. This implementation provides an interactive dashboard for visualizing various aspects of the retail sales data. Modify the DataFrame variable df to use your processed data. Run the Dash application to visualize insights interactively on your browser.