Project

Customer Segmentation Analysis Using Python

Utilize data analysis and machine learning to segment customers based on purchasing behavior, demographics, and engagement levels. Derive actionable insights by applying clustering techniques in Python.

Description

This project aims to perform customer segmentation by applying unsupervised machine learning techniques. Students will work through loading and cleaning a dataset, performing exploratory data analysis, reducing dimensionality using PCA, and applying clustering algorithms like K-Means or Hierarchical Clustering. The final outcome will be a comprehensive analysis with visualizations, presented in a Jupyter notebook, to interpret and demonstrate the segmentation process.

The original prompt:

Customer Segmentation Analysis

Project Description: Perform customer segmentation using clustering techniques to identify different types of customers in a dataset. This project involves using unsupervised machine learning to group customers based on their purchasing behavior, demographic information, and engagement levels.

Tasks:

  1. Load and clean the customer dataset.
  2. Perform exploratory data analysis to understand customer demographics and behavior.
  3. Use PCA (Principal Component Analysis) to reduce dimensionality.
  4. Apply clustering algorithms like K-Means or Hierarchical Clustering to segment customers.
  5. Analyze the characteristics of each customer group.
  6. Visualize the clusters to interpret and present the segmentation.

Expected Outcome: A Jupyter notebook that includes data cleaning, exploratory analysis, dimensionality reduction, clustering, and comprehensive visualizations that demonstrate the customer segmentation process.

Data Loading and Cleaning in Python

Setup Instructions

  1. Install Libraries: Ensure the libraries used throughout this project are installed. Install any that are missing with pip.
    pip install pandas numpy scikit-learn scipy matplotlib seaborn

Implementation

1. Loading the Data

import pandas as pd

# Load the CSV data into a DataFrame
file_path = 'path_to_your_file.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(df.head())

2. Exploring and Cleaning the Data

# Explore basic information about the DataFrame
df.info()  # info() prints its report directly and returns None
print(df.describe())

# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

# Handling missing values: pick ONE of the two options below
# Option 1: Drop rows with missing values
# df_cleaned = df.dropna()

# Option 2: Fill missing values with the mean of the column (for numerical data)
# numeric_only=True skips text columns, which have no mean
df_cleaned = df.fillna(df.mean(numeric_only=True))

# Handling categorical variables (if any)
# Convert categorical variables to dummy/indicator variables
df_cleaned = pd.get_dummies(df_cleaned, drop_first=True)

# Output cleaned data information
df_cleaned.info()

3. Normalize the Data (Optional)

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Assume 'cols_to_scale' contains the names of the columns to be scaled
cols_to_scale = ['Column1', 'Column2', 'Column3']
df_cleaned[cols_to_scale] = scaler.fit_transform(df_cleaned[cols_to_scale])

# Display the first few rows of the cleaned and scaled DataFrame
print(df_cleaned.head())

This implementation includes loading data from a CSV file, exploring the data, cleaning it by handling missing values and categorical variables, and optionally normalizing numerical features. This prepares the data for subsequent analysis and machine learning tasks.
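To make the cleaning choices above concrete, here is a minimal sketch on a made-up four-row table. The column names (age, annual_spend, channel) and all values are hypothetical and exist only for illustration. It contrasts dropping rows with imputing, then one-hot encodes the categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are illustrative only
toy = pd.DataFrame({
    "age": [25, np.nan, 47, 33],
    "annual_spend": [1200.0, 800.0, np.nan, 400.0],
    "channel": ["web", "store", "web", None],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute instead of dropping
filled = toy.copy()
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())
filled["channel"] = filled["channel"].fillna(filled["channel"].mode().iloc[0])

# One-hot encode the categorical column; drop_first removes one redundant level
encoded = pd.get_dummies(filled, drop_first=True)

print(f"rows: original={len(toy)}, dropped={len(dropped)}, filled={len(filled)}")
print(encoded)
```

On a table this small, dropping rows discards most of the data while imputation keeps every row. That trade-off is worth checking on your own dataset before committing to either option.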

Exploratory Data Analysis (EDA) and Clustering Implementation

1. Import Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

2. Load Clean Data

# Assuming data is loaded and cleaned in df
# df = pd.read_csv('cleaned_customer_data.csv')  # Example of how you might load the data

# Display first few rows to inspect the dataframe
print(df.head())

3. Statistical Summary

# Summary statistics of the dataset
print(df.describe())
df.info()  # info() prints its report directly and returns None

4. Visualize Data Distributions

# Visualizing distributions of numerical features
num_columns = df.select_dtypes(include='number').columns  # all integer and float columns
df[num_columns].hist(bins=15, figsize=(15, 10))  # pandas picks a grid that fits every column
plt.tight_layout()
plt.show()

5. Visualize Relationships

# Pairplot of numerical features
sns.pairplot(df[num_columns])
plt.show()

# Correlation matrix
corr_matrix = df[num_columns].corr()  # restrict to numeric columns; text columns cannot be correlated
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

6. Preprocessing for Clustering

# Scaling the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[num_columns])

# Visualizing scaled data distributions
scaled_df = pd.DataFrame(scaled_data, columns=num_columns)
scaled_df.hist(bins=15, figsize=(15, 10))  # pandas picks a grid that fits every column
plt.tight_layout()
plt.show()

7. Determine Optimal Number of Clusters

# Using the elbow method to find the optimal number of clusters
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    sse.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()

# Using silhouette score to validate
sil_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    clusters = kmeans.predict(scaled_data)
    sil_scores.append(silhouette_score(scaled_data, clusters))

plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), sil_scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores For Optimal k')
plt.show()
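The silhouette scores can also be compared programmatically instead of only by eye. The sketch below is a minimal illustration on synthetic data: make_blobs stands in for scaled_data, and the three centers, the candidate range, and the n_init value are assumptions chosen for the demo, not values from the project dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_data: three well-separated groups (assumption)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

candidate_ks = list(range(2, 7))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Pick the k with the highest average silhouette score
best_k = candidate_ks[int(np.argmax(scores))]
print(dict(zip(candidate_ks, np.round(scores, 3))))
print("best k by silhouette:", best_k)
```

On real customer data the peak is rarely this clear, so treat the highest score as a starting point and cross-check it against the elbow plot and business interpretability.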

8. Fit K-Means and Assign Cluster Labels

# Fit the KMeans algorithm based on the optimal number of clusters found
optimal_clusters = 4  # Assume we found 4 as the optimal number from previous steps

kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_data)

# Inspect the cluster assignments
print(df['Cluster'].value_counts())

9. Derive Actionable Insights

# Inspect the cluster centers, converted back to the original feature units
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_df = pd.DataFrame(cluster_centers, columns=num_columns)
print(cluster_df)

# Visualizing Cluster Distributions
for column in num_columns:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x='Cluster', y=column, data=df)
    plt.title(f'Cluster vs {column}')
    plt.show()

10. Conclusion

With the clusters derived, analyze the characteristics of each cluster to identify patterns in purchasing behavior, demographics, and engagement levels. Use these insights to tailor marketing strategies, product offerings, and customer engagement plans.

This practical implementation provides a complete EDA and clustering analysis process using Python. You can now apply this to your cleaned dataset to gain actionable insights.

Handling Missing Data and Outliers

Import Required Libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

Example Data Loading (Assuming DataFrame df is already loaded)

# Sample loading code if needed
# df = pd.read_csv('your_data.csv')

Handling Missing Data

# Identify missing values
missing_data = df.isnull().sum()
print("Missing data per column:\n", missing_data)

# Drop columns with a high percentage of missing values if needed
threshold = 0.5  # example threshold: drop columns that are at least 50% missing
df = df.loc[:, df.isnull().mean() < threshold].copy()  # copy avoids SettingWithCopyWarning below

# Impute missing values
# Numeric columns: fill with the column mean
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Categorical columns: fill with the most frequent value
categorical_cols = df.select_dtypes(include=[object]).columns
if len(categorical_cols) > 0:  # mode().iloc[0] fails when there are no categorical columns
    df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

Identifying and Handling Outliers

# Detect outliers using the IQR method
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1

outliers = ((df[numeric_cols] < (Q1 - 1.5 * IQR)) | (df[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)

# Remove outliers (copy so later column assignments act on an independent DataFrame)
df = df[~outliers].copy()

Data Normalization

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

K-Means Clustering

# Choose the number of clusters
k = 5

kmeans = KMeans(n_clusters=k, random_state=42)
df['Cluster'] = kmeans.fit_predict(df[numeric_cols])

Visualization of Clusters

# Example visualization
# 'feature_x' and 'feature_y' are placeholders; replace them with two numeric columns from df
plt.figure(figsize=(10, 6))
sns.scatterplot(x='feature_x', y='feature_y', hue='Cluster', data=df, palette='Set1', legend='full')
plt.title('Customer Segmentation')
plt.show()

Deriving Actionable Insights

# Example: average of each numeric feature per cluster
# (values are in standardized units because numeric_cols were scaled above)
cluster_insights = df.groupby('Cluster').mean(numeric_only=True)
print(cluster_insights)

This implementation handles missing data by imputing values, removes outliers using the IQR method, normalizes the data, and segments customers using K-Means clustering, providing a visualization of the results along with some basic cluster insights.
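As a worked example of the IQR rule, the sketch below applies it to a single made-up column. The values are hypothetical. Capping at the fences is shown as an alternative to removal; it is not used in the code above and is included only for comparison.

```python
import pandas as pd

# Hypothetical spend values with one extreme entry
spend = pd.Series([10, 12, 11, 13, 12, 100], name="spend")

q1 = spend.quantile(0.25)
q3 = spend.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the IQR fences
is_outlier = (spend < lower) | (spend > upper)

# Two ways to handle them: remove the rows, or cap values at the fences
removed = spend[~is_outlier]
capped = spend.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}]")
print("outliers:", spend[is_outlier].tolist())
print("rows kept after removal:", len(removed))
print("max after capping:", capped.max())
```

Removal shrinks the dataset, while capping keeps every customer but compresses extreme values. Which one is appropriate depends on whether the extreme customers are errors or a genuine high-value segment.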

Dimensionality Reduction Using PCA

# Importing necessary libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Assuming `data` is your cleaned, fully numeric dataframe containing purchasing behavior, demographics, and engagement levels
# Separate the features from the target variable, if you have one
features = data.drop(columns=['target'], errors='ignore')  # Drop target column if it exists

# Standardizing the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Applying PCA
pca = PCA(n_components=2)  # Reducing to 2 dimensions for visualization purposes, modify as needed
principal_components = pca.fit_transform(scaled_features)

# Creating a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Adding the target variable back to the dataframe, if it exists
if 'target' in data.columns:
    pca_df = pd.concat([pca_df, data['target'].reset_index(drop=True)], axis=1)

# Proceed to clustering
from sklearn.cluster import KMeans

# Defining the KMeans model (random_state makes the result reproducible)
kmeans = KMeans(n_clusters=3, random_state=42)  # Adjust number of clusters as needed
kmeans.fit(pca_df[['PC1', 'PC2']])

# Adding cluster labels to the PCA dataframe
pca_df['Cluster'] = kmeans.labels_

# Deriving actionable insights via cluster analysis
# You can group by clusters and analyze the means or other statistics of each original feature
cluster_insights = data.copy()
cluster_insights['Cluster'] = kmeans.labels_
cluster_summary = cluster_insights.groupby('Cluster').mean(numeric_only=True)

# Displaying the cluster summary for actionable insights
print(cluster_summary)

In summary, the code above performs four steps:

  • Feature Scaling: Standardize the features, which matters before PCA because PCA is sensitive to the scale of each feature.
  • PCA: Perform Principal Component Analysis to reduce the number of dimensions.
  • Clustering: Use KMeans clustering on the principal components to segment the customers.
  • Insights: Group data by clusters and compute summary statistics for actionable insights.

Make sure to adjust the number of principal components and clusters according to your specific project needs.
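One common way to choose the number of principal components is to look at the cumulative explained variance. The following is a minimal sketch under stated assumptions: the data is synthetic (six columns generated from two latent factors), and the 90% threshold is an arbitrary example value rather than a recommendation from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned features: 6 columns driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and inspect the cumulative explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that reaches the 90% threshold (threshold is an assumption)
n_keep = int(np.argmax(cumulative >= 0.90)) + 1

# scikit-learn can apply the same rule directly when n_components is a float in (0, 1)
pca_auto = PCA(n_components=0.90, svd_solver="full").fit(X_scaled)

print(np.round(cumulative, 3))
print("components kept:", n_keep, "| chosen by scikit-learn:", pca_auto.n_components_)
```

Two components remain convenient for plotting, but if they explain only a small share of the variance, clustering on more components may separate customers better than the 2-D picture suggests.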

Part 5: Introduction to Clustering Algorithms

Imports and Data Preparation

Ensure you have the essential libraries imported and data ready for clustering analysis. Below is a sample data preparation step assuming the data has been cleaned and transformed adequately as per previous parts.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the cleaned and pre-processed dataset
data = pd.read_csv('cleaned_data.csv')

# Select the features for clustering
features = ['purchasing_behavior', 'demographics', 'engagement_levels']

# Feature scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[features])

K-Means Clustering

Implement K-Means Algorithm

K-Means is one of the most commonly used clustering algorithms. Below is the implementation using the K-Means method from the scikit-learn library.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Define the number of clusters
num_clusters = 5

# Create KMeans model
kmeans = KMeans(n_clusters=num_clusters, random_state=42)

# Fit the model to the scaled features
kmeans.fit(scaled_features)

# Predict the clusters
clusters = kmeans.predict(scaled_features)

# Add cluster labels to the original dataset
data['Cluster'] = clusters

Optimal Number of Clusters: Elbow Method

Use the Elbow Method to estimate the optimal number of clusters: plot the within-cluster sum of squares (WCSS) against the number of clusters and look for the point where the curve stops dropping sharply.

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(scaled_features)
    wcss.append(kmeans.inertia_)

# Plot the Elbow graph
plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), wcss, marker='o')  # x = number of clusters, y = WCSS
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Hierarchical Clustering

Dendrogram for Hierarchical Clustering

Hierarchical clustering can work well for smaller datasets. We use scipy's dendrogram to determine the number of clusters.

import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
plt.title("Customer Dendrograms")
dend = shc.dendrogram(shc.linkage(scaled_features, method='ward'))

plt.axhline(y=6, color='r', linestyle='--')  # example cut height; choose it from your own dendrogram
plt.show()

Applying Hierarchical Clustering

Use AgglomerativeClustering from scikit-learn to assign each customer to a cluster.

from sklearn.cluster import AgglomerativeClustering

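# Create the model. Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the older `affinity` parameter was deprecated and later removed in newer scikit-learn versions)
hc = AgglomerativeClustering(n_clusters=num_clusters, linkage='ward')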

# Fit the model to the scaled features
labels = hc.fit_predict(scaled_features)

# Add cluster labels to the original dataset
data['Cluster_HC'] = labels
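Since both K-Means and hierarchical labels are now available, it can be useful to check how much the two segmentations agree. The sketch below is an illustration on synthetic data (make_blobs stands in for scaled_features, and the three centers are an assumption). It uses a cross-tabulation and the adjusted Rand index from scikit-learn.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for scaled_features (assumption): three separated groups
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Cross-tabulate the two labelings; label numbers are arbitrary, so look at the pattern
print(pd.crosstab(kmeans_labels, hc_labels, rownames=["KMeans"], colnames=["Ward"]))

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean chance-level agreement
ari = adjusted_rand_score(kmeans_labels, hc_labels)
print("adjusted Rand index:", round(ari, 3))
```

High agreement suggests the segments are a stable property of the data. Low agreement is a signal to revisit the number of clusters or the chosen features.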

Analyzing and Interpreting Cluster Results

Summarize the clusters to gain actionable insights.

# Display the first few rows of the dataset with cluster labels
print(data.head())

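# Calculate the mean of each numeric feature for each K-Means cluster
# (the hierarchical labels are dropped first because averaging label numbers is meaningless)
cluster_means = data.drop(columns=['Cluster_HC']).groupby('Cluster').mean(numeric_only=True)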
print(cluster_means)

# If needed, visualize the cluster distribution
import seaborn as sns

# Scatter plot for visualizing clusters (example with two dimensions)
sns.scatterplot(x='purchasing_behavior', y='engagement_levels', hue='Cluster', data=data, palette='viridis')
plt.title('Cluster Analysis')
plt.show()

This implementation segments customers into distinct groups based on their purchasing behavior, demographics, and engagement levels. The clustering insights can inform targeted marketing strategies or personalized customer experiences.

Customer Segmentation with K-Means

Below is the practical implementation of customer segmentation using K-Means clustering in Python. This assumes you have already completed data loading and cleaning, exploratory data analysis, handling missing data and outliers, and dimensionality reduction with PCA.

Step 1: Import Necessary Libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, pairwise_distances_argmin_min
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Standardize the Data

Ensure your dataset (let's call it data) is scaled since K-Means clustering is affected by the scale of the features.

# Assuming `data` is your DataFrame after PCA or other preprocessing
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
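To see what standardization does, here is a small sketch with two hypothetical features on very different scales. The feature meanings and values are made up for illustration. After scaling, both columns have mean 0 and unit variance, so neither dominates the Euclidean distances that K-Means relies on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: visits per month and yearly income
raw = np.array([
    [1.0, 20_000.0],
    [2.0, 40_000.0],
    [3.0, 60_000.0],
])

scaled = StandardScaler().fit_transform(raw)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.round(scaled, 4))
print("matches manual z-score:", np.allclose(scaled, manual))
```

If the input is already the output of PCA on standardized features, scaling again rescales each component to unit variance. That changes their relative weight in the clustering, so decide deliberately whether this is what you want.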

Step 3: Determine the Optimal Number of Clusters

Use the Elbow method and Silhouette score to determine the optimal number of clusters.

# Elbow method
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    sse.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), sse, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

# Silhouette score
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(scaled_data)
    silhouette_scores.append(silhouette_score(scaled_data, labels))

plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.title('Silhouette Scores')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()

Step 4: Apply K-Means Clustering

Choose the optimal number of clusters (let's say k=4 from the Elbow and Silhouette method results).

optimal_clusters = 4
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_data)

# Add cluster labels to the original data
data['Cluster'] = cluster_labels

Step 5: Analyze the Clusters

Derive insights from the clustered data.

# Visualizing the clusters
sns.pairplot(data, hue='Cluster', palette='viridis')
plt.show()

# Summary statistics of clusters (numeric columns only)
cluster_summary = data.groupby('Cluster').mean(numeric_only=True)
print(cluster_summary)

Step 6: Actionable Insights

Translate the clustering findings into actionable insights.

# Suppose we have demographic features like 'Age' and 'Annual Income'
cluster_insights = data.groupby('Cluster').agg({
    'Age': ['mean', 'median'],
    'Annual Income': ['mean', 'median'],
    # Include other relevant features
})
print(cluster_insights)
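Raw cluster means are easier to interpret when compared against the overall average. The sketch below is one possible way to do that, on a made-up table. The 'Age' and 'Annual Income' columns mirror the example names above, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical clustered data; 'Age' and 'Annual Income' mirror the example columns above
data = pd.DataFrame({
    "Age": [20, 30, 50, 60],
    "Annual Income": [20, 40, 80, 100],
    "Cluster": [0, 0, 1, 1],
})

feature_cols = ["Age", "Annual Income"]
cluster_means = data.groupby("Cluster")[feature_cols].mean()
overall_means = data[feature_cols].mean()

# Ratio > 1 means the cluster sits above the overall average for that feature
relative_profile = cluster_means / overall_means

# Cluster sizes help judge whether a segment is large enough to act on
cluster_sizes = data["Cluster"].value_counts().sort_index()

print(relative_profile.round(3))
print(cluster_sizes)
```

A ratio well above or below 1 points to the features that distinguish a segment, which is usually the starting point for describing it in business terms.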

Conclusion

By following these steps, you can segment customers based on their purchasing behavior, demographics, and engagement levels and derive actionable insights from the clustering analysis. Incorporate these insights into improving targeted marketing strategies, customer retention programs, and overall business decision-making.

Remember to investigate your clusters deeply to understand the specific characteristics and needs of each group.

Customer Segmentation with Hierarchical Clustering

Step 1: Import Necessary Libraries

import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Load and Prepare Data

Note: As mentioned, data loading, cleaning, and preprocessing are assumed to be completed in earlier sections of your project.

# Assume df is the DataFrame after preprocessing steps
df = pd.read_csv('processed_customer_data.csv')

# Selecting relevant features for clustering
# (Ward linkage uses Euclidean distances, so these columns should already be on comparable scales)
features = ['PurchaseAmount', 'Age', 'EngagementScore']
X = df[features].values

Step 3: Perform Hierarchical Clustering

# Computing the hierarchical clustering using Ward's method
Z = linkage(X, method='ward')

Step 4: Plot Dendrogram

plt.figure(figsize=(10, 7))
plt.title("Customer Dendrogram")
dendrogram(Z)
plt.xlabel('Customer')
plt.ylabel('Euclidean distances')
plt.show()

Step 5: Determine the Optimal Number of Clusters

# Determine the number of clusters by setting a distance threshold
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')

# Alternatively, specifying the number of clusters directly
k = 4
clusters_k = fcluster(Z, k, criterion='maxclust')

# Add cluster labels to the original DataFrame
df['Cluster'] = clusters_k
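The two fcluster criteria can be compared on a tiny example. The sketch below uses four made-up points that form two obvious pairs; the coordinates and the cut height of 5 are assumptions for illustration. Note that fcluster labels start at 1, whereas scikit-learn's KMeans labels start at 0.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four hypothetical customers forming two obvious pairs
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Z = linkage(X, method="ward")

# Cut by merge height: clusters that would only merge above distance 5 stay separate
by_distance = fcluster(Z, t=5, criterion="distance")

# Cut by count: ask for at most 2 clusters
by_count = fcluster(Z, t=2, criterion="maxclust")

print("merge heights:", np.round(Z[:, 2], 3))
print("by distance:", by_distance)
print("by count:   ", by_count)
```

The third column of the linkage matrix holds the merge heights shown on the dendrogram's vertical axis, so a large jump between consecutive heights is a natural place to set the distance threshold.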

Step 6: Analyze and Visualize the Segmentation

# Grouping data by clusters to interpret the results (numeric columns only)
cluster_summary = df.groupby('Cluster').mean(numeric_only=True)

# Visualizing the clusters
cluster_summary.plot(kind='bar', figsize=(10, 6))
plt.title("Cluster Summary")
plt.ylabel("Average Value")
plt.xlabel("Cluster")
plt.show()

# Visualizing clusters in a scatter plot of two features
sns.scatterplot(data=df, x='PurchaseAmount', y='Age', hue='Cluster', palette='Set2')
plt.title("Customer Segments")
plt.show()

Step 7: Derive Actionable Insights

# Assuming we output cluster summaries for business interpretation
print(cluster_summary)

# Example insights could be derived from mean values of each cluster
for cluster in cluster_summary.index:
    print(f"Cluster {cluster}:")
    print(cluster_summary.loc[cluster])
    print("\n")

Step 8: Save the Segmented Data

# Save the DataFrame with cluster labels
df.to_csv('customer_segments.csv', index=False)

This implementation clusters customers into segments based on their purchasing behavior, demographics, and engagement levels using Hierarchical Clustering, then visualizes and analyzes the resulting segments for actionable insights.

Interpreting and Visualizing Clusters

This section focuses on interpreting and visualizing clusters after applying clustering techniques such as K-Means or Hierarchical Clustering. This implementation will help derive actionable insights based on customer segmentation.

Step 1: Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

Step 2: Fit the Clustering Algorithm

If the clustering model (e.g., K-Means) has not been fitted yet, fit it and assign the labels:

# Assuming `data` is your preprocessed, fully numeric dataframe
kmeans = KMeans(n_clusters=5, random_state=42)
data['Cluster'] = kmeans.fit_predict(data)

Alternatively, if Hierarchical Clustering was used, you can fit and predict similarly using the appropriate sklearn object.

Step 3: Interpret Clusters - Profiling

# Calculate mean values of features for each cluster
cluster_profile = data.groupby('Cluster').mean()

# Display the cluster profile
print(cluster_profile)

Step 4: Visualize Clusters

Using PCA for Dimensionality Reduction

pca = PCA(n_components=2)
data_pca = pca.fit_transform(data.drop('Cluster', axis=1))

# Create a DataFrame for visualization
pca_df = pd.DataFrame(data_pca, columns=['PCA1', 'PCA2'])
pca_df['Cluster'] = data['Cluster'].values  # .values avoids index misalignment if rows were dropped earlier

# Visualize using Seaborn
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', data=pca_df, palette='viridis')
plt.title('Customer Segments Visualization with PCA')
plt.show()

Visualizing Mean Feature Values for Each Cluster

# Melt the dataframe for better visualization
cluster_profile_melted = cluster_profile.reset_index().melt(id_vars='Cluster', var_name='Feature', value_name='Mean')

plt.figure(figsize=(15, 8))
sns.barplot(data=cluster_profile_melted, x='Feature', y='Mean', hue='Cluster', palette='viridis')
plt.title('Mean Feature Values Across Clusters')
plt.xticks(rotation=45)
plt.show()

Visualizing Clusters Using Pairplot

# This can be computationally expensive with many features
# Using `hue` to visualize clusters in feature pairs

sns.pairplot(data, hue='Cluster', palette='viridis')
plt.suptitle('Cluster Pairplot', y=1.02)
plt.show()

By following these steps, you can effectively interpret and visualize the clusters of your customer segmentation analysis, allowing you to derive actionable insights regarding purchasing behavior, demographics, and engagement levels.