
Comprehensive Guide to Data Preprocessing with Scikit-learn

This guide gives a step-by-step implementation of data preprocessing with Scikit-learn in Python, from data cleaning to feature scaling, for anyone preparing data for machine learning models.

Each step focuses on one aspect of data transformation: importing data, handling missing values, encoding categorical variables, and normalizing numerical features. Later steps cover splitting the data, removing outliers, engineering features, building pipelines, and saving the results, so that by the end you can clean and prepare a dataset for model training.

Step 1: Importing Necessary Libraries and Data

# Importing essential libraries for data preprocessing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Load your dataset
# Replace 'your_dataset.csv' with the path to your dataset file
data = pd.read_csv('your_dataset.csv')

# Display the first few rows of the dataset
print(data.head())


  1. Ensure you have pandas and scikit-learn installed in your Python environment. If not, install them using:

    pip install pandas scikit-learn
  2. Replace 'your_dataset.csv' with the actual path to your dataset file.

This script imports the necessary libraries and reads a CSV file into a pandas DataFrame, displaying the first few rows of the data. This will set up the initial environment for your data preprocessing tasks.
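
For readers who want to try the loading step without a file on disk, the following is a minimal sketch that reads a small CSV from an in-memory string. The column names and values are invented for illustration and are not part of the original dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """age,salary,dept
25,50000,sales
32,,engineering
47,80000,hr
"""

# pd.read_csv accepts any file-like object, not only a file path
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first rows of the DataFrame
print(data.shape)    # (rows, columns)
print(data.dtypes)   # inferred type of each column
```

The empty salary field in the second row is read as NaN, which is the kind of missing value the later steps deal with.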

Exploratory Data Analysis (EDA)

1. Analyzing Basic Statistical Properties

# Assuming the dataset is loaded into a DataFrame named 'df' (for example, df = data from Step 1)
print(df.describe())  # Summary statistics for numerical columns

2. Checking for Missing Values

print(df.isnull().sum())  # Count of missing values in each column

3. Data Type Information

df.info()  # Prints data types and non-null counts

4. Distribution of Numerical Features

import matplotlib.pyplot as plt
import seaborn as sns

# include='number' selects every numeric dtype, not only float64 and int64
numeric_columns = df.select_dtypes(include='number').columns

for col in numeric_columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

5. Distribution of Categorical Features

categorical_columns = df.select_dtypes(include=['object']).columns

for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=df[col])
    plt.title(f'Count plot of {col}')
    plt.show()

6. Correlation Matrix

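plt.figure(figsize=(12, 8))
# numeric_only=True skips non-numeric columns, which otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()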

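7. Pairplot for Numerical Data

# Scatter plots for every pair of numerical columns, with histograms on the diagonal
sns.pairplot(df[numeric_columns])
plt.show()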


8. Boxplots for Outlier Detection

for col in numeric_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

9. Adding New Features (if any)

df['NewFeature'] = df['ExistingFeature1'] * df['ExistingFeature2']  # Example of feature generation

10. Analyzing Relationship Between Features and Target

target_column = 'target'

# Numerical features against the target
for col in numeric_columns:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=df[col], y=df[target_column])
    plt.title(f'Relationship between {col} and {target_column}')
    plt.show()

# Categorical features against the target
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[col], y=df[target_column])
    plt.title(f'Relationship between {col} and {target_column}')
    plt.show()

This concludes the EDA process, which helps you understand the data before moving on to the preprocessing and feature scaling that machine learning models need.
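
The plots above are easier to interpret alongside a compact text summary. The sketch below, run on a small made-up DataFrame, collects the dtype, missing-value count, and number of distinct values per column into one table. The helper name summarize_columns is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with dtype, missing count, and unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isnull().sum(),
        'pct_missing': (df.isnull().mean() * 100).round(1),
        'n_unique': df.nunique(),  # missing values are not counted as distinct
    })


# Made-up data for illustration only
df = pd.DataFrame({
    'age': [25, 32, np.nan, 51],
    'dept': ['sales', 'engineering', 'engineering', None],
})

summary = summarize_columns(df)
print(summary)
```

Columns with a high pct_missing or a single unique value are usually the first candidates for imputation or removal.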

# Handling Missing Values

# Importing necessary library
from sklearn.impute import SimpleImputer

# Assuming `df` is your DataFrame
# Display missing values per column
print(df.isnull().sum())

# Strategy 1: Remove rows with missing values
df_cleaned = df.dropna()

# Strategy 2: Fill missing values with mean (for numerical data)
num_imputer = SimpleImputer(strategy='mean')
df[['numerical_column']] = num_imputer.fit_transform(df[['numerical_column']])

# Strategy 3: Fill missing values with most frequent value (for categorical data)
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['categorical_column']] = cat_imputer.fit_transform(df[['categorical_column']])

# Alternative Strategy: Fill missing values with a fixed value
fixed_value_imputer = SimpleImputer(strategy='constant', fill_value='missing_value')
df[['another_column']] = fixed_value_imputer.fit_transform(df[['another_column']])
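
The strategies above fit the imputer on the whole DataFrame. When the data will later be split, a common precaution is to fit the imputer on the training rows only and reuse the learned statistic on the test rows, so that no information from the test set leaks into preprocessing. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test frames, each with one missing value
train = pd.DataFrame({'numerical_column': [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({'numerical_column': [np.nan, 10.0]})

num_imputer = SimpleImputer(strategy='mean')

# fit_transform learns the mean from the training data (mean of 1, 2, 3 is 2.0)
train_filled = num_imputer.fit_transform(train)

# transform reuses the training mean; the test value 10.0 does not influence it
test_filled = num_imputer.transform(test)

print(num_imputer.statistics_)  # learned fill value per column
print(test_filled)
```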

# Encoding Categorical Variables

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Assuming df is your DataFrame and 'Category' is the categorical column you want to encode
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})

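# Instantiate the OneHotEncoder
# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions use sparse=False)
# drop='first' drops one category per feature to avoid multicollinearity
encoder = OneHotEncoder(sparse_output=False, drop='first')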

# Fit and transform the categorical variable
encoded_categories = encoder.fit_transform(df[['Category']])

# Create a DataFrame from the encoded categories
encoded_df = pd.DataFrame(encoded_categories, columns=encoder.get_feature_names_out(['Category']))

# Concatenate the original DataFrame (without the original categorical column) with the encoded DataFrame
df = pd.concat([df.drop(columns=['Category']), encoded_df], axis=1)

# Display the encoded DataFrame
print(df)

This code assumes the pandas and scikit-learn libraries are already installed and a DataFrame named df is available. The OneHotEncoder transforms the 'Category' column into a one-hot encoded format, and the original DataFrame is updated to include the new encoded columns.

# Feature Scaling and Normalization

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Assuming 'data' is a pandas DataFrame already loaded and preprocessed
# Use the columns of your choice to scale, for example, ['feature1', 'feature2']

# Separate the features to be scaled
features_to_scale = ['feature1', 'feature2']
features = data[features_to_scale]

# Option 1: Standard Scaling (Z-score Normalization)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Convert the scaled features back to a DataFrame, keeping the original index
# so the rows stay aligned with 'data'
scaled_features_df = pd.DataFrame(scaled_features, columns=features_to_scale, index=data.index)

# Option 2: Min-Max Normalization (rescales each feature to the range [0, 1])
min_max_scaler = MinMaxScaler()
normalized_features = min_max_scaler.fit_transform(features)

# Convert the normalized features back to a DataFrame, keeping the original index
normalized_features_df = pd.DataFrame(normalized_features, columns=features_to_scale, index=data.index)

# Replace the original features with ONE of the two results.
# Assigning both in sequence would simply overwrite the first with the second.
data[features_to_scale] = scaled_features_df
# data[features_to_scale] = normalized_features_df

# Now 'data' holds the scaled version of the selected features
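
To make the two transformations concrete: StandardScaler computes (x - mean) / std per column, using the population standard deviation, and MinMaxScaler computes (x - min) / (max - min). The sketch below checks both against a manual NumPy calculation on made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one made-up feature

standardized = StandardScaler().fit_transform(X)
normalized = MinMaxScaler().fit_transform(X)

# Manual equivalents; np.std defaults to the population standard deviation (ddof=0)
manual_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
manual_normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(standardized, manual_standardized))
print(np.allclose(normalized, manual_normalized))
```

Standardization suits models that assume roughly centered inputs, while min-max scaling is a common choice when a bounded range is required.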

Splitting Data into Training and Testing Sets


Assumptions:

  • You have already imported the necessary libraries.
  • Your data is stored in a DataFrame df, and the label column is target.


from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop(columns=['target'])
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Optional: Print shapes to verify the splits
print(f"Training data shape: {X_train.shape}, Training labels shape: {y_train.shape}")
print(f"Testing data shape: {X_test.shape}, Testing labels shape: {y_test.shape}")


  • train_test_split: Splits data into training and testing sets.
  • test_size=0.2: 20% of the data is used for testing.
  • random_state=42: Ensures reproducibility.
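
For classification targets with imbalanced classes, a plain random split can leave the test set with too few minority examples. Below is a sketch of the stratify option of train_test_split, which keeps class proportions the same in both parts. The data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 80 rows of class 0 and 20 rows of class 1
df = pd.DataFrame({
    'feature1': range(100),
    'target': [0] * 80 + [1] * 20,
})

X = df.drop(columns=['target'])
y = df['target']

# stratify=y preserves the 80/20 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```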

# Outlier Detection and Removal

from sklearn.ensemble import IsolationForest
import numpy as np
import pandas as pd

# Assuming 'df' is your DataFrame and 'features_to_check' is a list of columns you want to check for outliers
features_to_check = ['feature1', 'feature2', 'feature3']

# Applying Isolation Forest for outlier detection
# contamination is the expected share of outliers; random_state makes the result reproducible
iso = IsolationForest(contamination=0.05, random_state=42)  # Adjust contamination as required
pred = iso.fit_predict(df[features_to_check])  # -1 marks outliers, 1 marks inliers

# Removing outliers with a boolean mask (no temporary column needed);
# .copy() avoids pandas warnings when df is modified later
df = df[pred != -1].copy()

# df now contains the data with outliers removed
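
To see what the detector does, it helps to run it on data where the outliers are known in advance. The sketch below builds a synthetic two-feature dataset with five points placed far from the rest and checks which rows are flagged. All values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 95 made-up inliers around the origin plus 5 points placed far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
extremes = np.array(
    [[15, 15], [-14, 13], [16, -15], [-15, -16], [20, 1]], dtype=float
)
df = pd.DataFrame(np.vstack([inliers, extremes]), columns=['feature1', 'feature2'])

iso = IsolationForest(contamination=0.05, random_state=42)
pred = iso.fit_predict(df[['feature1', 'feature2']])  # -1 = outlier, 1 = inlier

df_clean = df[pred != -1].copy()
print(f"Rows before: {len(df)}, rows after: {len(df_clean)}")
print(df[pred == -1])  # the rows that were flagged
```

Because contamination fixes the share of rows to flag, the detector removes about 5% of the data even when no real outliers exist, so the value is worth choosing with the EDA boxplots in mind.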

Feature Engineering

This section focuses on creating new features and transforming existing ones to improve a model's predictive performance.

1. Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Assuming X_train and X_test are already defined from previous steps
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

2. Interaction Features

# Interaction features create products of pairs of features
from sklearn.preprocessing import PolynomialFeatures

interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_interaction = interaction.fit_transform(X_train)
X_test_interaction = interaction.transform(X_test)

3. Binning

import numpy as np
import pandas as pd

# Binning numerical feature into categories
bins = [0, 10, 20, 30, 40, 50, 60]
labels = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60']
X_train['binned_feature'] = pd.cut(X_train['numerical_column'], bins=bins, labels=labels)
X_test['binned_feature'] = pd.cut(X_test['numerical_column'], bins=bins, labels=labels)
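
One detail of pd.cut that is easy to miss: by default each interval is open on the left and closed on the right, and values outside the bin edges become NaN. The sketch below probes the edges with made-up values and shows the effect of include_lowest=True.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60]
labels = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60']

values = pd.Series([0, 10, 15, 60, 70])  # made-up values probing the bin edges

# Default behaviour: intervals are (left, right], so 0 and 70 fall outside every bin
default_bins = pd.cut(values, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is kept
with_lowest = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

print(default_bins.tolist())
print(with_lowest.tolist())
```

If the test set can contain values beyond the training range, the resulting NaN values need to be handled before modeling, for example by widening the outer bin edges.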

4. Log Transformation

# Apply log transformation to reduce skewness
X_train['log_feature'] = np.log1p(X_train['skewed_feature'])
X_test['log_feature'] = np.log1p(X_test['skewed_feature'])
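
The effect of the transformation can be checked numerically. The sketch below applies np.log1p to synthetic right-skewed data and compares the skewness before and after. It also shows np.expm1, the inverse of np.log1p. The data is generated for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up right-skewed, non-negative data
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
logged = np.log1p(skewed)  # log(1 + x), defined at x = 0

print(f"Skewness before: {skewed.skew():.2f}")
print(f"Skewness after:  {logged.skew():.2f}")

# np.expm1 inverts np.log1p, which is useful for mapping values back
restored = np.expm1(logged)
print(np.allclose(restored, skewed))
```

np.log1p requires values greater than -1, so features with larger negative values need a shift or a different transformation.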

5. Date Features Extraction

# Extracting date features from a datetime column
# (the column must have a datetime dtype; convert with pd.to_datetime if needed)
X_train['year'] = X_train['datetime_column'].dt.year
X_train['month'] = X_train['datetime_column'].dt.month
X_train['day'] = X_train['datetime_column'].dt.day
X_train['dayofweek'] = X_train['datetime_column'].dt.dayofweek

X_test['year'] = X_test['datetime_column'].dt.year
X_test['month'] = X_test['datetime_column'].dt.month
X_test['day'] = X_test['datetime_column'].dt.day
X_test['dayofweek'] = X_test['datetime_column'].dt.dayofweek
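
The .dt accessor only works on datetime columns, while dates read from a CSV usually arrive as strings. A minimal sketch of the conversion step followed by the same extraction, using two made-up dates:

```python
import pandas as pd

# Made-up data where dates arrive as strings, as they often do after read_csv
X_train = pd.DataFrame({'datetime_column': ['2024-01-15', '2024-03-09']})

# Convert to a datetime dtype so that the .dt accessor becomes available
X_train['datetime_column'] = pd.to_datetime(X_train['datetime_column'])

X_train['year'] = X_train['datetime_column'].dt.year
X_train['month'] = X_train['datetime_column'].dt.month
X_train['day'] = X_train['datetime_column'].dt.day
X_train['dayofweek'] = X_train['datetime_column'].dt.dayofweek  # Monday=0, Sunday=6

print(X_train)
```

After extraction, the original datetime column is typically dropped, since most scikit-learn estimators cannot consume it directly.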

6. Target Encoding

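# TargetEncoder here comes from the third-party category_encoders package
# (install with: pip install category_encoders)
from category_encoders import TargetEncoder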

# Encoding categorical variables with the mean of the target variable
target_encoder = TargetEncoder()
X_train['target_encoded_feature'] = target_encoder.fit_transform(X_train['categorical_column'], y_train)
X_test['target_encoded_feature'] = target_encoder.transform(X_test['categorical_column'])
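
To show the idea behind target encoding without the extra dependency, here is a simplified pandas-only sketch. It replaces each category with the plain mean of the target in the training data and is not equivalent to the library encoder above, since library implementations typically add smoothing. The data is made up.

```python
import pandas as pd

# Made-up training data and labels
X_train = pd.DataFrame({'categorical_column': ['a', 'a', 'b', 'b', 'b']})
y_train = pd.Series([1, 0, 1, 1, 1])
X_test = pd.DataFrame({'categorical_column': ['a', 'b', 'c']})  # 'c' is unseen

# Mean of the target per category, learned from the training data only
category_means = y_train.groupby(X_train['categorical_column']).mean()
global_mean = y_train.mean()

X_train['target_encoded_feature'] = X_train['categorical_column'].map(category_means)

# Unseen categories have no learned mean, so fall back to the global mean
X_test['target_encoded_feature'] = (
    X_test['categorical_column'].map(category_means).fillna(global_mean)
)

print(category_means.to_dict())
print(X_test)
```

Because the encoding uses the target, fitting it on the training split only, as done here, is what keeps test labels from leaking into the features.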

7. Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif

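# Select best features based on ANOVA F-value between label/feature
# f_classif needs numeric input without missing values, so encode or drop
# non-numeric columns (such as 'binned_feature') first
selector = SelectKBest(score_func=f_classif, k=10)  # k should not exceed the number of features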
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
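
After selection, the result is a plain array, so it is useful to recover which columns were kept. The sketch below does this with get_support on synthetic data from make_classification. The column names f0 to f5 are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 6 numeric features, of which 2 carry signal
X_arr, y = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# get_support returns a boolean mask over the input columns
selected_columns = X.columns[selector.get_support()].tolist()
print(selected_columns)
print(selector.scores_.round(2))  # F-value per input column
```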

Ensure that each transformation is relevant and adds value to your machine learning model.

Part 9: Data Transformation and Pipelines

Step 1: Import Libraries

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler

Step 2: Define Data Transformation and Pipelines

Sample Data Preparation

# Assume `X` is the feature set and `y` is the target variable
# Example data (X)
# Note: Replace this with the actual data
import pandas as pd

data = {
    'age': [25, 32, 47, 51],
    'salary': [50000, 60000, 80000, 90000],
    'gender': ['male', 'female', 'female', 'male'],
    'dept': ['sales', 'engineering', 'engineering', 'hr']
}

X = pd.DataFrame(data)
y = [0, 1, 1, 0]  # Example target variable

ColumnTransformer Setup

# Define which columns to apply transformations to
numeric_features = ['age', 'salary']
categorical_features = ['gender', 'dept']

# Construct column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

Pipeline Setup

from sklearn.linear_model import LogisticRegression

# Construct pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

Step 3: Apply Pipeline to Data

# Fit pipeline to the data
pipeline.fit(X, y)

# Predict using the pipeline
predictions = pipeline.predict(X)

Step 4: Evaluating the Pipeline

from sklearn.metrics import accuracy_score

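# Making predictions and evaluating
# (this scores the same data the pipeline was fitted on; use a held-out test set for a realistic estimate)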
y_pred = pipeline.predict(X)
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")
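
The accuracy above is measured on the same four rows the pipeline was fitted on, so it says little about unseen data. Below is a sketch of the same pipeline evaluated on a held-out split. The dataset is synthetic and generated only for illustration, and handle_unknown='ignore' is an added safeguard that is not in the original setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic data for illustration; the target depends loosely on age
X = pd.DataFrame({
    'age': rng.integers(20, 60, size=n),
    'salary': rng.integers(40000, 100000, size=n),
    'gender': rng.choice(['male', 'female'], size=n),
    'dept': rng.choice(['sales', 'engineering', 'hr'], size=n),
})
y = (X['age'] + rng.normal(0, 5, size=n) > 40).astype(int)

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'salary']),
    # handle_unknown='ignore' guards against categories missing from the training split
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['gender', 'dept']),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The scaler and encoder are fitted on X_train only, inside pipeline.fit
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Keeping the preprocessing inside the pipeline means the test rows never influence the scaling statistics or the encoder's category list.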

Full Implementable Code

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

# Sample data
data = {
    'age': [25, 32, 47, 51],
    'salary': [50000, 60000, 80000, 90000],
    'gender': ['male', 'female', 'female', 'male'],
    'dept': ['sales', 'engineering', 'engineering', 'hr']
}

X = pd.DataFrame(data)
y = [0, 1, 1, 0]

# Define column sets
numeric_features = ['age', 'salary']
categorical_features = ['gender', 'dept']

# Define data transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

# Construct pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Fit and predict
pipeline.fit(X, y)
y_pred = pipeline.predict(X)

# Evaluate
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")

# Step 10: Saving and Loading Preprocessed Data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib

# Assuming data has been preprocessed, split into X (features) and y (target)
# Using example data for completeness of this implementation
X, y = pd.DataFrame({'feature1': [0, 1, 2], 'feature2': [3, 4, 5]}), pd.Series([0, 1, 0])

# Splitting the data again for completeness of implementation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Example of feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Saving the preprocessed data
joblib.dump(X_train_scaled, 'X_train_scaled.pkl')
joblib.dump(X_test_scaled, 'X_test_scaled.pkl')
joblib.dump(y_train, 'y_train.pkl')
joblib.dump(y_test, 'y_test.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Loading the preprocessed data
X_train_loaded = joblib.load('X_train_scaled.pkl')
X_test_loaded = joblib.load('X_test_scaled.pkl')
y_train_loaded = joblib.load('y_train.pkl')
y_test_loaded = joblib.load('y_test.pkl')
scaler_loaded = joblib.load('scaler.pkl')

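# Verifying the loaded data
print((X_train_loaded == X_train_scaled).all())  # True if the round trip preserved the values
print(X_train_loaded.shape, X_test_loaded.shape)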
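
The main reason to save the fitted scaler, and not only the scaled arrays, is to apply the identical transformation to data that arrives later. A minimal sketch of that round trip follows. It writes to a temporary directory so no files are left behind, and the values are made up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 3.0], [1.0, 4.0], [2.0, 5.0]])  # made-up training data
scaler = StandardScaler().fit(X_train)

# Save the fitted scaler and load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, 'scaler.pkl')
    joblib.dump(scaler, scaler_path)
    scaler_loaded = joblib.load(scaler_path)

# New, unseen rows are transformed with the statistics learned from X_train
X_new = np.array([[1.0, 4.0], [3.0, 6.0]])
X_new_scaled = scaler_loaded.transform(X_new)

print(X_new_scaled)
print(np.allclose(X_new_scaled, scaler.transform(X_new)))
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.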