Project
Comprehensive Guide to Data Preprocessing with Scikit-learn
This project provides a step-by-step implementation guide for data preprocessing with Scikit-learn in Python. It covers data cleaning, missing-value handling, categorical encoding, feature scaling, and pipelines, the steps typically needed to prepare data for machine learning models.
Description
In this project, you will learn practical data preprocessing techniques using the Scikit-learn library. Each step focuses on a specific aspect of data transformation, from importing data to handling missing values, encoding categorical variables, and normalizing numerical features. By following this guide, you will be equipped with the necessary skills to clean and prepare your data efficiently, setting a solid foundation for any machine learning model.
Step 1: Importing Necessary Libraries and Data
# Importing essential libraries for data preprocessing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Load your dataset
# Replace 'your_dataset.csv' with the path to your dataset file
data = pd.read_csv('your_dataset.csv')
# Display the first few rows of the dataset
print(data.head())
Instructions
Ensure you have pandas and scikit-learn installed in your Python environment. If not, install them using: pip install pandas scikit-learn
Replace 'your_dataset.csv' with the actual path to your dataset file.
This script imports the necessary libraries and reads a CSV file into a pandas DataFrame, displaying the first few rows of the data. This will set up the initial environment for your data preprocessing tasks.
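If no CSV file is at hand, the loading step can be tried out on an in-memory stand-in. The sketch below is only an illustration: the CSV text and its column names (age, salary, dept, target) are hypothetical, not part of the guide's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """age,salary,dept,target
25,50000,sales,0
32,,engineering,1
47,80000,,1
51,90000,hr,0
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))
df = data  # the later snippets refer to the same DataFrame as 'df'

print(data.shape)   # number of rows and columns
print(data.dtypes)  # inferred column types
print(data.head())
```

Empty fields in the text are read as missing values (NaN), which makes this small frame usable for trying the missing-value steps later on.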
Exploratory Data Analysis (EDA)
1. Analyzing Basic Statistical Properties
# The DataFrame loaded in Step 1 as 'data' is referred to as 'df' in the snippets below
df = data
print(df.describe()) # Summary statistics for numerical columns
2. Checking for Missing Values
print(df.isnull().sum()) # Count of missing values in each column
3. Data Type Information
df.info() # Prints data types and non-null counts (info() prints directly and returns None)
4. Distribution of Numerical Features
import matplotlib.pyplot as plt
import seaborn as sns
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()
5. Distribution of Categorical Features
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=df[col])
    plt.title(f'Count plot of {col}')
    plt.xticks(rotation=45)
    plt.show()
6. Correlation Matrix
plt.figure(figsize=(12, 8))
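# Restrict to numeric columns: corr() raises an error on text columns in recent pandas versions
sns.heatmap(df[numeric_columns].corr(), annot=True, cmap='coolwarm')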
plt.title('Correlation Matrix')
plt.show()
7. Pairplot for Numerical Data
sns.pairplot(df[numeric_columns])
plt.show()
8. Boxplots for Outlier Detection
for col in numeric_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()
9. Adding New Features (if any)
df['NewFeature'] = df['ExistingFeature1'] * df['ExistingFeature2'] # Example of feature generation
10. Analyzing Relationship Between Features and Target
target_column = 'target'
for col in numeric_columns:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=df[col], y=df[target_column])
    plt.title(f'Relationship between {col} and {target_column}')
    plt.show()
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[col], y=df[target_column])
    plt.title(f'Relationship between {col} and {target_column}')
    plt.xticks(rotation=45)
    plt.show()
This concludes the EDA. With a clearer picture of the data, the next steps cover preprocessing and the feature scaling that many machine learning models require.
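The separate checks for data types, missing values, and distinct values can also be gathered into one table. This is a minimal sketch on a hypothetical two-column frame; the column names and the layout of the summary table are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    'age': [25, 32, np.nan, 51],
    'dept': ['sales', 'hr', 'hr', None],
})

# One row per column: type, missing count, missing share, distinct values
summary = pd.DataFrame({
    'dtype': df.dtypes.astype(str),
    'missing': df.isnull().sum(),
    'missing_pct': (df.isnull().mean() * 100).round(1),
    'n_unique': df.nunique(),  # nunique() ignores missing values by default
})
print(summary)
```

A table like this makes it easier to decide, column by column, between dropping rows and imputing values.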
# Handling Missing Values
# Importing necessary library
from sklearn.impute import SimpleImputer
# Assuming `df` is your DataFrame
# Display missing values per column
print(df.isnull().sum())
# Strategy 1: Remove rows with missing values
df_cleaned = df.dropna()
# Strategy 2: Fill missing values with mean (for numerical data)
num_imputer = SimpleImputer(strategy='mean')
df[['numerical_column']] = num_imputer.fit_transform(df[['numerical_column']])
# Strategy 3: Fill missing values with most frequent value (for categorical data)
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['categorical_column']] = cat_imputer.fit_transform(df[['categorical_column']])
# Alternative Strategy: Fill missing values with a fixed value
fixed_value_imputer = SimpleImputer(strategy='constant', fill_value='missing_value')
df[['another_column']] = fixed_value_imputer.fit_transform(df[['another_column']])
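When the data is later split into training and testing sets, one common practice is to learn the imputation statistics from the training portion only and reuse them on the test portion. The sketch below assumes two small hypothetical frames, `train` and `test`, in place of a real split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test portions of the same column
train = pd.DataFrame({'numerical_column': [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({'numerical_column': [np.nan, 10.0]})

num_imputer = SimpleImputer(strategy='mean')
# fit_transform learns the mean from the training rows (missing values are ignored)
train_filled = num_imputer.fit_transform(train)
# transform reuses the training mean; the test rows do not influence it
test_filled = num_imputer.transform(test)

print(num_imputer.statistics_)  # the learned fill value per column
print(test_filled)
```

Filling the test set with its own mean would let information from the test data leak into preprocessing, which is why `transform` rather than `fit_transform` is used on it here.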
# Encoding Categorical Variables
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Assuming df is your DataFrame and 'Category' is the categorical column you want to encode
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})
# Instantiate the OneHotEncoder
# sparse_output replaced the older 'sparse' argument in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False, drop='first') # drop='first' to avoid multicollinearity
# Fit and transform the categorical variable
encoded_categories = encoder.fit_transform(df[['Category']])
# Create a DataFrame from the encoded categories
# Reuse the original index so the rows line up when concatenating
encoded_df = pd.DataFrame(encoded_categories, columns=encoder.get_feature_names_out(['Category']), index=df.index)
# Concatenate the original DataFrame (without the original categorical column) with the encoded DataFrame
df = pd.concat([df.drop(columns=['Category']), encoded_df], axis=1)
# Display the encoded DataFrame
print(df)
This code assumes the pandas and scikit-learn libraries are already installed and a DataFrame named df is available. The OneHotEncoder transforms the 'Category' column into a one-hot encoded format, and the original DataFrame is updated to include the new encoded columns.
# Feature Scaling and Normalization
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Assuming 'data' is a pandas DataFrame already loaded and preprocessed
# Use the columns of your choice to scale, for example, ['feature1', 'feature2']
# Separate the features to be scaled
features_to_scale = ['feature1', 'feature2']
features = data[features_to_scale]
# Standard Scaling (Z-score Normalization)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Convert the scaled features back to a DataFrame, keeping the original index
scaled_features_df = pd.DataFrame(scaled_features, columns=features_to_scale, index=data.index)
# Min-Max Normalization (an alternative to standard scaling, fitted on the original values)
min_max_scaler = MinMaxScaler()
normalized_features = min_max_scaler.fit_transform(features)
# Convert the normalized features back to a DataFrame, keeping the original index
normalized_features_df = pd.DataFrame(normalized_features, columns=features_to_scale, index=data.index)
# Choose ONE method and replace the original features with its result;
# assigning both in turn would simply overwrite the first with the second
data[features_to_scale] = scaled_features_df # or: normalized_features_df
# Now the selected columns of 'data' are scaled
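To make the two methods concrete, the sketch below compares each scaler with the formula it applies: standard scaling computes (x - mean) / std, and min-max scaling computes (x - min) / (max - min). The single-column array `x` is a hypothetical example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with four samples
x = np.array([[1.0], [2.0], [3.0], [4.0]])

z = StandardScaler().fit_transform(x)
m = MinMaxScaler().fit_transform(x)

# The same results computed by hand
# (StandardScaler uses the population standard deviation, ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
m_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(np.allclose(z, z_manual))
print(np.allclose(m, m_manual))
```

After standard scaling the column has mean 0 and unit variance; after min-max scaling it spans the range 0 to 1.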
Splitting Data into Training and Testing Sets
Assumptions
- You have already imported the necessary libraries.
- Your data is stored in a DataFrame df, and the label column is target.
Implementation
from sklearn.model_selection import train_test_split
# Separate features and target
X = df.drop(columns=['target'])
y = df['target']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Optional: Print shapes to verify the splits
print(f"Training data shape: {X_train.shape}, Training labels shape: {y_train.shape}")
print(f"Testing data shape: {X_test.shape}, Testing labels shape: {y_test.shape}")
Summary
- train_test_split: Splits data into training and testing sets.
- test_size=0.2: 20% of the data is used for testing.
- random_state=42: Ensures reproducibility.
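For classification tasks with imbalanced labels, `train_test_split` also accepts a `stratify` argument that keeps the class proportions similar in both parts. The sketch below uses hypothetical data with 16 samples of class 0 and 4 of class 1.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80% class 0, 20% class 1
X = pd.DataFrame({'feature1': range(20)})
y = pd.Series([0] * 16 + [1] * 4)

# stratify=y preserves the class ratio in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Without stratification, a small test set could by chance contain no samples of the minority class at all.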
# Outlier Detection and Removal
from sklearn.ensemble import IsolationForest
import numpy as np
import pandas as pd
# Assuming 'df' is your DataFrame and 'features_to_check' is a list of columns you want to check for outliers
features_to_check = ['feature1', 'feature2', 'feature3']
# Applying Isolation Forest for outlier detection
iso = IsolationForest(contamination=0.05, random_state=42) # Adjust contamination as required; random_state makes the result reproducible
pred = iso.fit_predict(df[features_to_check])
# Removing outliers (fit_predict returns -1 for outliers and 1 for inliers)
# .copy() avoids pandas warnings when the filtered frame is modified later
df = df[pred != -1].copy()
# df now contains the data with outliers removed
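A simpler, single-column alternative to Isolation Forest is the interquartile range (IQR) rule, the same rule that boxplots use to draw their whiskers. This is a different technique from the one above, sketched here on a hypothetical column with one obvious outlier.

```python
import pandas as pd

# Hypothetical feature with one extreme value
df = pd.DataFrame({'feature1': [10, 12, 11, 13, 12, 11, 100]})

q1 = df['feature1'].quantile(0.25)
q3 = df['feature1'].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

mask = df['feature1'].between(lower, upper)
df_no_outliers = df[mask].copy()

print(f"Bounds: [{lower}, {upper}]")
print(df_no_outliers)
```

The factor 1.5 is a convention rather than a fixed rule, and the check treats each column independently, whereas Isolation Forest considers the selected features jointly.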
Feature Engineering
This section focuses on creating new features and transforming existing features to enhance the models' predictive performance.
1. Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
# Assuming X_train and X_test are already defined from previous steps
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
2. Interaction Features
# Interaction features create products of pairs of features
from sklearn.preprocessing import PolynomialFeatures
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_interaction = interaction.fit_transform(X_train)
X_test_interaction = interaction.transform(X_test)
3. Binning
import numpy as np
import pandas as pd
# Binning numerical feature into categories
# Intervals are right-inclusive: (0, 10], (10, 20], ...; include_lowest=True also keeps the value 0 in the first bin
bins = [0, 10, 20, 30, 40, 50, 60]
labels = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60']
X_train['binned_feature'] = pd.cut(X_train['numerical_column'], bins=bins, labels=labels, include_lowest=True)
X_test['binned_feature'] = pd.cut(X_test['numerical_column'], bins=bins, labels=labels, include_lowest=True)
4. Log Transformation
# Apply log transformation to reduce skewness
X_train['log_feature'] = np.log1p(X_train['skewed_feature'])
X_test['log_feature'] = np.log1p(X_test['skewed_feature'])
5. Date Features Extraction
# Extracting date features from a datetime column
X_train['year'] = X_train['datetime_column'].dt.year
X_train['month'] = X_train['datetime_column'].dt.month
X_train['day'] = X_train['datetime_column'].dt.day
X_train['dayofweek'] = X_train['datetime_column'].dt.dayofweek
X_test['year'] = X_test['datetime_column'].dt.year
X_test['month'] = X_test['datetime_column'].dt.month
X_test['day'] = X_test['datetime_column'].dt.day
X_test['dayofweek'] = X_test['datetime_column'].dt.dayofweek
6. Target Encoding
# Requires the third-party category_encoders package: pip install category_encoders
from category_encoders import TargetEncoder
# Encoding categorical variables with the mean of the target variable
target_encoder = TargetEncoder()
X_train['target_encoded_feature'] = target_encoder.fit_transform(X_train['categorical_column'], y_train)
X_test['target_encoded_feature'] = target_encoder.transform(X_test['categorical_column'])
7. Feature Selection
from sklearn.feature_selection import SelectKBest, f_classif
# Select best features based on ANOVA F-value between label/feature
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
Ensure that each transformation is relevant and adds value to your machine learning model.
Part 9: Data Transformation and Pipelines
Step 1: Import Libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
Step 2: Define Data Transformation and Pipelines
Sample Data Preparation
# Assume `X` is the feature set and `y` is the target variable
# Example data (X)
# Note: Replace this with the actual data
import pandas as pd
data = {
    'age': [25, 32, 47, 51],
    'salary': [50000, 60000, 80000, 90000],
    'gender': ['male', 'female', 'female', 'male'],
    'dept': ['sales', 'engineering', 'engineering', 'hr']
}
X = pd.DataFrame(data)
y = [0, 1, 1, 0] # Example target variable
ColumnTransformer Setup
# Define which columns to apply transformations to
numeric_features = ['age', 'salary']
categorical_features = ['gender', 'dept']
# Construct column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)
Pipeline Setup
from sklearn.linear_model import LogisticRegression
# Construct pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
Step 3: Apply Pipeline to Data
# Fit pipeline to the data
pipeline.fit(X, y)
# Predict using the pipeline
predictions = pipeline.predict(X)
Step 4: Evaluating the Pipeline
from sklearn.metrics import accuracy_score
# Making predictions and evaluating (on the training data here, so the score is optimistic)
y_pred = pipeline.predict(X)
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")
Full Implementable Code
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
# Sample data
data = {
    'age': [25, 32, 47, 51],
    'salary': [50000, 60000, 80000, 90000],
    'gender': ['male', 'female', 'female', 'male'],
    'dept': ['sales', 'engineering', 'engineering', 'hr']
}
X = pd.DataFrame(data)
y = [0, 1, 1, 0]
# Define column sets
numeric_features = ['age', 'salary']
categorical_features = ['gender', 'dept']
# Define data transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)
# Construct pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
# Fit and predict
pipeline.fit(X, y)
y_pred = pipeline.predict(X)
# Evaluate
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")
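The example above fits and evaluates on the same four rows, so its accuracy says little about unseen data. A minimal sketch of the same pipeline evaluated on a held-out test set follows; the twelve-row dataset is hypothetical, and `handle_unknown='ignore'` is added as an assumption so that categories missing from the training split do not cause an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data, larger than the four-row sample so it can be split
X = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29, 44, 56, 23, 35, 41, 60],
    'salary': [50000, 60000, 80000, 90000, 65000, 52000,
               78000, 95000, 48000, 61000, 72000, 99000],
    'dept': ['sales', 'engineering', 'engineering', 'hr', 'sales', 'hr',
             'engineering', 'sales', 'hr', 'engineering', 'sales', 'hr'],
})
y = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'salary']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['dept']),
    ]
)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression()),
])

# The scaler and encoder are fitted on the training rows only
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Held-out accuracy: {accuracy}")
```

Because the preprocessing lives inside the pipeline, calling `fit` on the training split alone keeps the test rows out of the scaling statistics and the encoder's category list.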
# Step 10: Saving and Loading Preprocessed Data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib
# Assuming data has been preprocessed, split into X (features) and y (target)
# Using example data for completeness of this implementation
X, y = pd.DataFrame({'feature1': [0, 1, 2], 'feature2': [3, 4, 5]}), pd.Series([0, 1, 0])
# Splitting the data again for completeness of implementation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Example of feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Saving the preprocessed data
joblib.dump(X_train_scaled, 'X_train_scaled.pkl')
joblib.dump(X_test_scaled, 'X_test_scaled.pkl')
joblib.dump(y_train, 'y_train.pkl')
joblib.dump(y_test, 'y_test.pkl')
joblib.dump(scaler, 'scaler.pkl')
# Loading the preprocessed data
X_train_loaded = joblib.load('X_train_scaled.pkl')
X_test_loaded = joblib.load('X_test_scaled.pkl')
y_train_loaded = joblib.load('y_train.pkl')
y_test_loaded = joblib.load('y_test.pkl')
scaler_loaded = joblib.load('scaler.pkl')
# Verifying the loaded data
print(X_train_loaded)
print(X_test_loaded)
print(y_train_loaded)
print(y_test_loaded)
print(scaler_loaded)
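Instead of saving the scaled arrays and the scaler separately, a fitted pipeline can be saved as a single object so that preprocessing and model stay together. The sketch below is an illustration under assumptions: the small arrays, the file name pipeline.pkl, and the temporary directory are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data
X = np.array([[0.0, 3.0], [1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
y = np.array([0, 0, 1, 1])

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
pipeline.fit(X, y)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pipeline.pkl')
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.array_equal(pipeline.predict(X), loaded_pipeline.predict(X))
print(same_predictions)
```

Files written by joblib are pickles, so they should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.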