Comprehensive HR Data Analysis in Python Using Google Colab

A hands-on project for analyzing HR datasets using Python in Google Colab, covering everything from data importation to advanced analytics.


Description

This project aims to empower professionals with the skills needed to analyze HR datasets effectively. We'll use a hypothetical yet comprehensive HR dataset for a large multinational company. The project will guide you through a series of analysis steps including data importation, cleaning, exploration, visualization, and advanced analytics, providing you with valuable insights into HR metrics that can drive business decisions.

The original prompt:

Let's work through a detailed example of analyzing an HR dataset for a large multinational company in a Google Colab data notebook using Python.

The dataset is something you can make up but make it comprehensive. Let's work through a variety of different types of real-world analysis we can complete, and you can show the code we can use.

Imagine you are directly supporting work in the data notebook so be as detailed as possible and make sure the code actually will work on the first go.

Setting Up Google Colab and Installing Packages

Introduction

In this unit, you will learn how to set up Google Colab, an online environment that allows you to write and execute Python code in your browser. You will also learn how to install necessary packages that will be used for analyzing HR datasets.

Steps to Set Up Google Colab

1. Access Google Colab

  1. Open your web browser.
  2. Navigate to Google Colab.
  3. If you are not signed in, sign in with your Google account.

2. Create a New Notebook

  1. Once you are logged in, click on "File".
  2. From the dropdown menu, select "New Notebook". This will create a new Colab notebook.

3. Rename the Notebook

  1. Click on the notebook title (by default something like "Untitled0.ipynb") at the top left corner of the page.
  2. Rename it to something descriptive like "HR_Dataset_Analysis".

Installing Packages

To analyze HR datasets, you will need a few essential Python packages such as pandas for data manipulation and matplotlib for data visualization. Both come pre-installed in Colab, so the install commands below simply confirm they are available.

1. Install pandas Package

Below is the code to install the pandas package. This should be run in a code cell within your Colab notebook.

!pip install pandas

2. Install matplotlib Package

Similarly, you can install the matplotlib package using the code below.

!pip install matplotlib

3. Import the Packages

After installing the packages, you need to import them to use in your project. Add the following lines to your Colab notebook:

import pandas as pd
import matplotlib.pyplot as plt

Complete Setup Code Block

Here’s a complete code block to set up your Google Colab environment and install required packages:

# Install necessary packages
!pip install pandas
!pip install matplotlib

# Import installed packages
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Test the imports by printing versions
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)

Conclusion

You have now set up your Google Colab environment and installed the necessary packages to begin analyzing HR datasets. You are ready to import data and perform advanced analytics in the subsequent units of this project.

Make sure to save your notebook regularly by clicking "File" and then "Save", or simply pressing Ctrl+S. Happy coding!

Data Importation and Initial Overview

1. Data Importation

Import Required Libraries

import pandas as pd

Load Dataset

Load a CSV file from Google Drive.

from google.colab import drive
drive.mount('/content/drive')

# Adjust the file path accordingly
file_path = '/content/drive/My Drive/dataset/hr_data.csv'
df = pd.read_csv(file_path)
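
If you do not yet have an hr_data.csv in Drive, the sketch below fabricates a small synthetic dataset so the rest of the notebook has something to run against. Every column name, category, and value range here is an illustrative assumption chosen to match the columns used later in this project, not part of any real dataset.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Fabricate a synthetic HR dataset (illustrative columns and ranges only)
df = pd.DataFrame({
    'employee_id': np.arange(1, n + 1),
    'age': rng.integers(21, 65, size=n),
    'gender': rng.choice(['Male', 'Female'], size=n),
    'department': rng.choice(['Sales', 'Engineering', 'HR', 'Finance', 'Marketing'], size=n),
    'education': rng.choice(['High School', 'Bachelor', 'Master', 'PhD'], size=n),
    'hire_date': pd.to_datetime('2010-01-01') + pd.to_timedelta(rng.integers(0, 5000, size=n), unit='D'),
    'salary': rng.normal(70000, 20000, size=n).round(2),
    'satisfaction_level': rng.uniform(0, 1, size=n).round(2),
    'performance_score': rng.integers(1, 6, size=n),
    'Attrition': rng.choice(['Yes', 'No'], size=n, p=[0.16, 0.84]),
})

# Optionally write it out so the pd.read_csv call above can be reused
df.to_csv('hr_data.csv', index=False)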

2. Initial Overview

Display the First Few Rows

print("First 5 Rows of the Dataset:")
print(df.head())

Check the Shape of the Dataset

print("Shape of the Dataset (rows, columns):")
print(df.shape)

Display Column Names

print("Column Names:")
print(df.columns.tolist())

Data Types of Each Column

print("Data Types of Columns:")
print(df.dtypes)

Summary Statistics

print("Summary Statistics:")
print(df.describe(include='all'))

Check for Missing Values

print("Missing Values in Each Column:")
print(df.isnull().sum())

Check for Duplicates

print("Number of Duplicate Rows:")
print(df.duplicated().sum())

Summary

The code above handles:

  • Data importation from Google Drive
  • Displaying initial exploratory data analysis (EDA) including:
    • First few rows of the dataset
    • Shape of the dataset
    • Column names
    • Data types of each column
    • Summary statistics
    • Missing values check
    • Duplicate rows check

Execute each code block in sequence to comprehensively understand your HR dataset, paving the way for advanced analytics.
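
For convenience, the same checks can also be bundled into one helper function and run in a single cell. This is a small convenience sketch; the function name is my own, not part of the original notebook.

def initial_overview(df):
    """Print a quick structural overview of a DataFrame."""
    print("Shape (rows, columns):", df.shape)
    print("\nColumn names:", df.columns.tolist())
    print("\nData types:\n", df.dtypes)
    print("\nMissing values per column:\n", df.isnull().sum())
    print("\nDuplicate rows:", df.duplicated().sum())
    print("\nSummary statistics:\n", df.describe(include='all'))

initial_overview(df)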

Data Cleaning and Preprocessing for HR Datasets

The goal is to clean and preprocess the HR dataset to make it ready for analysis. Let's focus on the following steps:

  1. Handling Missing Values
  2. Converting Data Types
  3. Handling Duplicates
  4. Feature Engineering
  5. Data Normalization/Standardization

1. Handling Missing Values

Replace missing values appropriately based on the column type and business logic.

# Import pandas (assuming data_frame is your already loaded DataFrame)
import pandas as pd

# Fill missing numerical values with the median
data_frame['salary'] = data_frame['salary'].fillna(data_frame['salary'].median())

# Fill missing categorical values with the mode
data_frame['department'] = data_frame['department'].fillna(data_frame['department'].mode()[0])

2. Converting Data Types

Ensure all columns have the correct data types for analysis.

# Convert hire_date to datetime
data_frame['hire_date'] = pd.to_datetime(data_frame['hire_date'])

# Convert salary to float
data_frame['salary'] = data_frame['salary'].astype(float)

3. Handling Duplicates

Remove duplicate records if any.

# Remove duplicate rows
data_frame.drop_duplicates(subset=['employee_id'], inplace=True)

4. Feature Engineering

Create new features that may be useful for analysis.

# Create tenure feature (assuming today is the reference date)
data_frame['tenure'] = (pd.Timestamp.now() - data_frame['hire_date']).dt.days // 365

# Create a feature to indicate if salary is above median
data_frame['above_median_salary'] = data_frame['salary'] > data_frame['salary'].median()

5. Data Normalization/Standardization

Normalize or standardize numerical features if necessary for analysis.

from sklearn.preprocessing import StandardScaler

# Scale 'salary' and 'tenure'
scaler = StandardScaler()
data_frame[['salary', 'tenure']] = scaler.fit_transform(data_frame[['salary', 'tenure']])
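
The heading mentions both normalization and standardization, but only standardization (zero mean, unit variance) is shown above. If scaling to the [0, 1] range is preferred instead, here is a sketch using scikit-learn's MinMaxScaler; apply one approach or the other, not both in sequence.

from sklearn.preprocessing import MinMaxScaler

# Scale 'salary' and 'tenure' to the [0, 1] range instead of standardizing
min_max_scaler = MinMaxScaler()
data_frame[['salary', 'tenure']] = min_max_scaler.fit_transform(data_frame[['salary', 'tenure']])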

Final Preprocessed Data Overview

Check the final state of the dataset to ensure it's ready for advanced analytics.

# Display final data structure and types
print(data_frame.info())
print(data_frame.head())

This implementation provides a hands-on guide for cleaning and preprocessing your HR dataset. You can proceed with advanced analytics in the subsequent parts of your project.

Exploratory Data Analysis (EDA)

Overview

EDA involves summarizing the main characteristics of a dataset often with visual methods. This will help in understanding the structure, patterns, and relationships in the data. We'll be using Python in Google Colab.

Steps for EDA

  1. Summary Statistics

    • Objective: Generate summary statistics for numerical and categorical features.
    • Implementation:
      # Assuming `df` is your DataFrame
      numerical_summary = df.describe()
      categorical_summary = df.describe(include=['object', 'category'])
      
      print("Numerical Summary:\n", numerical_summary)
      print("\nCategorical Summary:\n", categorical_summary)
  2. Missing Values Analysis

    • Objective: Identify and analyze missing values in the dataset.
    • Implementation:
      missing_values = df.isnull().sum()
      missing_ratio = df.isnull().mean()
      
      print("Missing Values:\n", missing_values)
      print("\nMissing Ratio:\n", missing_ratio)
  3. Data Distribution

    • Objective: Visualize the distribution of numerical features.
    • Implementation:
      import matplotlib.pyplot as plt
      import seaborn as sns
      
      numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
      
      for feature in numerical_features:
          plt.figure(figsize=(10, 6))
          sns.histplot(df[feature], kde=True)
          plt.title(f'Distribution of {feature}')
          plt.show()
  4. Correlation Matrix

    • Objective: Understand relationships between numerical variables.
    • Implementation:
      corr_matrix = df.corr(numeric_only=True)  # restrict the correlation to numeric columns
      
      plt.figure(figsize=(12, 8))
      sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
      plt.title('Correlation Matrix')
      plt.show()
  5. Categorical Data Analysis

    • Objective: Analyze the distribution and relationship of categorical data.
    • Implementation:
      categorical_features = df.select_dtypes(include=['object', 'category']).columns
      
      for feature in categorical_features:
          plt.figure(figsize=(10, 6))
          sns.countplot(y=df[feature], order=df[feature].value_counts().index)
          plt.title(f'Distribution of {feature}')
          plt.show()
  6. Pair Plot

    • Objective: Visualize relationships between numerical variables using pair plots.
    • Implementation:
      sns.pairplot(df[numerical_features])
      plt.show()

Conclusion

This EDA will give you a comprehensive understanding of your HR dataset, which is crucial before moving on to advanced analytics.

Visualizing HR Metrics

Assuming that the data has already been imported, cleaned, and preprocessed, and exploratory data analysis has been conducted, let’s move on to visualizing HR metrics.

1. Import Necessary Libraries

Make sure you have the required libraries:

import matplotlib.pyplot as plt
import seaborn as sns

# Optional but recommended for larger datasets
import pandas as pd

2. Example HR Dataset

Consider you have a DataFrame named hr_data with columns such as employee_id, age, department, salary, years_at_company, satisfaction_level, performance_score, etc.

# For demonstration purposes, here's an example of what the DataFrame could look like
# hr_data = pd.read_csv('path_to_file.csv')
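
The plots below expect hr_data to already contain these columns. If you want to run them without a file, here is a minimal sketch that fabricates a small synthetic hr_data frame; every column name, category, and value range is an illustrative assumption.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic demonstration data matching the columns referenced below
hr_data = pd.DataFrame({
    'employee_id': np.arange(1, n + 1),
    'age': rng.integers(21, 65, size=n),
    'department': rng.choice(['Sales', 'Engineering', 'HR', 'Finance'], size=n),
    'salary': rng.normal(70000, 15000, size=n).round(2),
    'years_at_company': rng.integers(0, 20, size=n),
    'satisfaction_level': rng.uniform(0, 1, size=n).round(2),
    'performance_score': rng.integers(1, 6, size=n),
})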

3. Visualization Examples

Employee Distribution by Department

plt.figure(figsize=(10, 6))
sns.countplot(data=hr_data, x='department', palette='viridis')
plt.title('Employee Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Number of Employees')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Employee Age Distribution

plt.figure(figsize=(10, 6))
sns.histplot(hr_data['age'], bins=20, kde=True, color='blue')
plt.title('Employee Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

Satisfaction Level by Department

plt.figure(figsize=(12, 8))
sns.boxplot(data=hr_data, x='department', y='satisfaction_level', palette='coolwarm')
plt.title('Satisfaction Level by Department')
plt.xlabel('Department')
plt.ylabel('Satisfaction Level')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Salary Distribution by Department

plt.figure(figsize=(12, 8))
sns.boxplot(data=hr_data, x='department', y='salary', palette='magma')
plt.title('Salary Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Salary')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Performance Score vs. Satisfaction Level

plt.figure(figsize=(10, 6))
sns.scatterplot(data=hr_data, x='performance_score', y='satisfaction_level', hue='department', palette='tab10', s=100)
plt.title('Performance Score vs. Satisfaction Level')
plt.xlabel('Performance Score')
plt.ylabel('Satisfaction Level')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

4. Conclusion

These plots are essential for understanding various HR metrics. They help visualize the data, making it easier to identify trends and patterns. You can generate these visualizations and customize them further to address specific questions or insights needed from your HR data. By running these examples in Google Colab, you should be able to derive meaningful insights effortlessly.

Analyzing Employee Demographics

In this section we will analyze the employee demographic data: calculating essential statistics, examining distributions, and drawing insights from various demographic metrics.

Prerequisite: Importing Necessary Libraries

Ensure you have the following libraries imported if not already done in the earlier sections.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: To display all columns of a DataFrame
pd.set_option('display.max_columns', None)

Step 1: Loading the Data

Assuming the dataset is named employee_data.csv and resides in your Google Drive.

from google.colab import drive
drive.mount('/content/drive')

# Load the data
file_path = '/content/drive/My Drive/employee_data.csv'
df = pd.read_csv(file_path)
df.head()

Step 2: Analyzing Basic Demographic Information

Let's start by getting an overview of the demographic variables such as age, gender, department, and education.

# Basic statistics for age
age_stats = df['age'].describe()
print("Age Statistics:\n", age_stats)

# Gender distribution
gender_distribution = df['gender'].value_counts()
print("\nGender Distribution:\n", gender_distribution)

# Department distribution
department_distribution = df['department'].value_counts()
print("\nDepartment Distribution:\n", department_distribution)

# Education level distribution
education_distribution = df['education'].value_counts()
print("\nEducation Level Distribution:\n", education_distribution)

Step 3: Visualizing Demographic Data

3.1 Age Distribution

plt.figure(figsize=(10, 6))
sns.histplot(df['age'], kde=True, bins=30, color='blue')
plt.title('Age Distribution of Employees')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

3.2 Gender Distribution

plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='gender', palette='Set2')
plt.title('Gender Distribution of Employees')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

3.3 Department Distribution

plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='department', palette='Set3')
plt.title('Department Distribution')
plt.xlabel('Department')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

3.4 Education Level Distribution

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='education', palette='Set1')
plt.title('Education Level Distribution')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

Step 4: Cross-Analyzing Demographic Data

4.1 Gender vs. Department

plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='department', hue='gender', palette='Set2')
plt.title('Gender Distribution Across Departments')
plt.xlabel('Department')
plt.ylabel('Count')
plt.legend(title='Gender')
plt.xticks(rotation=45)
plt.show()

4.2 Education Level vs. Age

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='education', y='age', palette='Set1')
plt.title('Age Distribution Across Education Levels')
plt.xlabel('Education Level')
plt.ylabel('Age')
plt.xticks(rotation=45)
plt.show()

Step 5: Generating Summary Report

Collecting the insights into a summary report.

summary_report = {
    "age_statistics": age_stats.to_dict(),
    "gender_distribution": gender_distribution.to_dict(),
    "department_distribution": department_distribution.to_dict(),
    "education_distribution": education_distribution.to_dict()
}

# Convert summary report to DataFrame for better visualization
summary_df = pd.DataFrame(summary_report)
summary_df

We have now analyzed the employee demographics by calculating key statistics and visualizing the data to glean insights. This should allow us to understand the demographic makeup of the workforce comprehensively.

Attrition Analysis

In this section, we'll use Python to perform attrition analysis to understand and predict why employees leave a company. You can leverage this to make data-driven decisions to improve employee retention.

1. Data Preparation

Assume you have already loaded and cleaned your dataset.

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Assume `df` is your preprocessed DataFrame
# Split the data into features and target
X = df.drop("Attrition", axis=1)
y = df["Attrition"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
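
Note that StandardScaler and RandomForestClassifier expect numeric features, so any remaining text columns in df (for example department or gender) need to be encoded before the split above. A minimal sketch using one-hot encoding, assuming such columns still exist:

# One-hot encode any remaining categorical feature columns before splitting
X = pd.get_dummies(df.drop("Attrition", axis=1), drop_first=True)
y = df["Attrition"]

After encoding, rerun the train/test split and scaling cells so X_train and X_test contain only numeric columns.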

2. Model Training

We'll use a RandomForestClassifier for the prediction.

# Initialize the RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

3. Model Evaluation

Evaluate the model using the test dataset.

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the predictions
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))

4. Feature Importance

Identify important features that contribute to attrition.

# Get feature importances
importance = model.feature_importances_
features = X.columns

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print(feature_importance_df)
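
A horizontal bar chart often makes the ranking easier to scan than a printed table. A quick sketch using matplotlib, assuming feature_importance_df from above:

import matplotlib.pyplot as plt

top_features = feature_importance_df.head(10)  # show the ten most important features

plt.figure(figsize=(10, 6))
plt.barh(top_features['Feature'], top_features['Importance'], color='steelblue')
plt.gca().invert_yaxis()  # most important feature at the top
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances for Attrition')
plt.tight_layout()
plt.show()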

5. Results Interpretation

Interpret the results to make business decisions.

  • The confusion matrix and classification report provide insights into the model's performance.
  • Accuracy score gives an overall idea of how well the model is performing.
  • The feature importance dataframe allows you to see which features have the most impact on employee attrition.
  • Use this information to delve deeper into the most influential features, such as job satisfaction, number of projects, and average working hours, and take corrective actions.

6. Conclusion

By understanding which features affect employee attrition and how accurately you can predict it, your organization can implement more effective retention strategies.

This hands-on tutorial demonstrated how to conduct an attrition analysis using machine learning models, evaluate the model performance, and extract insights to improve HR policies.

Performance Evaluation and Metrics

In the context of analyzing HR datasets, performance evaluation often involves measuring the efficacy of predictive models or assessing the health of employee metrics. Below is a practical implementation focusing on evaluating a predictive model for employee attrition using Python in Google Colab.

1. Importing Required Libraries

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

2. Load the preprocessed dataset

Assuming you have a DataFrame named df that has already been cleaned and preprocessed.

# Example: Load preprocessed data
df = pd.read_csv('preprocessed_hr_dataset.csv')

3. Splitting Data

from sklearn.model_selection import train_test_split

# Features and target variable
X = df.drop('Attrition', axis=1)  # Features
y = df['Attrition']               # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Train a Predictive Model

from sklearn.ensemble import RandomForestClassifier

# Instantiate the model
model = RandomForestClassifier(random_state=42)

# Train the model
model.fit(X_train, y_train)

5. Make Predictions

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

6. Performance Metrics Evaluation

  1. Accuracy

    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy: {accuracy:.2f}')
  2. Confusion Matrix

    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()
  3. Classification Report

    report = classification_report(y_test, y_pred)
    print('Classification Report:')
    print(report)
  4. ROC AUC Score and ROC Curve

    roc_auc = roc_auc_score(y_test, y_prob)
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    
    plt.plot(fpr, tpr, label=f'ROC Curve (area = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc='best')
    plt.show()

7. Implementing Advanced Metrics

For more in-depth analysis, include metrics like Precision-Recall Curve, F1 Score, etc.

  1. F1 Score

    from sklearn.metrics import f1_score
    
    f1 = f1_score(y_test, y_pred)
    print(f'F1 Score: {f1:.2f}')
  2. Precision-Recall Curve

    from sklearn.metrics import precision_recall_curve
    
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    
    plt.plot(recall, precision)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.show()

By utilizing these metrics, you can comprehensively evaluate the performance of your predictive models in your HR dataset analysis project.

Compensation and Benefits Analysis

In this section, we'll analyze the compensation and benefits data to gain insights into trends, distributions, and identify any potential disparities. We'll utilize DataFrames and visualization libraries available in Python within Google Colab.

Step 1: Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Load Data

Assume compensation_data.csv is the dataset containing the relevant information.

# Load dataset
comp_df = pd.read_csv('compensation_data.csv')

# Preview dataframe
comp_df.head()

Step 3: Descriptive Statistics

# Basic descriptive statistics
comp_df.describe()

# Distribution of Compensation Levels
plt.figure(figsize=(10, 6))
sns.histplot(comp_df['salary'], bins=30, kde=True)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()

Step 4: Compensation by Department and Job Title

# Average salary by department
avg_salary_dept = comp_df.groupby('department')['salary'].mean().reset_index()

# Plot average salary by department
plt.figure(figsize=(12, 8))
sns.barplot(data=avg_salary_dept, x='department', y='salary')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary')
plt.xticks(rotation=45)
plt.show()

# Average salary by job title
avg_salary_title = comp_df.groupby('job_title')['salary'].mean().reset_index()

# Plot average salary by job title
plt.figure(figsize=(12, 10))
sns.barplot(data=avg_salary_title, x='salary', y='job_title')
plt.title('Average Salary by Job Title')
plt.xlabel('Average Salary')
plt.ylabel('Job Title')
plt.show()

Step 5: Gender Pay Gap Analysis

# Average salary by gender
avg_salary_gender = comp_df.groupby('gender')['salary'].mean().reset_index()

# Plot average salary by gender
plt.figure(figsize=(8, 6))
sns.barplot(data=avg_salary_gender, x='gender', y='salary')
plt.title('Average Salary by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Salary')
plt.show()
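
Comparing raw averages only hints at a gap. A short follow-up sketch (assuming the same comp_df columns) expresses the difference as a percentage and breaks it down by department to partially control for role mix:

# Raw gap: difference in mean salary as a percentage of the higher mean
mean_by_gender = comp_df.groupby('gender')['salary'].mean()
gap_pct = (mean_by_gender.max() - mean_by_gender.min()) / mean_by_gender.max() * 100
print(f"Raw average pay gap: {gap_pct:.1f}%")

# Gap within each department (a crude control for differences in role mix)
dept_gender = comp_df.groupby(['department', 'gender'])['salary'].mean().unstack()
print(dept_gender)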

Step 6: Benefits Analysis

Assuming the dataset has columns related to benefits like 'health_insurance', 'retirement_plan', 'paid_time_off', etc.

# Count of benefits offered
benefit_columns = ['health_insurance', 'retirement_plan', 'paid_time_off']
benefit_counts = comp_df[benefit_columns].sum().reset_index()
benefit_counts.columns = ['Benefit', 'Count']

# Plot benefits distribution
plt.figure(figsize=(10, 6))
sns.barplot(data=benefit_counts, x='Benefit', y='Count')
plt.title('Distribution of Benefits Offered')
plt.xlabel('Benefit')
plt.ylabel('Count')
plt.show()
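
The sum above assumes the benefit columns are already boolean or 0/1. If they are stored as 'Yes'/'No' strings, a small conversion sketch (column names as assumed above) keeps the counts correct:

# Map 'Yes'/'No' strings to 1/0 before counting benefit uptake
for col in benefit_columns:
    if comp_df[col].dtype == 'object':
        comp_df[col] = comp_df[col].map({'Yes': 1, 'No': 0})

benefit_counts = comp_df[benefit_columns].sum().reset_index()
benefit_counts.columns = ['Benefit', 'Count']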

Step 7: Compensation vs. Performance

Assume performance_score column exists.

# Scatter plot of salary vs performance score
plt.figure(figsize=(10, 6))
sns.scatterplot(data=comp_df, x='performance_score', y='salary')
plt.title('Salary vs Performance Score')
plt.xlabel('Performance Score')
plt.ylabel('Salary')
plt.show()

# Average salary by performance score
avg_salary_perf = comp_df.groupby('performance_score')['salary'].mean().reset_index()

# Plot average salary by performance score
plt.figure(figsize=(8, 6))
sns.barplot(data=avg_salary_perf, x='performance_score', y='salary')
plt.title('Average Salary by Performance Score')
plt.xlabel('Performance Score')
plt.ylabel('Average Salary')
plt.show()

Conclusion

This analysis helps identify trends and insights such as average compensation by department, job title, gender, benefits distribution, and correlation between compensation and performance. This structured approach enables a comprehensive understanding of compensation and benefits within the organization.

Predictive Modeling for Employee Attrition

You are now ready to create a predictive model based on the cleaned and preprocessed HR dataset. The goal is to predict whether an employee will leave the company (attrition).

Step 1: Import Required Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Step 2: Prepare the Dataset

Assuming your data is in a DataFrame named df, and the target variable (attrition) is a column named Attrition.

# Separate target variable
X = df.drop('Attrition', axis=1)
y = df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 3: Build and Train the Model

Using Random Forest Classifier as an example:

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

Step 4: Evaluate the Model

# Predict on test data
y_pred = rf_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Step 5: Interpret Results

Use the classification_report and confusion_matrix to understand the performance of your model. The accuracy_score gives a quick metric of how well your model is doing.

# Check feature importance
feature_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(feature_importances.sort_values(ascending=False))

Conclusion

This implementation covers the complete pipeline for predictive modeling of employee attrition using a Random Forest Classifier in Python. You can adjust the model or hyperparameters as necessary to improve performance.

Advanced Analytics with Machine Learning

Step 1: Preparing the Dataset for Advanced Analytics

Ensure your dataset is loaded and preprocessed correctly. Assuming the dataset from the previous sections is already cleaned and ready:

# Assuming 'hr_data' is your cleaned and preprocessed Pandas DataFrame
import pandas as pd

# Load preprocessed data
# hr_data = pd.read_csv('preprocessed_hr_data.csv')

Step 2: Split Dataset into Features and Target Variable

Here, we'll consider 'Attrition' as the target variable for classification tasks:

# Target variable
X = hr_data.drop(columns=['Attrition'])
y = hr_data['Attrition']

Step 3: Train-Test Split

Split the data into training and testing sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Feature Scaling

Standardize the feature variables:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 5: Model Selection and Training

Let's use three different algorithms: Logistic Regression, Random Forest, and Gradient Boosting for this example.

Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

log_reg_pred = log_reg.predict(X_test_scaled)
print("Logistic Regression Classification Report:\n", classification_report(y_test, log_reg_pred))

Random Forest

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier()
random_forest.fit(X_train_scaled, y_train)

random_forest_pred = random_forest.predict(X_test_scaled)
print("Random Forest Classification Report:\n", classification_report(y_test, random_forest_pred))

Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

grad_boost = GradientBoostingClassifier()
grad_boost.fit(X_train_scaled, y_train)

grad_boost_pred = grad_boost.predict(X_test_scaled)
print("Gradient Boosting Classification Report:\n", classification_report(y_test, grad_boost_pred))

Step 6: Hyperparameter Tuning with GridSearchCV on the Best Model

Choose the best model based on initial results and perform hyperparameter tuning:

from sklearn.model_selection import GridSearchCV

# Example for hyperparameter tuning of Gradient Boosting
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

grid_search = GridSearchCV(estimator=grad_boost, param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

best_grad_boost = grid_search.best_estimator_

best_grad_boost_pred = best_grad_boost.predict(X_test_scaled)
print("Best Gradient Boosting Classification Report:\n", classification_report(y_test, best_grad_boost_pred))

Step 7: Model Evaluation and Interpretation

Evaluate the best model's performance using confusion matrix and AUC-ROC:

from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, best_grad_boost_pred)
print("Confusion Matrix:\n", conf_matrix)

# AUC-ROC
y_pred_proba = best_grad_boost.predict_proba(X_test_scaled)[:,1]
auc_roc = roc_auc_score(y_test, y_pred_proba)
print("AUC-ROC Score:", auc_roc)

# Plotting ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, label=f'Gradient Boosting (area = {auc_roc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()

Conclusion

By following these steps, you should have successfully completed an advanced machine learning analysis on your HR dataset in Google Colab. The implementation provided includes model training, hyperparameter tuning, and evaluation, which helps in deriving meaningful insights and making data-driven decisions.

Reporting and Dashboard Creation

1. Import Necessary Libraries

Ensure you have the following libraries loaded. They are necessary for creating reports and dashboards in Google Colab.
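
Plotly is typically pre-installed in Colab, but Dash usually is not. If the imports below fail, install it first; a single install cell should suffice (assuming a standard Colab environment):

!pip install dash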

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from dash import Dash, html, dcc

2. Example Data Import

Assuming the HR dataset has been preprocessed, let's load the cleaned dataset.

df = pd.read_csv('cleaned_hr_dataset.csv')

3. Summary Report Generation

Create summary statistical reports using pandas.

summary = df.describe()
summary.to_csv('summary_report.csv')

4. Example Dashboard Layout

Use Dash for creating an interactive dashboard.

app = Dash(__name__)

app.layout = html.Div(children=[
    html.H1(children='HR Data Dashboard'),
    
    dcc.Graph(
        id='example-graph',
        figure=px.histogram(df, x='YearsAtCompany', title='Employees by Years At Company')
    ),
    
    dcc.Graph(
        id='attrition-rate',
        figure=px.pie(df, names='Attrition', title='Attrition Rate')
    ),
    
    dcc.Graph(
        id='dept-distribution',
        figure=px.bar(df, x='Department', y='EmployeeCount', title='Department Distribution')
    )
])

if __name__ == '__main__':
    app.run(debug=True)  # app.run() replaces the older app.run_server() in recent Dash releases

5. Combining and Serving the Dashboard

Ensure the DataFrame manipulations and visualizations are cohesive and integrate them into a running dashboard.

# Additional Graphs and Components as Needed
app.layout = html.Div(children=[
    html.H1(children='HR Data Dashboard'),

    dcc.Tabs([
        dcc.Tab(label='Overview', children=[
            html.Div([
                dcc.Graph(
                    id='overview-bar',
                    figure=px.bar(df, x='JobRole', y='MonthlyIncome', title='Monthly Income by Job Role')
                ),
                
                dcc.Graph(
                    id='overview-pie',
                    figure=px.pie(df, names='Gender', title='Gender Distribution')
                )
            ])
        ]),

        dcc.Tab(label='Attrition Analysis', children=[
            html.Div([
                dcc.Graph(
                    id='attrition-histogram',
                    figure=px.histogram(df, x='Age', color='Attrition', barmode='group', title='Attrition by Age')
                ),
                
                dcc.Graph(
                    id='attrition-dept',
                    figure=px.bar(df, x='Department', y='AttritionRate', title='Attrition Rate by Department')
                )
            ])
        ]),
        
        # Additional tabs can be defined here
    ])
])

if __name__ == '__main__':
    app.run(debug=True, port=8050)  # app.run() replaces app.run_server() in recent Dash releases

Additional Sections

You can expand with more plots and computations as needed and add them to the dashboard layout.
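
For example, the 'AttritionRate' column used in the attrition tab above is not part of a typical raw HR extract. Here is a sketch that derives it per department (assuming Attrition holds 'Yes'/'No' values) and feeds the aggregated frame into a graph component you could drop into the layout:

# Derive attrition rate per department from the raw Attrition flag
attrition_by_dept = (
    df.assign(attrition_flag=(df['Attrition'] == 'Yes').astype(int))
      .groupby('Department', as_index=False)['attrition_flag']
      .mean()
      .rename(columns={'attrition_flag': 'AttritionRate'})
)

dcc.Graph(
    id='attrition-dept-derived',
    figure=px.bar(attrition_by_dept, x='Department', y='AttritionRate',
                  title='Attrition Rate by Department')
)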

With this implementation, you should be able to run an interactive HR report and visualization dashboard in any local environment that supports Plotly and Dash. Serving a Dash app from inside Google Colab typically requires additional setup (for example, a recent Dash release with built-in Jupyter support or a tunnelling proxy). The provided code snippets cover the fundamentals of reporting and dashboard creation, from loading data to rendering interactive graphs.