Comprehensive HR Data Analysis in Python Using Google Colab
A hands-on project for analyzing HR datasets using Python in Google Colab. From data importation to advanced analytics, this project will cover all essential aspects.
Description
This project aims to empower professionals with the skills needed to analyze HR datasets effectively. We'll use a hypothetical yet comprehensive HR dataset for a large multinational company. The project will guide you through a series of analysis steps including data importation, cleaning, exploration, visualization, and advanced analytics, providing you with valuable insights into HR metrics that can drive business decisions.
The original prompt:
Let's work through a detailed example of analyzing an HR dataset for a large multinational company in a Google Colab data notebook using Python.
The dataset is something you can make up but make it comprehensive. Let's work through a variety of different types of real-world analysis we can complete, and you can show the code we can use.
Imagine you are directly supporting work in the data notebook so be as detailed as possible and make sure the code actually will work on the first go.
In this unit, you will learn how to set up Google Colab, an online environment that allows you to write and execute Python code in your browser. You will also learn how to install necessary packages that will be used for analyzing HR datasets.
Setting Up Google Colab
1. Open Google Colab
Navigate to https://colab.research.google.com in your browser. If you are not signed in, sign in with your Google account.
2. Create a New Notebook
Once you are logged in, click on "File".
From the dropdown menu, select "New Notebook". This will create a new Colab notebook.
3. Rename the Notebook
Click on the notebook title (by default something like "Untitled0.ipynb") at the top left corner of the page.
Rename it to something descriptive like "HR_Dataset_Analysis".
Installing Packages
To analyze HR datasets, you will need a few essential Python packages such as pandas for data manipulation and matplotlib for data visualization. Both come pre-installed in Colab, so the install commands below will usually just confirm the packages are already available.
1. Install pandas Package
Below is the code to install the pandas package. This should be run in a code cell within your Colab notebook.
!pip install pandas
2. Install matplotlib Package
Similarly, you can install the matplotlib package using the code below.
!pip install matplotlib
3. Import the Packages
After installing the packages, you need to import them to use in your project. Add the following lines to your Colab notebook:
import pandas as pd
import matplotlib.pyplot as plt
Complete Setup Code Block
Here’s a complete code block to set up your Google Colab environment and install required packages:
# Install necessary packages
!pip install pandas
!pip install matplotlib
# Import installed packages
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
# Test the imports by printing versions
# (the version attribute lives on matplotlib itself, not on pyplot)
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
Conclusion
You have now set up your Google Colab environment and installed the necessary packages to begin analyzing HR datasets. You are ready to import data and perform advanced analytics in the subsequent units of this project.
Make sure to save your notebook regularly by clicking on "File" then select "Save" or simply press Ctrl+S. Happy coding!
Data Importation and Initial Overview
1. Data Importation
Import Required Libraries
import pandas as pd
Load Dataset
Load a CSV file from Google Drive.
from google.colab import drive
drive.mount('/content/drive')
# Adjust the file path accordingly
file_path = '/content/drive/My Drive/dataset/hr_data.csv'
df = pd.read_csv(file_path)
2. Initial Overview
Display the First Few Rows
print("First 5 Rows of the Dataset:")
print(df.head())
Check the Shape of the Dataset
print("Shape of the Dataset (rows, columns):")
print(df.shape)
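The summary at the end of this section also lists column names, data types, and summary statistics; the standard pandas checks for those are:
Check Column Names and Data Types
print("Column Names:")
print(df.columns.tolist())
print("Data Types of Each Column:")
print(df.dtypes)
Summary Statistics
print("Summary Statistics for Numerical Columns:")
print(df.describe())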
print("Missing Values in Each Column:")
print(df.isnull().sum())
Check for Duplicates
print("Number of Duplicate Rows:")
print(df.duplicated().sum())
Summary
The code above handles:
Data importation from Google Drive
Displaying initial exploratory data analysis (EDA) including:
First few rows of the dataset
Shape of the dataset
Column names
Data types of each column
Summary statistics
Missing values check
Duplicate rows check
Execute each code block in sequence to comprehensively understand your HR dataset, paving the way for advanced analytics.
Data Cleaning and Preprocessing for HR Datasets
The goal is to clean and preprocess the HR dataset to make it ready for analysis. Let's focus on the following steps:
Handling Missing Values
Converting Data Types
Handling Duplicates
Feature Engineering
Data Normalization/Standardization
1. Handling Missing Values
Replace missing values appropriately based on the column type and business logic.
# Import pandas (assuming data_frame is your already-loaded DataFrame)
import pandas as pd
# Fill missing numerical values with the median; assigning back avoids
# pandas' deprecated inplace chained-assignment pattern
data_frame['salary'] = data_frame['salary'].fillna(data_frame['salary'].median())
# Fill missing categorical values with the mode
data_frame['department'] = data_frame['department'].fillna(data_frame['department'].mode()[0])
2. Converting Data Types
Ensure all columns have the correct data types for analysis.
# Convert hire_date to datetime
data_frame['hire_date'] = pd.to_datetime(data_frame['hire_date'])
# Convert salary to float
data_frame['salary'] = data_frame['salary'].astype(float)
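3. Handling Duplicates
Remove exact duplicate rows so each employee record is counted only once. A minimal sketch; pass a subset argument to drop_duplicates if only certain columns define a duplicate.
# Drop exact duplicate rows and reset the index
data_frame = data_frame.drop_duplicates().reset_index(drop=True)
4. Feature Engineering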
Create new features that may be useful for analysis.
# Create tenure feature (assuming today is the reference date)
data_frame['tenure'] = (pd.Timestamp.now() - data_frame['hire_date']).dt.days // 365
# Create a feature to indicate if salary is above median
data_frame['above_median_salary'] = data_frame['salary'] > data_frame['salary'].median()
5. Data Normalization/Standardization
Normalize or standardize numerical features if necessary for analysis.
from sklearn.preprocessing import StandardScaler
# Scale 'salary' and 'tenure'
scaler = StandardScaler()
data_frame[['salary', 'tenure']] = scaler.fit_transform(data_frame[['salary', 'tenure']])
Final Preprocessed Data Overview
Check the final state of the dataset to ensure it's ready for advanced analytics.
# Display final data structure and types
print(data_frame.info())
print(data_frame.head())
This implementation provides a hands-on guide for cleaning and preprocessing your HR dataset. You can proceed with advanced analytics in the subsequent parts of your project.
Exploratory Data Analysis (EDA)
Overview
EDA involves summarizing the main characteristics of a dataset, often with visual methods. It helps you understand the structure, patterns, and relationships in the data. We'll be using Python in Google Colab.
Steps for EDA
Summary Statistics
Objective: Generate summary statistics for numerical and categorical features.
Implementation:
# Assuming `df` is your DataFrame
numerical_summary = df.describe()
categorical_summary = df.describe(include=['object', 'category'])
print("Numerical Summary:\n", numerical_summary)
print("\nCategorical Summary:\n", categorical_summary)
Missing Values Analysis
Objective: Identify and analyze missing values in the dataset.
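Implementation (a minimal check, reusing the df from the sections above):
# Count and rank missing values per column
missing_counts = df.isnull().sum().sort_values(ascending=False)
print("Missing Values per Column:\n", missing_counts[missing_counts > 0])
Distribution of Numerical Features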
Objective: Visualize the distribution of numerical features.
Implementation:
import matplotlib.pyplot as plt
import seaborn as sns
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns
for feature in numerical_features:
    plt.figure(figsize=(10, 6))
    sns.histplot(df[feature], kde=True)
    plt.title(f'Distribution of {feature}')
    plt.show()
Correlation Matrix
Objective: Understand relationships between numerical variables.
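Implementation (a standard heatmap sketch; numerical_features is defined in the previous step):
# Compute pairwise correlations between numerical columns
corr_matrix = df[numerical_features].corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Categorical Data Analysis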
Objective: Analyze the distribution and relationship of categorical data.
Implementation:
categorical_features = df.select_dtypes(include=['object', 'category']).columns
for feature in categorical_features:
    plt.figure(figsize=(10, 6))
    sns.countplot(y=df[feature], order=df[feature].value_counts().index)
    plt.title(f'Distribution of {feature}')
    plt.show()
Pair Plot
Objective: Visualize relationships between numerical variables using pair plots.
Implementation:
sns.pairplot(df[numerical_features])
plt.show()
Conclusion
This EDA will give you a comprehensive understanding of your HR dataset, which is crucial before moving on to advanced analytics.
Visualizing HR Metrics
Assuming that the data has already been imported, cleaned, and preprocessed, and exploratory data analysis has been conducted, let’s move on to visualizing HR metrics.
1. Import Necessary Libraries
Make sure you have the required libraries:
import matplotlib.pyplot as plt
import seaborn as sns
# pandas should already be imported from earlier sections; re-importing is harmless
import pandas as pd
2. Example HR Dataset
Suppose you have a DataFrame named hr_data with columns such as employee_id, age, department, salary, years_at_company, satisfaction_level, performance_score, etc.
# For demonstration purposes, here's an example of what the DataFrame could look like
# hr_data = pd.read_csv('path_to_file.csv')
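If you don't have a CSV yet, the following hypothetical sample (made-up values; the column names mirror those listed above) keeps the plots below runnable:
import numpy as np
rng = np.random.default_rng(42)
n = 500
hr_data = pd.DataFrame({
    'employee_id': np.arange(1, n + 1),
    'age': rng.integers(22, 65, size=n),
    'department': rng.choice(['Sales', 'Engineering', 'HR', 'Finance', 'Marketing'], size=n),
    'salary': rng.normal(70000, 15000, size=n).round(2),
    'years_at_company': rng.integers(0, 20, size=n),
    'satisfaction_level': rng.uniform(0, 1, size=n).round(2),
    'performance_score': rng.integers(1, 6, size=n),
})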
3. Visualization Examples
Employee Distribution by Department
plt.figure(figsize=(10, 6))
sns.countplot(data=hr_data, x='department', palette='viridis')
plt.title('Employee Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Number of Employees')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
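Salary Distribution by Department
The same pattern extends to other metrics; for example, a box plot of salary by department (using the columns from the sample data above):
plt.figure(figsize=(10, 6))
sns.boxplot(data=hr_data, x='department', y='salary')
plt.title('Salary Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Salary')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()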
These plots are essential for understanding various HR metrics: they make trends and patterns easier to spot, and you can customize them further to address specific questions about your HR data. Running these examples in Google Colab should let you derive meaningful insights quickly.
Analyzing Employee Demographics
In this section we analyze employee demographic data: calculating essential statistics, examining distributions, and drawing insights from key demographic metrics.
Prerequisite: Importing Necessary Libraries
Ensure you have the following libraries imported if not already done in the earlier sections.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Optional: To display all columns of a DataFrame
pd.set_option('display.max_columns', None)
Step 1: Loading the Data
Assuming the dataset is named employee_data.csv and resides in your Google Drive.
from google.colab import drive
drive.mount('/content/drive')
# Load the data
file_path = '/content/drive/My Drive/employee_data.csv'
df = pd.read_csv(file_path)
df.head()
Step 2: Analyzing Basic Demographic Information
Let's start by getting an overview of the demographic variables such as age, gender, department, and education.
# Basic statistics for age
age_stats = df['age'].describe()
print("Age Statistics:\n", age_stats)
# Gender distribution
gender_distribution = df['gender'].value_counts()
print("\nGender Distribution:\n", gender_distribution)
# Department distribution
department_distribution = df['department'].value_counts()
print("\nDepartment Distribution:\n", department_distribution)
# Education level distribution
education_distribution = df['education'].value_counts()
print("\nEducation Level Distribution:\n", education_distribution)
Step 3: Visualizing Demographic Data
3.1 Age Distribution
plt.figure(figsize=(10, 6))
sns.histplot(df['age'], kde=True, bins=30, color='blue')
plt.title('Age Distribution of Employees')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
3.2 Gender Distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='gender', palette='Set2')
plt.title('Gender Distribution of Employees')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
We have now analyzed the employee demographics by calculating key statistics and visualizing the data to glean insights. This should allow us to understand the demographic makeup of the workforce comprehensively.
Attrition Analysis
In this section, we'll use Python to perform attrition analysis to understand and predict why employees leave a company. You can leverage this to make data-driven decisions to improve employee retention.
1. Data Preparation
Assume you have already loaded and cleaned your dataset.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Assume `df` is your preprocessed DataFrame
# Split the data into features and target; one-hot encode any categorical
# features so the scaler and model receive numeric input
X = pd.get_dummies(df.drop("Attrition", axis=1), drop_first=True)
y = df["Attrition"]
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
2. Model Training
We'll use a RandomForestClassifier for the prediction.
# Initialize the RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
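3. Model Evaluation
Evaluate the trained model on the held-out test set using the metrics imported in step 1:
# Predict on the test set and report standard classification metrics
y_pred = model.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
4. Feature Importance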
Identify important features that contribute to attrition.
# Get feature importances
importance = model.feature_importances_
features = X.columns
# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
print(feature_importance_df)
5. Results Interpretation
Interpret the results to make business decisions.
The confusion matrix and classification report provide insights into the model's performance.
Accuracy score gives an overall idea of how well the model is performing.
The feature importance dataframe allows you to see which features have the most impact on employee attrition.
Use this information to delve deeper into the most influential features, such as job satisfaction, number of projects, and average working hours, and take corrective actions.
6. Conclusion
By understanding which features affect employee attrition and how accurately you can predict it, your organization can implement more effective retention strategies.
This hands-on tutorial demonstrated how to conduct an attrition analysis using machine learning models, evaluate the model performance, and extract insights to improve HR policies.
Performance Evaluation and Metrics
In the context of analyzing HR datasets, performance evaluation often involves measuring the efficacy of predictive models or assessing the health of employee metrics. Below is a practical implementation focusing on evaluating a predictive model for employee attrition using Python in Google Colab.
1. Importing Required Libraries
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
2. Load the Preprocessed Dataset
Assuming you have a DataFrame named df that has already been cleaned and preprocessed.
# Example: Load preprocessed data
df = pd.read_csv('preprocessed_hr_dataset.csv')
3. Splitting Data
from sklearn.model_selection import train_test_split
# Features and target variable
X = df.drop('Attrition', axis=1) # Features
y = df['Attrition'] # Target variable
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
4. Train a Predictive Model
from sklearn.ensemble import RandomForestClassifier
# Instantiate the model
model = RandomForestClassifier(random_state=42)
# Train the model
model.fit(X_train, y_train)
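5. Evaluate with Performance Metrics
A sketch using the metrics imported in step 1. The ROC computations assume the Attrition target was encoded as 0/1 during preprocessing; pass pos_label to roc_curve if you kept string labels.
# Predictions and predicted probabilities for the positive class
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
# Plot the ROC curve against the chance diagonal
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Random Forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Attrition Prediction')
plt.legend()
plt.show()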
By utilizing these metrics, you can comprehensively evaluate the performance of your predictive models in your HR dataset analysis project.
Compensation and Benefits Analysis
In this section, we'll analyze compensation and benefits data to understand trends and distributions and to identify potential disparities. We'll use pandas DataFrames and Python visualization libraries within Google Colab.
Step 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load Data
Assume compensation_data.csv is the dataset containing the relevant information.
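# Load the compensation dataset (adjust the path to where your file lives)
comp_df = pd.read_csv('compensation_data.csv')
Step 3: Descriptive Statistics and Salary Distribution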
# Basic descriptive statistics
comp_df.describe()
# Distribution of Compensation Levels
plt.figure(figsize=(10, 6))
sns.histplot(comp_df['salary'], bins=30, kde=True)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()
Step 4: Compensation by Department and Job Title
# Average salary by department
avg_salary_dept = comp_df.groupby('department')['salary'].mean().reset_index()
# Plot average salary by department
plt.figure(figsize=(12, 8))
sns.barplot(data=avg_salary_dept, x='department', y='salary')
plt.title('Average Salary by Department')
plt.xlabel('Department')
plt.ylabel('Average Salary')
plt.xticks(rotation=45)
plt.show()
# Average salary by job title
avg_salary_title = comp_df.groupby('job_title')['salary'].mean().reset_index()
# Plot average salary by job title
plt.figure(figsize=(12, 10))
sns.barplot(data=avg_salary_title, x='salary', y='job_title')
plt.title('Average Salary by Job Title')
plt.xlabel('Average Salary')
plt.ylabel('Job Title')
plt.show()
Step 5: Gender Pay Gap Analysis
# Average salary by gender
avg_salary_gender = comp_df.groupby('gender')['salary'].mean().reset_index()
# Plot average salary by gender
plt.figure(figsize=(8, 6))
sns.barplot(data=avg_salary_gender, x='gender', y='salary')
plt.title('Average Salary by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Salary')
plt.show()
Step 6: Benefits Analysis
Assuming the dataset has columns related to benefits like 'health_insurance', 'retirement_plan', 'paid_time_off', etc.
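A minimal sketch, assuming each benefit column is binary (1 or True when the employee is enrolled):
# Share of employees enrolled in each benefit
benefit_cols = ['health_insurance', 'retirement_plan', 'paid_time_off']
benefit_rates = comp_df[benefit_cols].mean().sort_values(ascending=False)
plt.figure(figsize=(8, 6))
sns.barplot(x=benefit_rates.values, y=benefit_rates.index)
plt.title('Benefit Enrollment Rates')
plt.xlabel('Share of Employees Enrolled')
plt.ylabel('Benefit')
plt.show()
Step 7: Compensation vs Performance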
# Scatter plot of salary vs performance score
plt.figure(figsize=(10, 6))
sns.scatterplot(data=comp_df, x='performance_score', y='salary')
plt.title('Salary vs Performance Score')
plt.xlabel('Performance Score')
plt.ylabel('Salary')
plt.show()
# Average salary by performance score
avg_salary_perf = comp_df.groupby('performance_score')['salary'].mean().reset_index()
# Plot average salary by performance score
plt.figure(figsize=(8, 6))
sns.barplot(data=avg_salary_perf, x='performance_score', y='salary')
plt.title('Average Salary by Performance Score')
plt.xlabel('Performance Score')
plt.ylabel('Average Salary')
plt.show()
Conclusion
This analysis helps identify trends and insights such as average compensation by department, job title, gender, benefits distribution, and correlation between compensation and performance. This structured approach enables a comprehensive understanding of compensation and benefits within the organization.
Predictive Modeling for Employee Attrition
You are now ready to create a predictive model based on the cleaned and preprocessed HR dataset. The goal is to predict whether an employee will leave the company (attrition).
Step 1: Import Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Step 2: Prepare the Dataset
Assuming your data is in a DataFrame named df, and the target variable (attrition) is a column named Attrition.
# Separate features and target; one-hot encode any categorical features
# so the scaler and model receive numeric input
X = pd.get_dummies(df.drop('Attrition', axis=1), drop_first=True)
y = df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 3: Build and Train the Model
Using Random Forest Classifier as an example:
# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf_model.fit(X_train, y_train)
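Step 4: Evaluate the Model
A short evaluation pass with the metrics imported in Step 1:
# Predict on the test set and report standard classification metrics
y_pred = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))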
Use the classification_report and confusion_matrix to understand the performance of your model. The accuracy_score gives a quick metric of how well your model is doing.
This implementation covers the complete pipeline for predictive modeling of employee attrition using a Random Forest Classifier in Python. You can adjust the model or hyperparameters as necessary to improve performance.
Advanced Analytics with Machine Learning
Step 1: Preparing the Dataset for Advanced Analytics
Ensure your dataset is loaded and preprocessed correctly. Assuming the dataset from the previous sections is already cleaned and ready:
# Assuming 'hr_data' is your cleaned and preprocessed Pandas DataFrame
import pandas as pd
# Load preprocessed data
# hr_data = pd.read_csv('preprocessed_hr_data.csv')
Step 2: Split Dataset into Features and Target Variable
Here, we'll consider 'Attrition' as the target variable for classification tasks:
# Features and target variable
X = hr_data.drop(columns=['Attrition'])
y = hr_data['Attrition']
Step 3: Train-Test Split
Split the data into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
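Step 4: Model Training, Hyperparameter Tuning, and Evaluation
A sketch using GridSearchCV over a Random Forest, assuming the features in X are already numeric (one-hot encode categorical columns first if they are not):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
# A small grid keeps the search fast in Colab
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
# Evaluate the tuned model on the held-out test set
best_model = grid_search.best_estimator_
print(classification_report(y_test, best_model.predict(X_test)))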
By following these steps, you should have successfully completed an advanced machine learning analysis on your HR dataset in Google Colab. The implementation provided includes model training, hyperparameter tuning, and evaluation, which helps in deriving meaningful insights and making data-driven decisions.
Reporting and Dashboard Creation
1. Import Necessary Libraries
Ensure the following libraries are available. Most come pre-installed in Colab, but dash typically does not, so install it first with !pip install dash if needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from dash import Dash, html, dcc
2. Example Data Import
Assuming the HR dataset has been preprocessed, let's load the cleaned dataset.
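# Load the cleaned dataset (adjust the path to your file)
df = pd.read_csv('preprocessed_hr_dataset.csv')
3. Build the Dashboard Layout
Define a Dash app with a few introductory graphs. The column names below (YearsAtCompany, Attrition, Department, EmployeeCount) are assumed to exist in your cleaned dataset; adjust them to match your columns.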
app = Dash(__name__)
app.layout = html.Div(children=[
    html.H1(children='HR Data Dashboard'),
    dcc.Graph(
        id='example-graph',
        figure=px.histogram(df, x='YearsAtCompany', title='Employees by Years At Company')
    ),
    dcc.Graph(
        id='attrition-rate',
        figure=px.pie(df, names='Attrition', title='Attrition Rate')
    ),
    dcc.Graph(
        id='dept-distribution',
        # Assumes an 'EmployeeCount' column (often a constant 1 per row)
        figure=px.bar(df, x='Department', y='EmployeeCount', title='Department Distribution')
    )
])
if __name__ == '__main__':
    # app.run is the current entry point; older Dash versions use app.run_server
    app.run(debug=True)
4. Combining and Serving the Dashboard
Ensure the DataFrame manipulations and visualizations are cohesive, then integrate them into a single running dashboard.
# Additional graphs and components as needed
app.layout = html.Div(children=[
    html.H1(children='HR Data Dashboard'),
    dcc.Tabs([
        dcc.Tab(label='Overview', children=[
            html.Div([
                dcc.Graph(
                    id='overview-bar',
                    figure=px.bar(df, x='JobRole', y='MonthlyIncome', title='Monthly Income by Job Role')
                ),
                dcc.Graph(
                    id='overview-pie',
                    figure=px.pie(df, names='Gender', title='Gender Distribution')
                )
            ])
        ]),
        dcc.Tab(label='Attrition Analysis', children=[
            html.Div([
                dcc.Graph(
                    id='attrition-histogram',
                    figure=px.histogram(df, x='Age', color='Attrition', barmode='group', title='Attrition by Age')
                ),
                dcc.Graph(
                    id='attrition-dept',
                    # Assumes an 'AttritionRate' column computed beforehand
                    # (e.g., the attrition share per department via groupby)
                    figure=px.bar(df, x='Department', y='AttritionRate', title='Attrition Rate by Department')
                )
            ])
        ]),
        # Additional tabs can be defined here
    ])
])
if __name__ == '__main__':
    # app.run is the current entry point; older Dash versions use app.run_server
    app.run(debug=True, port=8050)
Additional Sections
You can expand with more plots and computations as needed and add them to the dashboard layout.
With this implementation, you should be able to run an interactive HR report and visualization dashboard directly in Google Colab or any local environment supporting Plotly and Dash. The provided code snippets cover fundamental aspects of reporting and dashboard creation from loading data to rendering interactive graphs.