Prompt
Answer
Pseudo Code for Ship Cargo Loading Volume Prediction
This pseudo code translates the provided Python script into a structured format, outlining the major steps taken to preprocess the data and generate features for a machine learning task focused on predicting cargo loading volumes.
Imports and Warnings Management
- Import Required Libraries:
- Import data manipulation libraries: Pandas, NumPy
- Import machine learning libraries: LightGBM, XGBoost
- Import metrics and validation techniques: Mean Squared Error, StratifiedKFold, KFold, GroupKFold
- Manage warnings by ignoring them.
Constants
- Define a constant for the label (target variable): LABEL is set to '装载量'.
Data Loading
Read Training and Testing Data:
- Load the training dataset from a specified path with appropriate encoding.
- Load the testing dataset from a specified path with appropriate encoding.
Combine Training and Testing Data:
- Concatenate training and testing DataFrames into one.
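The loading and concatenation steps can be sketched as follows. To keep the example self-contained, the two CSV files are replaced by small in-memory byte strings encoded as GBK; the sample rows and the use of `ignore_index=True` are assumptions for illustration, not taken from the original script.

```python
import io

import pandas as pd

# Toy stand-ins for the real train/test CSV files (assumed content).
train_bytes = "船舶ID,载重吨,装载量\n1,5000,3200\n2,8000,6100\n".encode("gbk")
test_bytes = "船舶ID,载重吨\n3,7000\n".encode("gbk")

# Read both files with the same encoding the pseudo code uses.
df_train = pd.read_csv(io.BytesIO(train_bytes), encoding="gbk")
df_test = pd.read_csv(io.BytesIO(test_bytes), encoding="gbk")

# Stack train on top of test so that features are computed over both.
# ignore_index=True (an assumption) gives a clean 0..n-1 index.
df = pd.concat([df_train, df_test], ignore_index=True)

# Test rows carry no label, so the target column is NaN for them.
print(df)
```

Because the test set has no '装载量' column, the rows that came from it can later be recovered by selecting the rows where the label is missing.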
Data Preprocessing
Handle Missing Values and Type Conversion:
- Replace ' None' in departure time with NaN and convert to float.
- Calculate the time difference: time_diff is the departure time minus the arrival time.
Convert Timestamps:
- Convert '进泊时间' and '离泊时间' to datetime format using Unix timestamps.
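A minimal sketch of these preprocessing steps on toy data is shown below. The sample values are invented, and the timestamps are assumed to be in seconds (`unit="s"`); the source only says they are Unix timestamps.

```python
import numpy as np
import pandas as pd

# Toy data: departure time arrives as text, with ' None' marking a gap.
df = pd.DataFrame({
    "进泊时间": [1700000000, 1700003600, 1700007200],
    "离泊时间": ["1700007200", " None", "1700010800"],
})

# Replace the ' None' placeholder with NaN, then convert to float.
df["离泊时间"] = df["离泊时间"].replace(" None", np.nan).astype(float)

# Time at berth: departure minus arrival (in seconds under the assumption above).
df["time_diff"] = df["离泊时间"] - df["进泊时间"]

# Convert both columns to datetime; missing values become NaT.
for col in ["进泊时间", "离泊时间"]:
    df[col] = pd.to_datetime(df[col], unit="s")
```

The difference is computed before the datetime conversion so that `time_diff` stays a plain numeric column.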
Feature Engineering
Extract Time Features:
- For each time variable ('进泊时间', '离泊时间'):
- Extract hour, day of week, and quarter.
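The time feature extraction can be sketched with the pandas `.dt` accessor. The sample dates are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "进泊时间": pd.to_datetime(["2024-01-01 08:30:00", "2024-06-15 23:05:00"]),
    "离泊时间": pd.to_datetime(["2024-01-02 10:00:00", "2024-06-16 04:45:00"]),
})

for f in ["进泊时间", "离泊时间"]:
    df[f + "_hour"] = df[f].dt.hour            # 0..23
    df[f + "_dayofweek"] = df[f].dt.dayofweek  # Monday = 0, Sunday = 6
    df[f + "_quarter"] = df[f].dt.quarter      # 1..4
```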
Encode Categorical Variables:
- Factorize the categorical ship type codes (A and B) into integer codes 'A_le' and 'B_le'. 'AB_le' is assumed to be the factorized combination of both codes.
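A small sketch of the encoding step using `pd.factorize`, which assigns integer codes in order of first appearance. The sample codes are invented, and building 'AB_le' from the concatenated A and B codes is an assumption, since the source never shows how that column is created.

```python
import pandas as pd

df = pd.DataFrame({
    "船舶类型代码A": ["X", "Y", "X", "Z"],
    "船舶类型代码B": ["p", "p", "q", "q"],
})

# pd.factorize returns (codes, uniques); only the integer codes are kept.
df["A_le"] = pd.factorize(df["船舶类型代码A"])[0]
df["B_le"] = pd.factorize(df["船舶类型代码B"])[0]

# Assumption: AB_le encodes the pair (A, B) as a single category.
df["AB_le"] = pd.factorize(df["船舶类型代码A"] + "_" + df["船舶类型代码B"])[0]
```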
Generate Statistical Features:
- For selected numeric features '载重吨' and 'time_diff':
- For each categorical feature ('A_le', 'B_le', 'AB_le'):
- Calculate and store mean, standard deviation, max, and min.
Additional Statistical Relationships:
- For each numeric feature, calculate mean, standard deviation, max, and min against each categorical feature.
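The group statistics can be sketched with `groupby(...).transform(...)`, which broadcasts each group's statistic back onto its rows. Whether the original script uses `transform` or an aggregate-then-merge pattern is not stated, so this is one possible implementation on invented data.

```python
import pandas as pd

df = pd.DataFrame({
    "A_le": [0, 0, 1, 1],
    "载重吨": [100.0, 300.0, 50.0, 70.0],
})

num_features = ["载重吨"]  # the pseudo code also uses 'time_diff'
cat_features = ["A_le"]    # the pseudo code also uses 'B_le' and 'AB_le'

for num_f in num_features:
    for cat_f in cat_features:
        for stat in ["mean", "std", "max", "min"]:
            # One new column per (category, numeric feature, statistic).
            df[f"{cat_f}_{num_f}_{stat}"] = df.groupby(cat_f)[num_f].transform(stat)
```

Note that pandas computes the sample standard deviation (`ddof=1`) here, so a group with a single row gets NaN for its `_std` column.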
Location Feature Extraction:
- Parse location strings into separate float values (loc0 and loc1).
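A sketch of the location parsing is given below. The source does not show the format of the location string, so the comma-separated "first,second" layout and the helper name `parse_coord` are purely hypothetical.

```python
import pandas as pd

# Assumed format: two numbers separated by a comma.
df = pd.DataFrame({"泊位位置": ["121.50,31.20", "120.10,30.05"]})


def parse_coord(text: str, index: int) -> float:
    """Return the index-th comma-separated component as a float (hypothetical helper)."""
    return float(text.split(",")[index])


df["loc0"] = df["泊位位置"].map(lambda s: parse_coord(s, 0))
df["loc1"] = df["泊位位置"].map(lambda s: parse_coord(s, 1))
```

If the real strings use a different delimiter or include brackets, only `parse_coord` would need to change.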
Combination Features:
- Calculate area as the product of ship length and width.
- Count occurrences grouped by ship ID (for example, the number of voyage IDs per ship) and by ship type codes.
Group Counts:
- For each unique value of ship ID, width, and loaded tonnage, count occurrences.
Longitude/Latitude Ratio:
- Calculate a derived geographic feature as the ratio of loc0 to loc1.
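The combination, count, and ratio features can be sketched as follows on invented data. The column name '船舶ID_count' is a hypothetical choice for the per-value count; the other names follow the pseudo code.

```python
import pandas as pd

df = pd.DataFrame({
    "船舶ID": [1, 1, 2],
    "航次ID": [10, 11, 12],
    "船长": [100.0, 100.0, 80.0],
    "船宽": [20.0, 20.0, 15.0],
    "loc0": [120.0, 120.0, 90.0],
    "loc1": [30.0, 30.0, 30.0],
})

# Area as ship length times ship width.
df["面积"] = df["船长"] * df["船宽"]

# Number of voyage IDs recorded for each ship, broadcast back to every row.
df["船舶ID_航次ID_count"] = df.groupby("船舶ID")["航次ID"].transform("count")

# Frequency of each ship ID value (hypothetical column name).
df["船舶ID_count"] = df["船舶ID"].map(df["船舶ID"].value_counts())

# Ratio of the two parsed location components.
df["longitude/Latitude"] = df["loc0"] / df["loc1"]
```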
Pseudo Code Representation
BEGIN
# Import libraries
IMPORT pandas AS pd
IMPORT numpy AS np
IMPORT lightgbm AS lgb
IMPORT xgboost AS xgb
IMPORT mean_squared_error AS mse
IMPORT StratifiedKFold, KFold, GroupKFold
# Suppress warnings
IMPORT warnings
SET warnings filter TO ignore
# Define LABEL
SET LABEL = '装载量'
# Load datasets
df_train = READ_CSV('./data/船舶装卸货量预测-训练集-20240611.csv', encoding='gbk')
df_test = READ_CSV('./data/船舶装卸货量预测-测试集X-20240611.csv', encoding='gbk')
# Combine datasets
df = CONCATENATE(df_train, df_test)
# Preprocess data
REPLACE ' None' WITH NaN IN df['离泊时间']
CONVERT df['离泊时间'] TO float
df['time_diff'] = df['离泊时间'] - df['进泊时间']
CONVERT df['进泊时间'] AND df['离泊时间'] TO datetime
# Extract time features
FOR each f IN ['进泊时间', '离泊时间']:
    df[f + '_hour'] = EXTRACT HOUR FROM df[f]
    df[f + '_dayofweek'] = EXTRACT DAY OF WEEK FROM df[f]
    df[f + '_quarter'] = EXTRACT QUARTER FROM df[f]
# Non-numeric encoding
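df['A_le'] = FACTORIZE(df['船舶类型代码A'])
df['B_le'] = FACTORIZE(df['船舶类型代码B'])
# Assumed: 'AB_le' (used below) encodes the combined A and B codes
df['AB_le'] = FACTORIZE(df['船舶类型代码A'] + df['船舶类型代码B'])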
# Generate cross statistics
FOR each num_f IN ['载重吨', 'time_diff']:
    FOR each cat_f IN ['A_le', 'B_le', 'AB_le']:
        df[cat_f + '_' + num_f + '_mean'] = GROUPBY(cat_f, df[num_f], APPLY MEAN)
        # Repeat for std, max, min using a similar structure
# Create location features
df['loc0'] = MAP(df['泊位位置'], EXTRACT FIRST VALUE AS float)
df['loc1'] = MAP(df['泊位位置'], EXTRACT SECOND VALUE AS float)
# Combination features
df['面积'] = df['船长'] * df['船宽']
df['船舶ID_航次ID_count'] = GROUPBY(df['船舶ID'], COUNT(df['航次ID']))
# Repeat for other combination statistics using GROUPBY
# Calculate longitude/latitude
df['longitude/Latitude'] = df['loc0'] / df['loc1']
END
Notes
- Each block of the pseudo code corresponds to one stage of the original script, in the same order.
- The pseudo code is meant to make the pipeline easy to discuss or reimplement without reading the Python source.
Description
This pseudo code outlines the steps to preprocess data and create features for a machine learning model aimed at predicting ship cargo loading volumes. It covers data loading, missing value handling, feature extraction, and statistical analysis in a structured format.