Table 3 Candidate model pipeline framework and data transformation steps produced by SageMaker Autopilot©

From: Machine learned daily life history classification using low frequency tracking data and automated modelling pipelines: application to North American waterfowl

| Model # | Framework | Data transformation steps |
|---|---|---|
| 1 | XGBoost | Create threshold one-hot encoding (threshold = 30) for categorical/sparse features |
| 2 | LinearLearner | Convert features with extreme values to a uniform distribution; feature dimension reduction using PCA |
| 3 | LinearLearner | Scale and center features while accounting for data sparsity only |
| 4 | XGBoost | Create threshold one-hot encoding (threshold = 5) for categorical/sparse features |
| 5 | LinearLearner | Create threshold one-hot encoding (threshold = 6) for sparse features; feature dimension reduction using PCA |
| 6 | LinearLearner | Create threshold one-hot encoding (threshold = 7) for categorical/sparse features |
| 7 | LinearLearner | Create threshold one-hot encoding (threshold = 7) for categorical/sparse features; feature dimension reduction using PCA |
| 8 | XGBoost | Create threshold one-hot encoding (threshold = 7) for categorical/sparse features |
| 9 | XGBoost | Create threshold one-hot encoding (threshold = 9) for categorical/sparse features |
| 10 | MLP | Scale and center features while accounting for data sparsity only |

  1. Data processing steps utilize functions from the AWS scikit-learn extension (https://github.com/aws/sagemaker-scikit-learn-extension, copyright AWS 2019). Models represent three frameworks: Extreme Gradient Boosting (XGBoost); a linear model trained by stochastic gradient descent (LinearLearner); and a Multi-Layer Perceptron (MLP). Data transformation for each candidate pipeline automatically included imputation of missing values, although none were present in the training data. Each candidate model also included a processing step to scale and center features while accounting for data sparsity.
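
The Autopilot-generated pipeline code is not reproduced in the article, but the sketch below approximates two of the candidate pipelines with plain scikit-learn components so the transformation steps in Table 3 are concrete. `OneHotEncoder(min_frequency=...)`, `HistGradientBoostingClassifier`, `QuantileTransformer`, and `SGDClassifier` are stand-ins we chose for the extension's `ThresholdOneHotEncoder`, for XGBoost, for the uniform extreme-value transform, and for LinearLearner, respectively; every parameter beyond the thresholds listed in the table is an illustrative assumption, not a setting taken from the study.

```python
# Illustrative approximations of candidate pipelines 1 and 2 (Table 3),
# built from plain scikit-learn components as stand-ins for the
# sagemaker-scikit-learn-extension transformers Autopilot actually emits.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

# Model 1 analog: threshold one-hot encoding (threshold = 30) -> XGBoost.
# min_frequency=30 pools categories seen fewer than 30 times, approximating
# ThresholdOneHotEncoder; HistGradientBoostingClassifier stands in for XGBoost.
model_1 = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot",
          OneHotEncoder(min_frequency=30, handle_unknown="infrequent_if_exist"),
          [0])],                      # column 0 is the categorical feature
        remainder="passthrough",
        sparse_threshold=0,           # force a dense output matrix
    )),
    ("gbm", HistGradientBoostingClassifier()),
])

# Model 2 analog: map extreme-valued features onto a uniform distribution,
# reduce dimension with PCA, then fit an SGD-trained linear model
# (the LinearLearner analog).
model_2 = Pipeline([
    ("uniform", QuantileTransformer(output_distribution="uniform",
                                    n_quantiles=100)),
    ("pca", PCA(n_components=0.95)),  # retained variance is an assumed setting
    ("linear", SGDClassifier(loss="log_loss", max_iter=1000)),
])

# Synthetic stand-in data: one categorical column plus numeric features,
# in place of the tracking-derived features used in the study.
rng = np.random.default_rng(0)
X_cat = rng.integers(0, 10, size=(500, 1)).astype(float)
X_num = rng.normal(size=(500, 19))
X = np.hstack([X_cat, X_num])
y = rng.integers(0, 2, size=500)

model_1.fit(X, y)
model_2.fit(X, y)
print(model_1.score(X, y), model_2.score(X, y))
```

The division of labor mirrors the table: tree-based pipelines get threshold one-hot encoding and no scaling, while the linear pipelines get distribution-normalizing transforms and optional PCA before the estimator.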