DS3: How do I tune the hyperparameter values?

Short version:

  1. Hyperparameters in AI/ML models drive the learning process and, as such, cannot be inferred from the training data. Examples include the topology and size of a neural network, or the learning rate and batch size.

  2. The number of possible combinations of hyperparameter values can be large. Since each combination requires a full cycle of training and testing of an AI/ML model, a brute-force grid search, in which every possible combination is checked, may be infeasible. Random search (where only randomly chosen combinations of hyperparameter values are evaluated), Bayesian optimization [Bergstra et al., 2010; Snoek et al., 2012] and evolutionary optimization algorithms [Bergstra et al., 2010; Such et al., 2017; Han et al., 2021] are the recommended techniques; a minimal sketch of such a workflow follows this list. Early-stopping hyperparameter optimization algorithms periodically prune low-performing models [Jamieson et al., 2015; Li et al., 2020]. Additional heuristic approaches for hyperparameter tuning were evaluated in [Polap et al., 2022].

  3. Evaluate the model using n-fold cross-validation, aiming to preserve statistical characteristics such as class distribution or value ranges in both the training and test partitions.

  4. Evaluate the robustness of the model with the standard deviation or the confidence interval of the performance metric calculated across multiple test data splits. For a quality model, the standard deviation of the performance metric should be low. Note the ranges of input values for which the prediction quality is high or low.

  5. It is highly recommended to assess the generalization performance of an AI/ML model on a data set completely independent from the data sets used in hyperparameter tuning.
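
As an illustration of steps 2-5, the sketch below combines a random search over a placeholder hyperparameter space with stratified n-fold cross-validation and a final check on a held-out test set. It assumes scikit-learn is available; the random-forest model, the parameter ranges, and the synthetic data are illustrative assumptions only, not recommendations for any particular problem.

    # Sketch of steps 2-5: random search with stratified k-fold
    # cross-validation, followed by a final check on a held-out test set.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                         train_test_split)

    # Illustrative synthetic data; replace with the actual dataset.
    X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)

    # Step 5: keep a test set completely independent from the tuning data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Step 2: random search over a placeholder hyperparameter space.
    param_distributions = {
        "n_estimators": [50, 100, 200, 400],
        "max_depth": [None, 5, 10, 20],
        "min_samples_leaf": [1, 2, 5, 10],
    }

    # Step 3: stratified n-fold cross-validation preserves class distribution.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions,
        n_iter=20,          # number of randomly sampled combinations
        scoring="accuracy",
        cv=cv,
        random_state=0,
    )
    search.fit(X_train, y_train)

    # Step 4: robustness of the best setting as mean +/- standard deviation
    # of the per-fold scores.
    best = search.best_index_
    mean = search.cv_results_["mean_test_score"][best]
    std = search.cv_results_["std_test_score"][best]
    print("best parameters:", search.best_params_)
    print(f"cross-validated accuracy: {mean:.3f} +/- {std:.3f}")

    # Step 5: final, independent estimate of generalization performance.
    print(f"held-out test accuracy: {search.score(X_test, y_test):.3f}")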

 

Long version:

 

Hyperparameters are parameters that define a machine learning (ML) algorithm. They play a crucial role in determining the model's complexity and the algorithmic details of the learning process. One of the primary reasons for tuning hyperparameters is to reduce overfitting, which occurs when a model that is too complex performs well on the training data but poorly on unseen data.

To tune hyperparameters, the available data is divided into two sets: a test set, reserved exclusively for final evaluation, and a training set, used to optimize hyperparameters and train the model. The partitioning typically aims to preserve statistical characteristics, such as class distribution or value ranges, in both sets.
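
For a classification task, such a stratified split can be obtained, for example, with scikit-learn's train_test_split; the imbalanced synthetic data and the 80/20 ratio below are assumptions used only to show that the class proportions are preserved in both partitions.

    # Sketch of a stratified train/test split that preserves the class
    # distribution in both partitions (data and split ratio are illustrative).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Class proportions should be nearly identical in both sets.
    print("train:", np.bincount(y_train) / len(y_train))
    print("test: ", np.bincount(y_test) / len(y_test))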

Since the number of potential hyperparameter combinations is typically too large to search and evaluate exhaustively, data scientists use heuristics to reduce the search space, guided by published ML models that use the same algorithm on comparable data, by personal experience, and by recommended default values. Recently, various AutoML packages that select ML algorithms and tune their hyperparameters have become available; however, their algorithms and hyperparameter ranges may not be optimized for the particular dataset at hand. Once the search space has been narrowed down, a search strategy needs to be chosen. Grid search, for instance, exhaustively evaluates all hyperparameter combinations in the reduced space. Where the search space is still too large, stochastic algorithms such as random search, genetic algorithms, or Bayesian optimization can be used, albeit requiring some user configuration.
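
As one example of the stochastic strategies mentioned above, the sketch below uses the Optuna library, whose default sampler implements a Bayesian-style (tree-structured Parzen estimator) optimization; the gradient-boosting model, the search ranges, and the trial budget are placeholder assumptions.

    # Sketch of Bayesian-style hyperparameter optimization with Optuna
    # (TPE sampler by default); model, ranges and trial budget are
    # illustrative assumptions only.
    import optuna
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)

    def objective(trial):
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 50, 400),
            "max_depth": trial.suggest_int("max_depth", 2, 8),
        }
        model = GradientBoostingClassifier(random_state=0, **params)
        # 5-fold cross-validated accuracy is the quantity being maximized.
        return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=30)
    print("best parameters:", study.best_params)
    print("best cross-validated accuracy:", study.best_value)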

A sampled set of hyperparameters is then evaluated using an additional split of the training data. An example is n-fold cross-validation, where the training dataset is partitioned into n equal-sized subsets; the hyperparameter set is evaluated by training and testing a model n times, each time using one of the n subsets as the test set and the remaining data as the training set. The average performance score across all n tests serves as the final score for that particular hyperparameter set. There is no consensus on which performance score to use here; in practice the choice is often not critical, since different scores are usually directly correlated (an increase in one implies an increase in the other). Additional statistics, such as the standard deviation of the metric across folds, can be computed to assess the robustness of the ML model.
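
The sketch below illustrates this evaluation for one fixed hyperparameter set: stratified 5-fold cross-validation with two commonly used (and typically correlated) scores, reporting the mean and standard deviation across folds; the model and data are again placeholder assumptions.

    # Sketch of n-fold cross-validation for a fixed hyperparameter set,
    # reporting mean and standard deviation of two scores across folds.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    X, y = make_classification(n_samples=1000, random_state=0)
    model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["accuracy", "f1_macro"])

    # A low standard deviation across folds indicates a robust model.
    for name in ("accuracy", "f1_macro"):
        fold_scores = scores[f"test_{name}"]
        print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")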