DS4: How do I validate a model?

The short version:

 

  1. Refer to the DOME recommendations, especially the section on Data and Evaluation. The same advice applies to any community-accepted standard for ML (DOME is specifically designed for supervised ML in biology).

  2. Based on objective metrics and on the overall reporting of the ML process, quickly reject models that fail either a quantitative or a qualitative threshold. A deceptive model is worse than a bad one.

  3. Connect the validation to regulatory frameworks that are already in place and that aim to assess/validate ML processes.

A longer version (adapted from the DOME recommendations text):

 

State-of-the-art ML models are often capable of memorizing all the variation in the training data. When evaluated on data they were exposed to during training, such models create the illusion of mastering the task at hand; however, when tested on an independent set of data (termed a test or validation set), their performance is usually far less impressive, revealing the low generalization power of the model. To tackle this problem, the initial data should be divided randomly into non-overlapping parts. The simplest approach is to have independent training and testing sets (and possibly a third validation set). Alternatively, cross-validation or bootstrapping techniques, which repeatedly choose a new training/testing split from the available data, are often considered the preferred solution.
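
A minimal sketch of the splitting strategies described above, assuming scikit-learn and NumPy as tooling; the placeholder data and the logistic regression model are invented for illustration and are not part of the DOME text:

    # A minimal sketch, assuming scikit-learn/NumPy and placeholder data.
    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score, KFold
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))       # placeholder features
    y = rng.integers(0, 2, size=200)     # placeholder binary labels

    # Hold-out split: the test set is never seen during training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    model = LogisticRegression().fit(X_train, y_train)
    print("hold-out accuracy:", model.score(X_test, y_test))

    # k-fold cross-validation: every sample is held out exactly once.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
    print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))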

 

There are two types of evaluation scenarios in biological research. The first is the experimental validation of the predictions made by the ML model in the laboratory. This is highly desirable but beyond the scope of many ML studies. The second is a computational assessment of the model performance using established metrics. The following deals with the latter. There are a few possible risks in computational evaluation.

 

To start with performance metrics, the quantifiable indicators of a model’s ability to solve the given task: there are dozens of metrics available for assessing different ML classification and regression problems. The plethora of options, combined with the domain-specific expertise that may be required to select the appropriate ones, can lead to the selection of inadequate performance measures. For many biological ML problems there are critical assessment communities advocating certain performance metrics, for example the Critical Assessment of Protein Function Annotation (CAFA) and the Critical Assessment of Genome Interpretation (CAGI), and we recommend that a new algorithm use metrics from the literature and from community-promulgated critical assessments.

 

In the absence of literature, the following metrics are usually good indicators:

  • Classification metrics: for binary classification, true positives (tp), false positives (fp), false negatives (fn) and true negatives (tn) together form the confusion matrix. As all classification measures can be calculated from combinations of these four basic values, the confusion matrix should be reported as a core metric. Several derived measures and plots, such as the receiver operating characteristic (ROC) curve and the area under it (AUC), should be used to evaluate the ML method, and these metrics need to be adapted when moving to a multi-class problem.

  • Regression metrics: ML regression attempts to produce predicted values (p) that match the experimental values (y), and the metrics capture the difference between the two in various ways, for example the root mean squared error (RMSE) and the mean absolute error (MAE). Alternatively, a plot can provide a visual way to represent the differences. It is advisable to report all of these measures in any ML work (a code sketch of these metrics follows this list).
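
The following sketch computes the metrics named above, assuming scikit-learn; the small arrays of labels, scores and values are invented placeholders, and the Matthews correlation coefficient (MCC) is included only as one example of a measure derived from the four confusion-matrix counts:

    # A minimal sketch of the metrics above, assuming scikit-learn.
    import numpy as np
    from sklearn.metrics import (confusion_matrix, matthews_corrcoef,
                                 roc_auc_score, mean_squared_error,
                                 mean_absolute_error)

    # Classification: placeholder labels, hard predictions and scores.
    y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_pred  = np.array([1, 0, 0, 1, 0, 1, 1, 0])
    y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # core metric
    print("confusion matrix (tn, fp, fn, tp):", tn, fp, fn, tp)
    print("MCC:", matthews_corrcoef(y_true, y_pred))
    print("AUC:", roc_auc_score(y_true, y_score))

    # Regression: placeholder experimental (y) and predicted (p) values.
    y = np.array([1.0, 2.5, 3.0, 4.2])
    p = np.array([1.1, 2.3, 3.4, 3.9])
    print("RMSE:", mean_squared_error(y, p) ** 0.5)
    print("MAE :", mean_absolute_error(y, p))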

 

Once the performance metrics are decided, methods published in the same biological domain must be cross-compared using appropriate statistical tests (for example, Student’s t-test) and confidence intervals. Then, to prevent the release of ML methods that appear sophisticated but perform no better than simpler algorithms, baseline methods should be compared against the ‘sophisticated’ method and shown to be statistically inferior (for example, a shallow versus a deep neural network).
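
A sketch of such a baseline comparison, assuming scikit-learn and SciPy; the two models (logistic regression versus a small multilayer perceptron) and the random placeholder data are illustrative stand-ins, not a recommendation from the DOME text:

    # A sketch of a baseline comparison, assuming scikit-learn and SciPy.
    import numpy as np
    from scipy import stats
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))       # placeholder features
    y = rng.integers(0, 2, size=300)     # placeholder labels

    # Score both methods on exactly the same cross-validation folds.
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    baseline = cross_val_score(LogisticRegression(), X, y, cv=cv)
    sophisticated = cross_val_score(MLPClassifier(max_iter=1000, random_state=0),
                                    X, y, cv=cv)

    # Paired t-test on per-fold scores, plus a 95% CI on the mean difference.
    diff = sophisticated - baseline
    t_stat, p_value = stats.ttest_rel(sophisticated, baseline)
    ci = stats.t.interval(0.95, len(diff) - 1,
                          loc=diff.mean(), scale=stats.sem(diff))
    print("mean difference: %.3f  95%% CI: (%.3f, %.3f)  p = %.3f"
          % (diff.mean(), ci[0], ci[1], p_value))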