MS3: How do I make sure the model produces sensible answers?

Last edited: 2022-07-21 by Chas Nelson

The short version

  1. Leverage multidisciplinary expertise throughout the development and use of an AI/ML system (e.g. data engineers, data scientists, domain experts, UI/UX designers, etc.)

  2. Ensure development datasets are representative of end-use data, i.e. real-world data

  3. Focus on the human-AI team (sometimes referred to as a ‘human-in-the-loop’ approach) throughout the development and use of an AI/ML system

  4. Monitor deployed models for performance (e.g. data drift) and adopt a continuous risk assessment and mitigation plan, which may include retraining models whilst being aware of the retraining risks (e.g. catastrophic forgetting).

The longer version

Note: much of the following has been adapted from: https://medium.com/gliff-ai/gmlp-good-machine-learning-practice-what-the-principles-mean-when-developing-medical-ai-77ce63f5407b co-written by Chas Nelson (CTO, Founder & Director, http://gliff.ai ) and Lucille Valentine (Head of Regulation and Compliance, http://gliff.ai )

Leverage multi-disciplinary expertise

  • AI/ML development should involve all relevant experts from the very start of the model development process and throughout the lifecycle of the system, including (indeed, especially) post-deployment, i.e. across the total AI/ML product lifecycle.

  • Such a multidisciplinary team can more easily assess the need for the AI/ML system; select an appropriate and representative dataset; ensure data annotation/labelling is completed by domain experts (i.e. creating the highest-quality dataset possible); decide how to measure the benefit of the AI/ML system; and determine how best to deliver the AI/ML system to the end user in a way that engenders trust and confidence as well as an understanding of the system's limitations.

  • Such a multidisciplinary team may work well with standard Agile approaches to development where domain experts might take the role of “Product Owners”.

  • Engagement with a wider customer base (additional domain experts and/or end-users) throughout the AI/ML product lifecycle (e.g. user acceptance testing) can ensure the AI/ML system works as expected and provides usable insights.

Use real world data

So that AI/ML model results can be appropriately generalised, the dataset has to be representative of the data the AI/ML model will receive once deployed. There’s no point developing and testing an AI/ML model purely on black-and-white images if its planned use is on colour images, even if the features that you think your ML model will use are shared across both data types.

Model developers and data scientists must ensure that the relevant characteristics of the intended real-world data are sufficiently represented in an adequately sized development dataset. Without considering all of the characteristics that could have an impact, datasets and the ML models trained on them can be biased, imbalanced or unrepresentative, leading to unusable or, worse, unsafe outcomes.

Ensuring that relevant characteristics are included in the datasets (i.e. metadata) allows model developers to assess (and perhaps mitigate) bias and imbalance, assess usability, and identify circumstances where the model may underperform (edge cases).
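
As a concrete illustration, the minimal sketch below flags under-represented subgroups in a dataset's metadata. The file name, the characteristics checked (scanner_model, age_group, label) and the 5% threshold are all illustrative assumptions to be replaced with whatever is relevant for your system.

```python
# Minimal sketch: flag under-represented subgroups in a dataset's metadata.
# The file name, column names and threshold are illustrative assumptions.
import pandas as pd

metadata = pd.read_csv("dataset_metadata.csv")  # hypothetical per-sample metadata

MIN_FRACTION = 0.05  # flag any subgroup below 5% of samples (arbitrary threshold)

for characteristic in ["scanner_model", "age_group", "label"]:
    fractions = metadata[characteristic].value_counts(normalize=True)
    for value, fraction in fractions.items():
        if fraction < MIN_FRACTION:
            print(f"Under-represented: {characteristic}={value} "
                  f"({fraction:.1%} of {len(metadata)} samples)")
```

A check like this does not remove bias by itself, but it makes imbalances visible early enough to collect more data or to document the limitation.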

The more metadata in a dataset, the more factors that can be thoroughly and robustly investigated. Conversely, the more metadata in, for example, a clinical dataset, the more personally identifiable information may be included too, creating a potential data privacy and security risk. One mitigation is to use techniques like end-to-end encryption of data, one of the highest standards for data security, which makes data breaches or leaks far less likely.

Focus on human-AI teams

There are two human-AI teams in ML medical devices: the developing team and the end-use team.

Developing human-AI teams should follow the “human-in-the-loop” principle, where an expert human guides the cyclical development of the ML, to ensure that good machine learning practices are implemented, that input datasets are high quality, and that the results of testing and validation are appropriate and reliable.

End-user human-AI teams need to be supported with clear, essential information on how to use the AI/ML system, when to use it and why to use it (as opposed to existing systems). The developing team should work with UI/UX designers to find the right way to present the insights the AI/ML model can offer so that they positively augment the end-user's existing processes. For example, with more transparent approaches, predictions can be displayed alongside features generated by the model and relevant training set instances. Presenting this information gives the user the ability to challenge prediction confidence and to be more confident in their subsequent decision making. The performance of the end-user human-AI team should be measurable and predictable and ensured by good software engineering and user experience practices.
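
One way to realise the "relevant training set instances" idea above is to retrieve, for each prediction, the training samples closest to the input in the model's feature space. The sketch below assumes pre-computed training feature vectors and sample identifiers are available; the file names and the explain helper are hypothetical.

```python
# Minimal sketch: return a prediction together with the most similar
# training instances in the model's feature space, for display to the user.
import numpy as np
from sklearn.neighbors import NearestNeighbors

train_features = np.load("train_features.npy")  # hypothetical: (n_samples, n_features)
train_ids = np.load("train_ids.npy")            # hypothetical sample identifiers

index = NearestNeighbors(n_neighbors=3).fit(train_features)

def explain(sample_features, prediction, confidence):
    """Bundle the prediction with its nearest training examples."""
    _, neighbour_idx = index.kneighbors(sample_features.reshape(1, -1))
    return {
        "prediction": prediction,
        "confidence": confidence,
        "similar_training_samples": list(train_ids[neighbour_idx[0]]),
    }
```

Shown in a UI, the retrieved examples give the end user something concrete to compare the prediction against rather than a bare confidence score.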

Here, the “human-in-the-loop” principle can be flipped and the aim should be to have an “AI-in-the-loop”. In this case the human end-user is in control and leads the process while the AI/ML system supports that human by providing interpretable outputs that can be used for the task at hand.

Monitor performance and mitigate risk

There are many reasons why the performance of an AI/ML system might change after deployment and release to the end-user human-AI team.

Changes in the underlying subjects, sensors or environment may impact the performance of the human-AI team. As such, models should be continuously monitored to ensure no degradation of performance is occurring that could negatively impact the human-AI team.
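
As one illustration of such monitoring, the sketch below compares the distribution of each input feature in a recent window of production data against a reference sample from training, using a two-sample Kolmogorov-Smirnov test. The file names and alert threshold are assumptions; in practice a dedicated monitoring service would run a check like this on a schedule.

```python
# Minimal drift-monitoring sketch: compare recent production feature
# distributions against a training-time reference, column by column.
import pandas as pd
from scipy.stats import ks_2samp

reference = pd.read_csv("training_features.csv")  # hypothetical training snapshot
recent = pd.read_csv("last_week_features.csv")    # hypothetical production window

ALERT_P_VALUE = 0.01  # arbitrary; tune to your tolerance for false alarms

for column in reference.columns:
    statistic, p_value = ks_2samp(reference[column], recent[column])
    if p_value < ALERT_P_VALUE:
        print(f"Possible drift in '{column}': KS={statistic:.3f}, p={p_value:.4f}")
```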

One common solution to this is to regularly retrain the AI/ML model on continuously updated datasets that are also subject to good machine learning practice. However, retraining AI/ML systems introduces potential risks, e.g. the AI/ML model might provide different results for the same input before and after retraining (e.g. catastrophic forgetting). Retraining of models should, therefore, undergo the same rigorous controls as the original model development, i.e. leverage multidisciplinary expertise and real world data, focussing on the human-AI team and so on.
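
A simple safeguard against this retraining risk is to compare the old and retrained models on a fixed held-out set before release and flag large disagreement for review. The sketch below assumes scikit-learn-style models saved with joblib; the file names and the 5% review threshold are illustrative assumptions.

```python
# Minimal sketch: measure how often a retrained model disagrees with the
# previous version on a fixed held-out set, before releasing it.
import joblib
import numpy as np

old_model = joblib.load("model_v1.joblib")   # hypothetical artefacts
new_model = joblib.load("model_v2.joblib")
X_holdout = np.load("holdout_features.npy")  # fixed held-out inputs

old_predictions = old_model.predict(X_holdout)
new_predictions = new_model.predict(X_holdout)

disagreement = np.mean(old_predictions != new_predictions)
print(f"Models disagree on {disagreement:.1%} of held-out samples")
if disagreement > 0.05:  # arbitrary review threshold
    print("Large behaviour change: review before releasing the retrained model")
```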

Performance measurement, risk assessment and risk mitigation need to be considered continuously throughout the development lifecycle and product lifetime of any AI/ML system. These processes may often be the same for both development and post-deployment monitoring but will differ based on the context of the AI/ML system.

One key practical impact is that training/test datasets and AI/ML models should be version controlled, with the ability to easily roll back to an earlier, better-performing version and/or to compare the performance of different versions of the AI/ML model trained on the same or different datasets. This brings AI/ML systems in line with industry-standard software engineering practices.
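
A minimal sketch of what such versioning could look like is given below: each release records the model artefact, a content hash of the training dataset and its evaluation metrics, so an earlier version can be identified and restored. The registry format, paths and metric name are assumptions; dedicated tools (e.g. DVC or MLflow) cover the same ground more robustly.

```python
# Minimal sketch: record model versions alongside a hash of the exact training
# data and the evaluation metrics, so versions can be compared or rolled back.
import hashlib
import json
from pathlib import Path

REGISTRY = Path("model_registry.json")  # hypothetical registry file

def dataset_hash(path):
    """Content hash so a model version is tied to the exact data it saw."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def register(version, model_path, dataset_path, metrics):
    """Append one release (model, dataset hash, metrics) to the registry."""
    entries = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    entries.append({
        "version": version,
        "model": model_path,
        "dataset_sha256": dataset_hash(dataset_path),
        "metrics": metrics,
    })
    REGISTRY.write_text(json.dumps(entries, indent=2))

def best_version(metric="auc"):
    """Pick the version to roll back to, here simply the best on one metric."""
    entries = json.loads(REGISTRY.read_text())
    return max(entries, key=lambda entry: entry["metrics"][metric])
```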