Quality and Governance of Clinical Data

Quality of clinical data and metadata

As discussed recently by Harrow et al. 2022, the FAIR data principles are gaining popularity and acceptance. It is easy to assume that implementing FAIR would be sufficient to drive the data management strategy in the life sciences. For example, a comprehensive assessment using the FAIR metrics can yield valuable recommendations for improving the quality of rare disease registries. However, the FAIR principles consider data quality only indirectly, at the level of provenance and adherence to community standards for the whole data set. These are likely to be insufficient measures of data quality, especially in strictly governed environments, such as submissions of clinical trial data to regulators by pharmaceutical companies, or electronic health records (EHRs) kept by hospitals. It is therefore not surprising that methods and dimensions for quality assessment have been developed for the reuse of clinical data from EHRs.

Data quality assessment of electronic health records (EHRs)

The approach to data quality assessment (DQA) of EHRs described by Weiskopf and Weng starts with an analysis of relevant abstracts from the biomedical literature in PubMed, to identify the dimensions of data quality and the common methods used to assess it. Five dimensions of data quality were identified: (i) completeness; (ii) correctness; (iii) concordance; (iv) plausibility; and (v) currency. These were mapped to seven methods for data quality assessment: (i) gold standard; (ii) data element agreement; (iii) element presence; (iv) data source agreement; (v) distribution comparison; (vi) validity check; and (vii) log review. The strongest evidence from this mapping links the dimensions of (i) completeness and (ii) correctness to the methods of (i) gold standard, (ii) data element agreement and (iii) element presence (Figure 11).


Figure 11. Mapping between dimensions of data quality and data quality assessment methods. Dimensions are listed on the left and methods of assessment on the right, both in decreasing order of frequency from top to bottom. The weight of the edge connecting a dimension and method indicates the relative frequency of that combination. Reproduced, with permission, from [Weiskopf 2013].
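To make two of the assessment methods listed above more concrete, the following is a minimal Python sketch, not the procedure used by Weiskopf and Weng. It assumes records are simple dictionaries keyed by a patient identifier. The function names, field names (`sex`, `birth_year`) and toy values are hypothetical and chosen only for illustration. Element presence is used as a measure of completeness, and data source agreement as a measure of concordance.

```python
from typing import Any, Mapping, Optional, Sequence

Record = Mapping[str, Any]


def element_presence(records: Sequence[Record], element: str) -> float:
    """Completeness via element presence: share of records with a non-empty value."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(element) not in (None, ""))
    return present / len(records)


def data_source_agreement(
    source_a: Mapping[str, Record], source_b: Mapping[str, Record], element: str
) -> Optional[float]:
    """Concordance via data source agreement: share of patients found in both
    sources whose values for `element` are identical.

    Returns None when the two sources share no patients. In this sketch, two
    missing values compare as equal; a stricter check could exclude them.
    """
    shared = [pid for pid in source_a if pid in source_b]
    if not shared:
        return None
    agreeing = sum(
        1 for pid in shared if source_a[pid].get(element) == source_b[pid].get(element)
    )
    return agreeing / len(shared)


# Toy data: patient identifier -> record (all values invented for illustration).
ehr = {
    "p1": {"sex": "F", "birth_year": 1950},
    "p2": {"sex": "M", "birth_year": None},
    "p3": {"sex": "F", "birth_year": 1972},
}
registry = {"p1": {"sex": "F"}, "p2": {"sex": "F"}, "p4": {"sex": "M"}}

presence = element_presence(list(ehr.values()), "birth_year")  # 2 of 3 records
agreement = data_source_agreement(ehr, registry, "sex")  # 1 of 2 shared patients
print(f"birth_year presence: {presence:.2f}, sex agreement: {agreement:.2f}")
```

Both functions return a proportion between 0 and 1, so results for different data elements or sources can be compared directly. The thresholds that count as acceptable would depend on the intended use of the data.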

Related work harmonised a greater number of data quality terms to design a more complex conceptual framework for defining whether EHR data are ‘fit’ for specific uses [Ref]. This DQA framework comprises three broad categories: (i) conformance to specified standards or formats; (ii) completeness, which evaluates the frequency of data attributes within a data set without reference to the data values; and (iii) plausibility with respect to a range or distribution of data values. Together, these categories contain seven subcategories: (ia) value conformance; (ib) relational conformance; (ic) computational (calculation) conformance; (ii) completeness; (iiia) uniqueness plausibility; (iiib) atemporal plausibility; and (iiic) temporal plausibility. All of these apply to both metadata elements and data values, with the exception of computational conformance, which applies only to data values. A review of cardiac failure research study guidelines led to the identification of six categories of frequently used and clinically meaningful phenotypic data elements: (i) demographics; (ii) physical examination or baseline observation; (iii) diagnostic tests; (iv) patient medical history; (v) clinical diagnoses or presentation; and (vi) medications. These enabled the assembly of an inventory framework for the data elements, organised by the six categories of phenotype [Ref]. This illustrates how the DQA framework can be applied to research studies, in this case in cardiac failure.
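As an illustration of how the three categories above differ in practice, the sketch below checks a single record for completeness, value conformance, atemporal plausibility and temporal plausibility. It is a simplified example under stated assumptions, not the published framework. The field names (`sex`, `birth_date`, `diagnosis_date`, `lvef_percent`), the allowed value set and the numeric range are hypothetical and would be replaced by the standards agreed for a given study.

```python
from datetime import date
from typing import Any, List, Mapping

# Hypothetical value set and required elements, for illustration only.
ALLOWED_SEX = {"F", "M", "U"}
REQUIRED_ELEMENTS = ("sex", "birth_date", "diagnosis_date", "lvef_percent")


def check_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of data quality issues found in one record."""
    issues: List[str] = []

    # Completeness: is each required element present? Values are not inspected.
    for element in REQUIRED_ELEMENTS:
        if record.get(element) is None:
            issues.append(f"completeness: {element} is missing")

    # Value conformance: does the value belong to the agreed value set?
    sex = record.get("sex")
    if sex is not None and sex not in ALLOWED_SEX:
        issues.append("value conformance: sex is not in the allowed value set")

    # Atemporal plausibility: is the value within a believable range?
    # A percentage outside 0-100 is used here as an illustrative bound.
    lvef = record.get("lvef_percent")
    if lvef is not None and not 0 <= lvef <= 100:
        issues.append("atemporal plausibility: lvef_percent is outside 0-100")

    # Temporal plausibility: do dates occur in a believable order?
    birth, diagnosis = record.get("birth_date"), record.get("diagnosis_date")
    if birth is not None and diagnosis is not None and diagnosis < birth:
        issues.append("temporal plausibility: diagnosis_date precedes birth_date")

    return issues


# Invented example records.
good = {"sex": "F", "birth_date": date(1950, 1, 1),
        "diagnosis_date": date(2015, 6, 1), "lvef_percent": 35}
bad = {"sex": "X", "birth_date": date(1950, 1, 1),
       "diagnosis_date": date(1940, 1, 1), "lvef_percent": 140}

for name, rec in (("good", good), ("bad", bad)):
    print(name, check_record(rec))
```

Keeping each check tagged with its category makes it straightforward to report results per category. Relational conformance, computational conformance and uniqueness plausibility need access to the whole data set and are left out of this single-record sketch.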


Wider applicability of DQA to clinical data in addition to FAIR (FAIR+Q)

The DQA framework approach can be applied to other types of clinical data, such as those found in clinical trial registries, health claims databases, and health information exchanges. This approach can also be applied more broadly by compiling relevant harmonised terms (e.g., from existing vocabularies, ontologies, or natural language processing) as the starting point to identify the most relevant dimensions of data quality and methods for assessment, many of which are likely to be shared within the clinical domain.


FAIRification combined with quality assessment for clinical data

Although making clinical data FAIR is likely to release more value, this will probably be insufficient, because data quality is addressed only indirectly, through unspecified provenance and community standards. Quality assessment is crucial when clinical trial and healthcare data are submitted to regulators to demonstrate the efficacy and safety of a new treatment. It is therefore not surprising that quality assessment for clinical data, especially in EHRs, has matured, as described in the previous section. Here, we argue that applying both the FAIR data metrics (maturity indicators) and data quality assessment (FAIR+Q) to clinical trial and healthcare data sets would have a greater impact. The process of FAIR+Q assessment, followed by enhancement of metadata and selection of quality data sets and guided iteratively by use cases, is likely to release maximum value from clinical data, while satisfying the rigour of regulatory submissions and the vital decision-making of healthcare services.
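The argument above can be illustrated with a short, hypothetical sketch of the selection step. It assumes that each data set has already been given two summary scores between 0 and 1: one for the share of FAIR maturity indicators passed and one for the share of data quality checks passed. The class name, function name, scores and thresholds are all invented for illustration; a real use case would define its own.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Assessment:
    """Summary of a FAIR+Q assessment for one data set (hypothetical structure)."""
    dataset_id: str
    fair_score: float     # assumed: share of FAIR maturity indicators passed
    quality_score: float  # assumed: share of data quality checks passed


def select_fit_for_use(
    assessments: Sequence[Assessment], min_fair: float, min_quality: float
) -> List[str]:
    """Return the identifiers of data sets that meet both use-case thresholds."""
    return [
        a.dataset_id
        for a in assessments
        if a.fair_score >= min_fair and a.quality_score >= min_quality
    ]


# Invented scores for three data sets.
assessments = [
    Assessment("trial-A", fair_score=0.90, quality_score=0.95),
    Assessment("trial-B", fair_score=0.90, quality_score=0.40),  # FAIR, low quality
    Assessment("ehr-C", fair_score=0.30, quality_score=0.90),    # high quality, not FAIR
]

selected = select_fit_for_use(assessments, min_fair=0.7, min_quality=0.8)
print(selected)
```

In this toy example only `trial-A` is selected. `trial-B` would pass a FAIR-only assessment despite its low quality score, which is the gap that FAIR+Q is intended to close. Thresholds would be revisited as use cases evolve, in line with the iterative process described above.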

Clinical data governance at the study level

What is data governance?

Numerous definitions can be found; below are three typical examples:

  1. DATAVERSITY (https://www.dataversity.net/what-is-data-governance/#)

    1. Data governance is a collection of components – data, roles, processes, communications, metrics, and tools – that help organisations formally manage and gain better control over data assets. As a result, organisations can best balance security with accessibility and be compliant with standards and regulations while ensuring data assets go where the business needs them most.

  2. Wikipedia (https://en.wikipedia.org/wiki/Data_governance)

    1. Data governance is a data management concept concerning the capability that enables an organisation to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives.

  3. Talend (https://www.talend.com/resources/what-is-data-governance/)

    1. Data governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organisation to achieve its goals. It establishes the processes and responsibilities that ensure the quality and security of the data used across a business or organisation. Data governance defines who can take what action, upon what data, in what situations, using what methods.


It is clear from these definitions that data governance is closely related to data management, although the two are distinct, as illustrated in Figure 12. The figure shows how data governance in an organisation often has a “top-down” orientation, where senior management has ultimate responsibility for the ownership of all the assets of the organisation, including its data. Senior management makes use of a well-defined strategy for data governance, which includes key players and business processes. The key players for data governance are: (1) the data stakeholders, who benefit from and depend on the data; (2) the data council, who define the quality necessary for the data assets (see below); and (3) the data architect (or architecture team), who design the appropriate infrastructure for storage and management of the data assets. These players define and maintain the data policies and rules of engagement. This will likely include formalising and recording the mission (Why), the focus areas, e.g. use cases (What), and the data policies, rules, metrics and definitions (How).

Governing the quality of the data assets is a central purpose of the data council. The council defines the scope, audits the data assets, and defines quality control and ownership, all of which are published and reviewed periodically to sustain the longevity of the quality data assets.

FAIR data management can be seen to have a “bottom-up” orientation in relation to data governance, where the key players are the data stewards, who facilitate the best practice that drives the FAIR data and metadata life cycle, as illustrated in Figure 12. The data stewards (or equivalent role) will need to work closely with the data council and the data stakeholders to make all of this work as a coherent system, serving the business of organisations such as pharmaceutical companies.

Figure 12: How Data Governance relates to FAIR Data Management in a typical large enterprise, such as a biopharmaceutical company.


How does data governance relate to FAIR clinical data at the study level?

A number of recommendations for improving the quality of rare disease (RD) registries were published in 2018 by Kodra and coauthors ( ). They describe a framework for quality management of RD registries that includes the establishment of a good governance system and the construction of a suitable computing infrastructure that complies with the FAIR principles (see Figure 1 in the review paper).

Five recommendations for the governance of a rare disease registry are given: (1) define clear objectives to inform the design of the registry database; (2) identify and engage with the relevant stakeholders at an early stage; (3) build a registry team with clear roles and responsibilities, in proportion to the registry size, ambitions and objectives; (4) build a solid framework to ensure compliance with ethical and legal requirements; and (5) ensure that the required budgets have been evaluated, so that the registry is well resourced for a predefined period.

Recommendations such as these should take account of registry scope: at the study level, only protocol methods and summary results are disclosed, rather than personal patient data. Therefore, for this FAIR4Clin guide, which is limited to the study level, FAIR implementation is relevant only for the description of protocol methods (metadata) and the summary of results (data) submitted to the ClinicalTrials.gov registry, as we have already described.