Clinical study data process through the FAIR lens

Data collection in a clinical study is usually highly regulated and therefore follows a sequential process. It starts with planning and preparation, continues with obtaining legal approvals and conducting the study, and ends with primary and secondary data analysis (see Figure 4). In this section we briefly discuss each step, highlighting which aspects of FAIR are important to consider and how FAIR data can be beneficial in later steps of the process.

Figure 4: adapted from

Planning and preparation of the study

Prospective FAIRification means moving a number of processes upstream, before data collection starts. How data is collected should be defined in the FAIR Data Management Plan, with the FAIR principles implemented so that the data and metadata can be re-used in future. For example, domain-specific standards and vocabularies should be clearly identified and incorporated, as the FAIR principles require for interoperability.

FAIR implementation: Interoperability, use of syntax standards and vocabularies

Approvals, permissions and agreements

Before a clinical study can go ahead, a range of approvals (e.g. regulatory, ethical) needs to be obtained and potential restrictions (e.g. on access and data use) need to be clarified so that an informed consent form can be created. Data management plans need to define storage, retention periods and conditions of access, and consider aspects of re-use and data sharing. The legal framework varies between countries (e.g. the EU GDPR, Articles 35 and 36 on data protection impact assessment), so important considerations need to be weighed. Being able to express such constraints in machine-readable form is an important FAIRification task, as it has a significant impact on the possibilities for data reuse.

FAIR implementation: Accessibility, open communication protocol standard(s) to control access.

FAIR implementation: Reusability, consider licences and consent to allow the re-use of data and metadata from a legal perspective. The FAIR cookbook recipe on representing permitted use using open standards ( ) provides a good starting point with hands-on examples.
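To illustrate what expressing such constraints in machine-readable form could look like, here is a minimal Python sketch. It is not taken from the FAIR cookbook recipe. The class DataUsePolicy, its fields, the purpose and region labels and the dataset identifier are all hypothetical, and a real implementation would most likely rely on an open standard vocabulary rather than free-text labels.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataUsePolicy:
    """Hypothetical machine-readable summary of consent and licence terms."""
    dataset_id: str
    permitted_purposes: frozenset  # purposes covered by the consent form
    permitted_regions: frozenset   # jurisdictions the data may be accessed from
    retention_until: date          # end of the agreed retention period
    licence: str                   # licence label, free text in this sketch


def is_request_permitted(policy, purpose, region, on_date):
    """Return True only if every constraint in the policy is satisfied."""
    return (
        purpose in policy.permitted_purposes
        and region in policy.permitted_regions
        and on_date <= policy.retention_until
    )


# All values below are illustrative assumptions, not real study terms.
policy = DataUsePolicy(
    dataset_id="study-001",
    permitted_purposes=frozenset({"disease-specific-research"}),
    permitted_regions=frozenset({"EU"}),
    retention_until=date(2035, 12, 31),
    licence="custom-data-use-agreement",
)

print(is_request_permitted(policy, "disease-specific-research", "EU", date(2030, 1, 1)))
print(is_request_permitted(policy, "commercial-use", "EU", date(2030, 1, 1)))
```

Once the constraints are stored as data rather than only in a signed document, an access request can be screened automatically before a human reviewer looks at it.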

Study conduct and data collection 

Once the study is under way and data collection has been initiated, more technical aspects come to the fore. How is the data referenced? How and where is the data actually stored? Depending on the type of process generating the data, different infrastructures might be used (see also part I “A multiplicity of clinical data types”). Primary resources for healthcare and clinical research can be identified as follows:

  • Data generated as part of routine healthcare processes: Electronic Healthcare Records, Radiology, Pathology, Genomics, health insurance claims etc. 

  • Data generated as part of (interventional) clinical research: Clinical Trials / Studies

  • Data generated for research and/or quality monitoring purposes: Observational Registries / Databanks / Biobanks

  • Data generated by patients as part of self-monitoring via medical devices or patient reported outcomes.

As study sponsors often rely on contract research organisations (CROs) to collect data, especially in multicentric and multimodality studies, care should be taken to specify how data should be provided and handled. This often means going down to the specifics of data formats, controlled vocabulary choices and validation pipelines, to ensure consistency from the start and to avoid added downstream curation costs when reconciling different data sources. Interacting with CROs is covered in a recipe available from the FAIR cookbook ( ), which was contributed by Novartis AG.
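As a rough illustration of what specifying data format, controlled vocabulary and validation might mean in practice, the following Python sketch checks an incoming record against a small specification. The field names, the allowed values and the SPEC structure are assumptions made for this example and do not come from the recipe mentioned above.

```python
from datetime import datetime

# Hypothetical data specification agreed between sponsor and CRO.
SPEC = {
    "subject_id": {"required": True},
    "visit_date": {"required": True, "format": "%Y-%m-%d"},
    "sex": {"required": True, "allowed": {"F", "M", "U"}},
}


def validate_record(record, spec=SPEC):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, rule in spec.items():
        value = record.get(name)
        if value is None or value == "":
            if rule.get("required"):
                problems.append(f"{name}: missing required value")
            continue
        # Controlled vocabulary check
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: {value!r} not in controlled vocabulary")
        # Date format check
        if "format" in rule:
            try:
                datetime.strptime(value, rule["format"])
            except ValueError:
                problems.append(f"{name}: {value!r} does not match {rule['format']}")
    return problems


good = {"subject_id": "S1", "visit_date": "2024-03-01", "sex": "F"}
bad = {"subject_id": "S1", "visit_date": "01/03/2024", "sex": "female"}
print(validate_record(good))
print(validate_record(bad))
```

Running the same checks at the point of data entry and again on delivery gives both parties a shared, testable definition of what a conforming record is.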

FAIR implementation: Findability, an identifier strategy is essential to support future findability of data and metadata in a system (see Pistoia’s FAIR Toolkit, Adoption and Impact of an Identifier Policy, as an example of an identifier strategy).
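One possible element of an identifier strategy is to issue opaque, globally unique identifiers under a stable namespace and to record every identifier that has been issued. The Python sketch below shows this under stated assumptions: the base URL, the entity kinds and the function names are hypothetical and are not taken from the Pistoia Alliance material.

```python
import re
import uuid

# Hypothetical namespace; a real policy would use a resolvable base URL
# owned by the organisation.
BASE = "https://example.org/study-001/"
KINDS = {"dataset", "subject", "sample"}
ID_PATTERN = re.compile(
    r"^https://example\.org/study-001/(dataset|subject|sample)/[0-9a-f]{32}$"
)


def mint_identifier(kind, registry):
    """Create a new opaque identifier and record it so it is never issued twice."""
    if kind not in KINDS:
        raise ValueError(f"unknown entity kind: {kind!r}")
    identifier = f"{BASE}{kind}/{uuid.uuid4().hex}"
    if identifier in registry:  # extremely unlikely with UUID4, checked for safety
        raise RuntimeError("identifier collision")
    registry.add(identifier)
    return identifier


def is_well_formed(identifier):
    """Check that an identifier follows the agreed pattern."""
    return ID_PATTERN.match(identifier) is not None


registry = set()
sample_id = mint_identifier("sample", registry)
print(sample_id, is_well_formed(sample_id))
```

The identifier carries no meaning of its own (no site code, no date), so it does not need to change when the metadata about the entity changes.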

FAIR implementation: Interoperability, convergence on terminology (e.g. LOINC for laboratory test coding).
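Convergence on terminology can be pictured as a curated mapping from site-specific laboratory codes to a shared code system. In the Python sketch below, the site names and local codes are hypothetical, while 718-7 and 2345-7 are LOINC codes for hemoglobin in blood and glucose in serum or plasma (mass/volume). The mapping table and the function harmonise are illustrative only.

```python
# Site names and local codes are hypothetical; the LOINC codes are real.
LOCAL_TO_LOINC = {
    ("site_a", "HGB"): "718-7",       # Hemoglobin [Mass/volume] in Blood
    ("site_b", "HB_BLOOD"): "718-7",  # same test, different local code
    ("site_a", "GLU"): "2345-7",      # Glucose [Mass/volume] in Serum or Plasma
}


def harmonise(results):
    """Split results into LOINC-coded rows and rows that still need curation."""
    coded, unmapped = [], []
    for row in results:
        loinc = LOCAL_TO_LOINC.get((row["site"], row["local_code"]))
        if loinc is None:
            unmapped.append(row)
        else:
            coded.append({**row, "loinc": loinc})
    return coded, unmapped


results = [
    {"site": "site_a", "local_code": "HGB", "value": 13.5},
    {"site": "site_b", "local_code": "HB_BLOOD", "value": 12.9},
    {"site": "site_b", "local_code": "CRP", "value": 4.0},
]
coded, unmapped = harmonise(results)
print([row["loinc"] for row in coded])
print(unmapped)
```

Rows without a mapping are returned separately rather than dropped, so the gap in the terminology mapping stays visible to curators.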

Data curation / harmonisation processes

In a highly restricted and regulated environment, reviewing data quality and curating the (raw) data is unavoidable, even though data quality per se is not a dimension of FAIR (see the publication “Maximizing data value for biopharma through FAIR and quality implementation: FAIR plus Q”). How much effort is needed to curate the associated metadata depends on the care taken and on the data entry validation procedures laid out during the planning phase and applied during data collection. Disparate clinical data sources might have to be harmonised for the purpose of healthcare quality assessments. Besides semantic interoperability, technical interoperability might be an obstacle when integrating multiple healthcare systems across geographical entities and jurisdictions, as in multi-centric, multi-country trials. In such a setup, data may have to be withheld in order to comply with local regulations.

FAIR benefit: Prospective FAIRification reduces the manual effort needed to curate datasets, because annotation requirements and validation rules are embedded in data acquisition systems and the procedures have been shared with and explained to the parties involved, including CROs. (See Pistoia’s FAIR Toolkit use case FAIRification of clinical trial data as an example of retrospective FAIRification, and the use case Prospective FAIRification of Data on the EDISON platform as an example of prospective FAIRification.)

Analysis of data

Analysis of the collected data might cover drug efficacy and adverse events during the study, or health and safety monitoring and signal detection. Complex analyses might involve multiple data sources or systems.

FAIR benefit: FAIR data significantly reduces the effort needed to integrate and combine data from different sources.
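The reason integration becomes cheaper can be sketched in a few lines of Python: when every source uses the same subject identifier and the same field names, combining them is a simple keyed merge with no reconciliation step. The sources, field names and values below are invented for illustration.

```python
from collections import defaultdict


def combine_by_subject(*sources):
    """Merge records from several sources on a shared subject identifier.

    If two sources provide the same field for a subject, the later source wins.
    """
    combined = defaultdict(dict)
    for source in sources:
        for record in source:
            fields = {k: v for k, v in record.items() if k != "subject_id"}
            combined[record["subject_id"]].update(fields)
    return dict(combined)


# Hypothetical extracts from two systems that share identifiers and field names.
trial_data = [{"subject_id": "S1", "arm": "treatment"}, {"subject_id": "S2", "arm": "placebo"}]
lab_data = [{"subject_id": "S1", "loinc_718-7": 13.5}]

print(combine_by_subject(trial_data, lab_data))
```

Without shared identifiers and vocabularies, the same task would first require a manual mapping of subjects and fields between the systems.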


Clinical studies, if promising, are submitted for publication or to the authorities. After the primary submission, flexible reporting remains crucial. Companies have to engage with the authorities, for example, when a label changes once the drug is on the market or when a drug is repurposed.

FAIR benefit: Integrated data supports reporting and question answering 

Data follow-up activities 

The main purpose of the FAIR guiding principles is to enable and facilitate the reuse of data. Follow-up activities such as sharing data in private or public data catalogues, anonymising data or re-using datasets for research purposes are very common use cases, and should be considered by stakeholders and in the data management plan.

FAIR benefit: Following the FAIR principles supports the longevity of data and enables secondary use by providing clear data licences.