Clinical study design and FAIR implementation

Good scientific practice for life science research and development requires careful planning and design for the collection of data from any source, such as laboratory experiments, clinical studies, databases and scientific literature. A Data Management Plan (DMP) is an essential element in the process to manage the FAIR data life cycle. It will document the plan for the dataset and associated metadata in specific terms of what, how, who and when. This process will support making the data FAIR by design and is highly relevant to clinical study design. More details on the DMP method can be found in the Pistoia Alliance FAIR Toolkit:

Clinical studies are traditionally described in human-readable form as protocol documents, which represent one of the main types of clinical trial descriptors. Typically, a protocol for a clinical trial contains a study description, usually available as free text, a specification of the nature of the interventions and how the trial will be conducted, as well as, for example, safety concerns and other ethical considerations. However, regulators, funding agencies and publishers now mandate a move away from textual narrative towards much more syntactically and semantically structured reporting. From a general information management standpoint, and more specifically from a FAIR implementation viewpoint, “free text” is problematic: without elaborate methods such as text mining and manual curation to structure and meaningfully annotate a study, key information lacks machine actionability. Therefore, submitting a clinical trial to the regulatory authorities or uploading a clinical study to one of the public repositories requires structuring the data through the mandatory use of a designated format specification. For instance, in the context of regulatory submission, the CDISC SDTM clinical standard is mandated.

However, adhering to these specifications is not enough to achieve a FAIR maturity level that enables full machine actionability (see section 3 of this guide), and computable knowledge remains out of reach for software agents even when the text description has a standardised structure. To demonstrate the problem, we considered the following trial as our test bed: “Effect of Propolis or Metformin Administration on Glycemic Control in Patients With Type 2 Diabetes Mellitus” available from (link).

Glycemic control is naturally the principal topic of diabetes and complications that can be developed as a consequence of loss of sensitivity to perceive insulin signals by the cell. The glycemic control goals established by the ADA are: glycosylated hemoglobin (A1C) <7.0%, fasting plasma glucose 80-130 mg/dL and casual plasma glycemia <180 mg/dL. The first-line treatment in patients of recent diagnosis is metformin, however, studies have shown that propolis, a resinous balsamic material collected by the Apis mellifera bee, from sprouts, exudates of trees and other parts of the plants, represents a very important and promising natural alternative in medicine, which can be considered as an antidiabetic agent.

The aim of this study is to evaluate the effect of propolis or metformin administration on glycemic control in patients with type 2 Diabetes Mellitus without pharmacological treatment. The investigators hypothesis is that propolis or metformin administration, modify the glycemic control in patients with type 2 Diabetes Mellitus without pharmacological treatment.

Figure 2: This study description is taken from

The initial examination unsurprisingly reveals that the “Brief summary” of the trial is human readable and, more interestingly, that the same information can also be accessed programmatically via an API that returns a JSON file in response to an HTTPS request. Closer inspection, however, reveals that the web pages for human consumption provide only basic metadata markup, relying on the OpenGraph protocol. Search engine optimisation (SEO) using markup is absent, which may somewhat limit discoverability and findability.
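The kind of markup inspection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure the authors used: the sample HTML is invented, the names `MarkupInspector` and `inspect_markup` are hypothetical, and embedded JSON-LD is used only as one example of richer structured markup that a search engine could index.

```python
from html.parser import HTMLParser


class MarkupInspector(HTMLParser):
    """Collect OpenGraph properties and count JSON-LD script blocks in a page."""

    def __init__(self):
        super().__init__()
        self.opengraph = {}     # maps e.g. "og:title" to its content value
        self.jsonld_blocks = 0  # number of <script type="application/ld+json"> tags

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.opengraph[prop] = attributes.get("content") or ""
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            self.jsonld_blocks += 1


def inspect_markup(html):
    """Summarise which kinds of metadata markup an HTML page carries."""
    inspector = MarkupInspector()
    inspector.feed(html)
    return {
        "opengraph": inspector.opengraph,
        "has_jsonld": inspector.jsonld_blocks > 0,
    }


# Invented sample page: OpenGraph tags only, no richer structured markup.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="Example trial record">
<meta property="og:type" content="website">
</head><body><p>Brief summary ...</p></body></html>
"""

report = inspect_markup(SAMPLE_PAGE)
print(report)
```

A page that carries only OpenGraph tags gives a non-empty `opengraph` dictionary and `has_jsonld` set to `False`, which corresponds to the situation described in the paragraph above.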

We then focused on the data submission process. In order to submit the trial to , the (meta)data has to be submitted in a structured way, as a set of key/value pairs, which can be presented as a table containing the study protocol and summary results, in addition to the free text (see below).

Figure 3: Structured descriptive Information (i.e. structured metadata) for . Structured summary results are available as well in a separate tab.

Although this provides structured text, it is still insufficient to qualify as FAIR. Mature implementation of the FAIR guiding principles goes much further: for instance, each key should be associated with an entity from a semantic model via a uniform resource identifier (URI) that is globally unique, persistent and resolvable (GUPRI), and the associated value should also be marked up for indexing by search engines.
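To make the idea of associating each key with a URI more concrete, the following Python sketch turns a small key/value record into a JSON-LD style document. It is an illustration under assumptions rather than a prescribed implementation: the field names, the `annotate` helper and the `https://example.org/terms/...` URIs are hypothetical placeholders, and a real implementation would use identifiers from an established semantic model.

```python
import json

# Hypothetical mapping from field names to term URIs.
# The example.org URIs are placeholders, not real vocabulary terms.
KEY_TO_URI = {
    "Condition": "https://example.org/terms/condition",
    "Intervention": "https://example.org/terms/intervention",
}


def annotate(record, key_to_uri):
    """Build a JSON-LD style document from a key/value record.

    Keys without a known URI are returned separately instead of being
    guessed, so that a curator can map them explicitly.
    """
    context = {}
    body = {}
    unmapped = []
    for key, value in record.items():
        uri = key_to_uri.get(key)
        if uri is None:
            unmapped.append(key)
            continue
        context[key] = uri  # the key now resolves to a term URI
        body[key] = value
    return {"@context": context, **body}, unmapped


# Illustrative record; the field names are assumptions, not the registry's schema.
record = {
    "Condition": "Type 2 Diabetes Mellitus",
    "Intervention": "Propolis",
    "Study Type": "Interventional",
}

document, unmapped_keys = annotate(record, KEY_TO_URI)
print(json.dumps(document, indent=2))
print("Keys still lacking a URI:", unmapped_keys)
```

Reporting unmapped keys rather than dropping them silently keeps the gap visible: any key that lacks a URI is precisely the kind of information that stays opaque to software agents.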

With these observations in mind, and with the knowledge that FAIRifying data and metadata retrospectively can be very costly and time consuming (see the Roche use case in the Pistoia Alliance FAIR Toolkit - FAIRification of clinical trial data), the following sections expand on the notion of prospective FAIRification of data and metadata. This approach sets out from the start to implement the FAIR guiding principles by design in order to facilitate secondary reuse; it saves costs, ensures efficiency and supports the longevity of data and metadata. We explore the key aspects of prospective FAIRification of data and metadata from clinical studies in the following chapter, “Clinical study data process through the FAIR lens”.