Q&A for FAIR by Design webinars by Mathew Woodwark @AZ, Erik Schultes @GO-FAIR & Georges Heiter @Databiology

Questioner

Question Asked

Answers (from Ian Harrow @Pistoia or Mathew Woodwark @AZ)

1

How do we get from IT fixing the data discrepancies to business ownership and drive on the information structure and stewardship?

Ian: This is likely to require a dual strategy. Top-down, senior management needs to buy in, investing in the necessary infrastructure and visibly recognising FAIR data as a valuable corporate asset. Equally important is bottom-up FAIR data management by scientists, supported by local data stewards and appropriate policies that define best practice.

2

A question for Mathew. Thank you for your presentation. You mentioned "Compliance by Design" in the AZ model. Considering GDPR data privacy, I am curious how you considered the storage location for your data lake (containing multi-country patient data) and the controls on processing by data scientists from multiple countries?

Mathew: With GDPR, as with other regulations that go across multiple countries where approaches may vary by country, we hold ourselves to the higher standard. In the case of data location for Science Data Foundation, we place our cloud data storage in the EU. The data for secondary use is subsetted to remove "sensitive country" data (i.e. data from countries where the national regulation is that this data can only be used for the purpose of the original trial or programme of trials).
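
As an illustration only (not AZ's implementation), the subsetting step described above could be sketched as a simple filter. The country codes, field names and the choice to exclude records with no recorded country are all assumptions made for this example.

```python
# Hypothetical placeholder codes; which countries restrict secondary use
# is a legal determination, not something this sketch decides.
SENSITIVE_COUNTRIES = {"XX", "YY"}


def subset_for_secondary_use(records, sensitive_countries=SENSITIVE_COUNTRIES):
    """Keep only records that may be reused beyond the original trial.

    Records without a country are excluded too (an assumption made here,
    in the spirit of holding to the higher standard).
    """
    return [
        r for r in records
        if r.get("country") is not None
        and r["country"] not in sensitive_countries
    ]


records = [
    {"subject_id": "S1", "country": "AA"},
    {"subject_id": "S2", "country": "XX"},  # restricted to primary use
    {"subject_id": "S3", "country": "BB"},
    {"subject_id": "S4"},                   # country unknown, so excluded
]
secondary = subset_for_secondary_use(records)
```

Here `secondary` holds only the records for S1 and S3.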

3

A question for Mathew. Considering evolving external (non-AZ) datasets, have you taken the approach of ingesting and maintaining the external dataset, or simply mapping metadata pointers in the data catalog and reading the external data from source upon request?

Mathew: Both approaches have been taken. We have developed the ability, with a third party, to perform cross-repository querying, including internal and external sources. Where it makes sense and we have the right to do so, we ingest external data into the lake, if the efficiency gain justifies the effort. We can then set up periodic updates to copy the delta across. There will be many cases where ingestion into the lake is not the path we take, however.
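
To make "copy the delta across" concrete, here is a minimal sketch that assumes each record carries an identifier and a version (or checksum). The function names and the dictionary representation are hypothetical, chosen for illustration, and do not describe the actual ingestion tooling.

```python
def compute_delta(source, lake):
    """Compare two snapshots, each a dict of record id -> version or checksum.

    Returns (upserts, deletes): records that are new or changed in the
    source, and ids that exist in the lake but no longer in the source.
    """
    upserts = {rid: ver for rid, ver in source.items()
               if rid not in lake or lake[rid] != ver}
    deletes = [rid for rid in lake if rid not in source]
    return upserts, deletes


def apply_delta(lake, upserts, deletes):
    """Return a new lake snapshot with the delta applied."""
    updated = dict(lake)
    updated.update(upserts)
    for rid in deletes:
        updated.pop(rid, None)
    return updated


source = {"a": 1, "b": 2, "c": 1}   # external repository today
lake = {"a": 1, "b": 1, "d": 1}     # copy from the previous update
upserts, deletes = compute_delta(source, lake)
refreshed = apply_delta(lake, upserts, deletes)
```

Only the changed record `b`, the new record `c` and the removal of `d` are transferred, rather than the full dataset.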

4

Great presentation. How did AZ decide which dataset to start the FAIRification process for? Does FAIRification require data to be stored in a graph database?

Ian: FAIRification does not necessarily require a graph database. Mathew: FAIRification prioritisation is decided based on user demand.

5

How long does it take to FAIRify a dataset, in terms of time and effort?

Ian: Time to FAIRify depends on the FAIR objectives which will determine the most important and feasible improvements. See https://fairtoolkit.pistoiaalliance.org for more info about FAIR use cases and methods.

6

You mentioned "Knowledge map to bring the data together, conceptually". Can you expand on that please: the format of the knowledge map, how you build it and how you maintain it? Which tools do you use to automate curation?

Mathew: The knowledge map was built in RDF, originally based on a schema for an exploratory biomarker database developed over many years. The map was designed by an internal project team with the help of third parties, based on the data we had and the questions we wanted to answer. The map is expanded when we have new data types or new questions. We do not try to model the whole of biology!

7

How have you tried to assess the costs and benefits of gathering all the different metadata fields you've considered? It must cost to curate them or derive them automatically from potentially multiple stakeholders and sources. Have you found a sweet spot between gathering too little and gathering too much?

Mathew: It is a never-ending debate. Establishing a platform such as this will incur relatively high costs, but these can be brought down through standardisation and automation of processes. As to the sweet spot, this will be driven by our ability to answer questions, and we have to learn as we go.

8

Question for Mathew: could you please elaborate on AZ's "Data Science Academy" that's due to launch in 6 months?

Mathew: This is an internal training programme, aimed at raising awareness and skills for data science, machine learning and AI for R&D.

9

How do you manage citizen/patient consent when their data is combined for AI by FAIR data providers, leading to new insights that impact their health value?

Mathew: We are establishing an AZ-wide AI ethics policy to address exactly these types of issue. So far, the data has been combined for AI, not by AI, but we want to be ready for that eventuality.

10

You could say quality is implicit in the Reusable principle and that compliance with all the FAIR principles might improve data quality, but as Georges highlighted in his FAIR applications slide, quality is something additional and complementary to the FAIR principles. In my experience, many data analysis problems are related not to FAIR but to a lack of quality. What are you doing to assess and validate data quality? What (semi)automatic methods can we use to assess quality?

Mathew: This is right, of course. What I meant was that reusability depends on quality: if you do not have quality at source (i.e. metadata captured as part of the business process, with validation and QC along the way), interoperability and thus reusability suffer. We are working with core capability leads to implement quality at source where possible, and are looking at RDM/MDM solutions to enable us to map data to models for the data we receive on completion of the experiment (essentially mapping to a standard model). Data QC is a separate challenge.

11

How do the FAIR principles and FAIR digital objects co-exist with other open standards such as Decentralized Identifiers being developed by W3C?

Ian: FAIR principles are guidelines rather than standards. FAIR digital objects are prototypical machine-readable artifacts that follow the FAIR guidelines.

12

This comes across as a very well-thought-out program/initiative, but these always run into challenges. If you could wave a magic wand to change anything about your approach, what about it would you change to improve its impact or effectiveness for AZ?

Mathew: If it was a really good magic wand, I'd use it to extract tacit knowledge about the experiment or trial from the minds of the data generators in a standardised and structured form. If it was slightly more powerful still, I'd use it to get people to agree on identifier formats.

13

How do you handle the legacy data, as this will be variable? What have been the major obstacles you have found in your approach? Digital twin: the twin could be infinite in size. How would you manage this, as not all of the accumulated metadata may be relevant? Semantic enrichment of metadata in the model can lead to false values being identified, which could bias the twin. How do you avoid this?

Mathew: The programme is designed to provide analytics-ready data for ML and AI. If the legacy data is sufficiently structured and conformed, or there is enough interest from R&D to map it to a standard structure, we include it. If not, we don't.

14

For Mathew: Thanks. Interoperability & language: What was your approach to "harmonize" the R&D science vs. the manufacturing floor language? Both Mathew and Georges reference "enrichment": how is this defined in your cases? Can it be semantic enrichment?

Mathew: We are not currently harmonising the R&D and manufacturing worlds holistically, but we do have touch points of alignment. Our top-level common metadata model helps set the framework for alignment. Yes, enrichment is semantic enrichment. We have entry points for defined fields and terms into mapped ontologies that allow us to expand the description of terms in specific contexts.
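
A minimal sketch of what semantic enrichment through a mapped ontology can look like: a field value is resolved to a concept, and the concept supplies a preferred label, synonyms and broader terms. The two-concept ontology, the `EX:` identifiers and the `enrich` function are invented for illustration and do not reflect the ontologies or tooling actually used.

```python
# Toy ontology with hypothetical identifiers: concept id -> description.
ONTOLOGY = {
    "EX:0001": {"label": "neoplasm",
                "synonyms": ["tumour", "tumor"],
                "parent": None},
    "EX:0002": {"label": "lung neoplasm",
                "synonyms": ["lung tumour", "lung tumor"],
                "parent": "EX:0001"},
}

# Entry points: normalised field values mapped to concept ids.
TERM_TO_ID = {
    "lung neoplasm": "EX:0002",
    "lung tumour": "EX:0002",
    "lung tumor": "EX:0002",
}


def enrich(term):
    """Expand a raw field value with its label, synonyms and ancestors."""
    concept_id = TERM_TO_ID.get(term.strip().lower())
    if concept_id is None:
        # Unmapped terms are passed through unenriched rather than guessed.
        return {"term": term, "concept_id": None, "label": None,
                "synonyms": [], "ancestors": []}
    concept = ONTOLOGY[concept_id]
    ancestors = []
    parent = concept["parent"]
    while parent is not None:  # walk up the hierarchy to the root
        ancestors.append(ONTOLOGY[parent]["label"])
        parent = ONTOLOGY[parent]["parent"]
    return {"term": term, "concept_id": concept_id,
            "label": concept["label"],
            "synonyms": list(concept["synonyms"]),
            "ancestors": ancestors}


enriched = enrich("Lung Tumour")
```

Leaving unmapped terms unenriched, as this sketch does, is one way to limit the risk of introducing false values during enrichment.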

15

This is a question for Mathew: can you elaborate on the particular reasons why AZ doesn't follow the GO-FAIR guidelines as presented by Erik, please?

Mathew: If I have interpreted this correctly, this refers to the published FAIR principles in Erik's deck? The main difference I highlighted is that, for us, Accessible means that, for regulatory reasons, we know who has accessed what data for what reason, rather than Accessible meaning that the data is available externally. AZ has an obligation to keep certain data confidential.
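
One way to picture this reading of Accessible is an access layer that refuses to return data unless it can record who asked, for what, and why. The class below is a hypothetical sketch under that assumption, not a description of AZ's systems; `AuditedStore` and its fields are illustrative names.

```python
from datetime import datetime, timezone


class AuditedStore:
    """In-memory store that logs every read with user, dataset and reason."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)
        self.audit_log = []

    def read(self, user, dataset_id, reason):
        # A stated reason is mandatory; an empty one is rejected up front.
        if not reason or not reason.strip():
            raise ValueError("a reason for access is required")
        data = self._datasets[dataset_id]  # raises KeyError if unknown
        self.audit_log.append({
            "user": user,
            "dataset_id": dataset_id,
            "reason": reason.strip(),
            "accessed_at": datetime.now(timezone.utc).isoformat(),
        })
        return data


store = AuditedStore({"ds1": [1, 2, 3]})
rows = store.read("alice", "ds1", "exploratory biomarker analysis")
```

After this call, `store.audit_log` holds one entry recording the user, the dataset and the stated reason.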