Boston 2019 Summary and Materials

On October 22, 2019, we ran a series of workshops on FAIR Data Implementation, Best Practices in AI and ML, and DataFAIRy, an emerging project that aims to convert biochemical assays into a FAIR format.

John Overington, Lihua Yu, Al Wang, and Katherine Gibson (left to right) in the evening panel discussion.

Here is a summary of the main points of these discussions.

FAIR Implementation Workshop

Follow this link

Best Practices in AI Workshop

Context: the following ideas form the core of our best practices in AI and ML manifesto paper:

Develop application domain knowledge
Train models on top-quality data
FAIR for data life cycle planning
Publish model code, and testing and training data, along with model results
Use a model versioning system
Select AI/ML methods that fit the problem
Set the right expectations with management, educate colleagues
Combine AI models and human decision-making
Experiment, scale up fast and fail fast
Center of Excellence (CoE) for “moonshot” innovation challenges

Common themes and main points that emerged in the brainstorming session:

Validation and Benchmarks

Are testing and training sets distributed in the same way?
Possible action: create benchmarks for typical tasks of our industry
Priority by demand
Risk of overfitting
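
The first question above, whether testing and training sets are distributed in the same way, can be checked per feature with a two-sample comparison. The following is a minimal sketch, assuming a single numeric feature and using the two-sample Kolmogorov-Smirnov statistic as one possible measure; the function name and sample values are illustrative and were not part of the workshop discussion.

```python
from bisect import bisect_right


def ks_statistic(train, test):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 means the empirical distributions coincide, 1.0 means the samples
    do not overlap at all.
    """
    a, b = sorted(train), sorted(test)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical feature values (e.g. molecular weight) in each split.
train_values = [310.2, 287.9, 402.5, 356.1, 298.4]
test_values = [512.7, 498.3, 530.0, 476.8]
print(ks_statistic(train_values, test_values))
```

A large value for an important feature suggests that the test set probes a different region than the one the model was trained on, so the benchmark result may not reflect real-world performance.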

Common Language

Use language that bench science domain experts can relate to
Process alignment between different parts of an organization
Trans-disciplinary experts

Strategy

When to use AI and when not?
Method selection
Data assessment in organization

What is “quality data”?

Quality dimensions
Profile your data
Context of data – is my data unique?
Is it feasible to use only quality data in model building?
Completeness, Accuracy, Coherence, Timeliness
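
As an illustration of profiling data along one of these dimensions, completeness can be measured as the fraction of records that carry a value for each field. This is a minimal sketch, assuming records are plain dictionaries; the field names (`target`, `ic50_nM`) and the records themselves are hypothetical.

```python
def completeness(records, fields):
    """Return, per field, the fraction of records with a non-missing value.

    A value counts as missing when the key is absent, None, or an empty string.
    """
    if not records:
        raise ValueError("records must be non-empty")
    missing = (None, "")
    return {
        field: sum(1 for r in records if r.get(field) not in missing) / len(records)
        for field in fields
    }


# Hypothetical assay records with gaps.
records = [
    {"target": "EGFR", "ic50_nM": 12.0},
    {"target": "", "ic50_nM": None},
    {"target": "BRAF"},
    {"target": "JAK2", "ic50_nM": 85.5},
]
print(completeness(records, ["target", "ic50_nM"]))
# → {'target': 0.75, 'ic50_nM': 0.5}
```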

Quality Models

Retracted datasets problem
Continuous update of models
Compare AI/ML model to null model
Define success for the model-building effort
Build multiple models and compare on the same benchmark?
Does the hypothesis built into the model match that of the user?
Do other models reflect my users’ hypothesis?
Quality metadata for training
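
The point about comparing an AI/ML model to a null model can be made concrete with a simple baseline. The sketch below assumes a regression task, takes "always predict the training mean" as the null model, and uses mean absolute error as the metric; all three choices, and the numbers, are illustrative assumptions rather than recommendations from the session.

```python
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def improvement_over_null(y_train, y_test, model_pred):
    """Fractional error reduction of a model relative to the null model.

    The null model predicts the training-set mean for every test item.
    1.0 means a perfect model, 0.0 means no better than the null model,
    and negative values mean the model is worse than the null model.
    """
    null_pred = [mean(y_train)] * len(y_test)
    null_mae = mean_absolute_error(y_test, null_pred)
    if null_mae == 0:
        raise ValueError("null model is already perfect on this test set")
    return 1 - mean_absolute_error(y_test, model_pred) / null_mae


# Hypothetical activity values and model predictions.
y_train = [5.0, 6.0, 7.0]
y_test = [5.0, 7.0]
model_pred = [5.5, 6.5]
print(improvement_over_null(y_train, y_test, model_pred))
# → 0.5
```

Reporting this figure alongside the raw metric is one way to define success for the model-building effort before the work starts.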

DataFAIRy Workshop

Context: DataFAIRy project for public bio-assay annotation

The challenge

Convert published assay protocol text into high quality FAIR data
On the order of 10⁵ to 10⁶ published open-access bioassay protocols exist today, excluding patents

Proposal: a collaboratively funded curation initiative

Curation model

NLP + vetted human expert review and public ontologies (BAO and others)
Paying partners access the data immediately; the data are released to the public domain after an embargo period.

Why is this a good idea?

Increases the volume of FAIR scientific data available to partners and, subsequently, to the global scientific community.
Enables more efficient searches, e.g., for the types of assays that have been used for a target or pathway of interest.
Facilitates more efficient bench science (assay conditions, tool compounds).
Facilitates creation of integrated datasets for predictive modelling (e.g., compound toxicity, selectivity and target activity).
Reduces costs compared to each organization paying the full cost of high-quality curation.
Enables automated data mining of assay information.
Provides a real-life example of richly annotated FAIR data generated through collaboration.

Current state:

PoC defined. Fundraising for PoC led by AstraZeneca + Roche
Seeking additional industry partners

Future extensions:

Massive assay protocol processing
Other data beyond open access assay protocols.

Common themes and main points that emerged in the brainstorming session:

Definition of Success

Assay reproducibility
Or, being able to combine results from 2+ independent assays
What is the minimum set of metadata needed for reproducibility?

Selection of Pilot Assays

Need a small, interesting benchmark
Or align assay choice with CROs, Abvance, Charles River, Millipore
Or ADMET?

Minimum Viable Product (MVP) Definition

Do a competitive analysis and review existing systems: BART, BAO, MIABE .. MIAME
What is new in the proposed system in comparison?
Prioritize robust assays
A pilot emphasizing the business case would be better than a technology demonstration
Define target audience

Retrospective or prospective?

We plan to incorporate these ideas into the DataFAIRy project plan and into the best practices paper.

Many thanks to all participants!