Boston 2019 Summary and Materials

On October 22, 2019 we ran a series of workshops on FAIR Data Implementation, Best Practices in AI and ML, and DataFAIRy, an emerging project that aims to convert biochemical assays into a FAIR format.

John Overington, Lihua Yu, Al Wang, and Katherine Gibson (left to right) in the evening panel discussion.



Here is a summary of the main points of these discussions.

FAIR Implementation Workshop

Follow this link

Best Practices in AI Workshop

Context: the following ideas form the core of our best practices in AI and ML manifesto paper:

  1. Develop application domain knowledge

  2. Train models on top-quality data

  3. FAIR for data life cycle planning

  4. Publish model code, and testing and training data, along with model results

  5. Use a model versioning system

  6. Select AI/ML methods that fit the problem

  7. Set the right management expectations and educate colleagues

  8. Combine AI models and human decision-making

  9. Experiment, scale up fast and fail fast

  10. Establish a Center of Excellence (CoE) for “moonshot” innovation challenges
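
Point 5 above (use a model versioning system) can start very small: derive a reproducible version tag from everything that defines a model. The sketch below is a minimal illustration, assuming the model code, a training-data snapshot id, and hyperparameters are available; the function name and inputs are our own, not a prescribed tool.

```python
import hashlib
import json

def model_version(code: str, train_data_id: str, params: dict) -> str:
    """Derive a reproducible version tag from the three things that
    define a model: its code, its training-data snapshot, and its
    hyperparameters. Any change to any of them yields a new tag."""
    payload = json.dumps(
        {"code": code, "data": train_data_id, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = model_version("def fit(): ...", "assays-2019-10-01", {"depth": 4})
v2 = model_version("def fit(): ...", "assays-2019-10-01", {"depth": 5})
print(v1, v2, v1 != v2)  # changing any input yields a different tag
```

Because the tag is a pure function of its inputs, the same code, data snapshot, and parameters always reproduce the same version, which also supports point 4 (publishing code and data alongside results).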

Common themes and main points that emerged in the brainstorming session:

Validation and Benchmarks

  • Are testing and training sets distributed in the same way?

  • Possible action: create benchmarks for typical tasks in our industry

  • Prioritize by demand

  • Risk of overfitting
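
The first bullet above is checkable directly: a two-sample Kolmogorov-Smirnov statistic measures the largest gap between the empirical distributions of a feature in the training and test sets. A minimal, dependency-free sketch (function and variable names are our own, not from the workshop):

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    d, ia, ib = 0.0, 0, 0
    for v in sorted(set(a) | set(b)):
        while ia < n and a[ia] <= v:
            ia += 1
        while ib < m and b[ib] <= v:
            ib += 1
        d = max(d, abs(ia / n - ib / m))
    return d

random.seed(0)
train        = [random.gauss(0.0, 1.0) for _ in range(500)]
test_same    = [random.gauss(0.0, 1.0) for _ in range(500)]
test_shifted = [random.gauss(1.5, 1.0) for _ in range(500)]

print(ks_statistic(train, test_same))     # small gap: similar distributions
print(ks_statistic(train, test_shifted))  # large gap: distribution shift
```

A large statistic on a held-out set is a warning that benchmark results may not transfer, which ties into the overfitting risk noted above.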

Common Language

  • Use language that bench science domain experts can relate to

  • Process alignment between different parts of an organization

  • Trans-disciplinary experts


  • When to use AI and when not?

  • Method selection

  • Data assessment in organization

What is “quality data”?

  • Quality dimensions

  • Profile your data

  • Context of data – is my data unique?

  • Is it feasible to use only quality data in model building?

  • Completeness, Accuracy, Coherence, Timeliness
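
As a starting point for “profile your data”, the completeness dimension can be measured mechanically. A minimal sketch, assuming records are plain dictionaries; the field names and toy values are illustrative only:

```python
def profile_completeness(records, fields):
    """Fraction of records carrying a non-missing value for each field
    (one mechanical check for the 'completeness' quality dimension)."""
    n = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / n
        for f in fields
    }

# Toy assay records; field names are illustrative only.
assays = [
    {"target": "EGFR", "ic50_nM": 12.0, "units": "nM"},
    {"target": "EGFR", "ic50_nM": None, "units": "nM"},
    {"target": "", "ic50_nM": 250.0, "units": "nM"},
]
print(profile_completeness(assays, ["target", "ic50_nM", "units"]))
# target and ic50_nM are 2/3 complete; units is fully populated
```

The other dimensions (accuracy, coherence, timeliness) need reference data or timestamps and are harder to score, but the same per-field profiling pattern applies.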

Quality Models

  • Retracted datasets problem

  • Continuous update of models

  • Compare AI/ML model to null model

  • Define success for the model-building effort

  • Build multiple models and compare on the same benchmark?

  • Does the hypothesis built into the model match that of the user?

  • Do other models reflect my users’ hypothesis?

  • Quality metadata for training
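
The “compare to a null model” point above can be made concrete with a majority-class baseline: a model is only worth deploying if it beats the accuracy of always predicting the most common label. A small sketch (the functions and toy data are our own illustration):

```python
from collections import Counter

def majority_baseline(y_train):
    """Null model: always predict the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda _x: label

def accuracy(predict, X, y):
    return sum(predict(x) == yi for x, yi in zip(X, y)) / len(y)

# Toy imbalanced screen: 80% inactive, 20% active.
y_train = ["inactive"] * 80 + ["active"] * 20
null = majority_baseline(y_train)
X_test = list(range(10))
y_test = ["inactive"] * 8 + ["active"] * 2

print(accuracy(null, X_test, y_test))  # 0.8 -- the bar a real model must clear
```

On imbalanced assay data a naive model can look impressive while doing nothing, which is exactly why the null comparison belongs in the definition of success for the model-building effort.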

DataFAIRy Workshop

Context: DataFAIRy project for public bio-assay annotation

The challenge

  • Convert published assay protocol text into high quality FAIR data

  • On the order of 10⁵–10⁶ published open-access bioassay protocols exist today, excluding patents

Proposal: a collaboratively funded curation initiative

Curation model

  • NLP + vetted human expert review and public ontologies (BAO and others)

  • Paying partners access the data immediately; it is released to the public domain after an embargo period.
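
One way to picture this curation model is a record that carries provenance through the NLP-then-expert-review pipeline. The schema below is purely illustrative; field names, identifiers, and statuses are our assumptions, not the project's actual design:

```python
# Hypothetical annotation record: field names, identifiers, and statuses
# are illustrative assumptions, not the project's actual schema.
record = {
    "source_doc": "PMID:0000000",        # placeholder document id
    "assay_format": "cell-based",        # would be a BAO term in practice
    "extracted_by": "nlp-pipeline-v0",   # machine-extraction provenance
    "review_status": "unreviewed",       # flips after human expert review
    "embargo_until": "2021-01-01",       # partners now, public after embargo
}

def expert_verify(rec, reviewer):
    """Human sign-off: returns a new record marked verified, keeping
    provenance of who reviewed it; the original is left untouched."""
    verified = dict(rec)
    verified["review_status"] = "expert_verified"
    verified["reviewed_by"] = reviewer
    return verified

print(expert_verify(record, "curator-01")["review_status"])  # expert_verified
```

Keeping machine-extraction and human-review provenance in the record is what lets consumers weight annotations by how vetted they are.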

Why is this a good idea?

  • Increases the volume of FAIR scientific data available to partners and, subsequently, to the global scientific community.

  • Enables more efficient searches, e.g., for the types of assays that have been used for a target or pathway of interest

  • Facilitates more efficient bench science (assay conditions, tool compounds)

  • Facilitates creation of integrated datasets for predictive modelling (e.g., compound toxicity, selectivity, and target activity).

  • Reduces costs compared to each organization paying the full cost of high-quality curation.

  • Enables automated data mining of assay information

  • Provides a real life example of richly annotated FAIR data generated through collaboration

Current state:

  • PoC defined. Fundraising for PoC led by AstraZeneca + Roche

  • Seeking additional industry partners

Future extensions:

  • Massive assay protocol processing

  • Other data beyond open access assay protocols.

Common themes and main points that emerged in the brainstorming session:

Definition of Success

  • Assay reproducibility

  • Or, being able to combine results from 2+ independent assays

  • What is the minimum set of metadata needed for reproducibility?

Selection of Pilot Assays

  • Need small interesting benchmark

  • Or align assay choice with CROs (Abvance, Charles River, Millipore)

  • Or ADMET?

Minimal Viable Product (MVP) Definition

  • Do a competitive analysis and review already existing systems: BART, BAO, MIABE, …, MIAME

  • What is new in the proposed system in comparison?

  • Prioritize robust assays

  • A pilot emphasizing the business case would be better than a technology demonstration

  • Define target audience

Retrospective or prospective?


We plan to incorporate these ideas into the DataFAIRy project plan and into the best practices paper.

Many thanks to all participants!