Boston 2019 Summary and Materials
On October 22, 2019 we ran a series of workshops on FAIR Data Implementation, on Best Practices in AI and ML, and on DataFAIRy, an emerging project aiming to convert biochemical assays into a FAIR format.
Photo: John Overington, Lihua Yu, Al Wang, and Katherine Gibson (left to right) in the evening panel discussion.
Here is a summary of the main points from these discussions.
FAIR Implementation Workshop
The summary of the FAIR Implementation workshop is available via a separate link.
Best Practices in AI Workshop
Context: the following ideas form the core of our best practices in AI and ML manifesto paper:
Develop application domain knowledge
Train models on top-quality data
Apply FAIR principles to data life-cycle planning
Publish model code, and testing and training data, along with model results
Use a model versioning system (a minimal sketch follows this list)
Select AI/ML methods that fit the problem
Set the right expectations with management and educate colleagues
Combine AI models and human decision-making
Experiment, scale up fast and fail fast
Establish a Center of Excellence (CoE) for “moonshot” innovation challenges
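To make the versioning and publication points above more concrete, here is a minimal sketch in Python of one way to record a model artefact together with a hash of its training data and its key settings, so that results can later be traced and republished. The file names and record fields are illustrative assumptions, not a prescribed scheme.

    # Minimal sketch of the "model versioning" idea: save, alongside each model,
    # a hash of the training data and the key settings used to train it.
    # File names and fields below are illustrative placeholders.
    import hashlib, json, pickle
    from datetime import datetime, timezone

    def save_versioned_model(model, train_data_bytes, params, tag):
        data_hash = hashlib.sha256(train_data_bytes).hexdigest()
        record = {
            "tag": tag,
            "trained_at": datetime.now(timezone.utc).isoformat(),
            "training_data_sha256": data_hash,
            "params": params,
        }
        with open(f"model_{tag}.pkl", "wb") as f:   # the model artefact itself
            pickle.dump(model, f)
        with open(f"model_{tag}.json", "w") as f:   # the version record next to it
            json.dump(record, f, indent=2)
        return record

    # Illustrative usage with a stand-in "model" and toy data
    print(save_versioned_model({"weights": [0.1, 0.2]}, b"toy training data", {"lr": 0.01}, "v1"))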
Common themes and main points that emerged in the brainstorming session:
Validation and Benchmarks
Are the test and training sets drawn from the same distribution? (a minimal check is sketched after this list)
Possible action: create benchmarks for typical tasks of our industry
Priority by demand
Risk of overfitting
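As a concrete illustration of the train/test distribution question above, here is a minimal sketch in Python (using numpy and scipy) that flags features whose training and test samples look differently distributed. The function name, feature names, and threshold are illustrative only.

    # Minimal sketch: per-feature two-sample Kolmogorov-Smirnov test to flag
    # features whose training and test distributions differ noticeably.
    import numpy as np
    from scipy.stats import ks_2samp

    def flag_distribution_shift(X_train, X_test, feature_names, alpha=0.01):
        """Return features where train and test samples look differently distributed."""
        flagged = []
        for i, name in enumerate(feature_names):
            stat, p_value = ks_2samp(X_train[:, i], X_test[:, i])
            if p_value < alpha:        # small p-value: distributions likely differ
                flagged.append((name, stat, p_value))
        return flagged

    # Illustrative usage with synthetic data
    rng = np.random.default_rng(0)
    X_train = rng.normal(0.0, 1.0, size=(500, 2))
    X_test = np.column_stack([rng.normal(0.0, 1.0, 500),    # same distribution
                              rng.normal(0.5, 1.0, 500)])   # shifted feature
    print(flag_distribution_shift(X_train, X_test, ["feature_a", "feature_b"]))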
Common Language
Use language that bench science domain experts can relate to
Process alignment between different parts of an organization
Trans-disciplinary experts
Strategy
When to use AI and when not?
Method selection
Data assessment in organization
What is “quality data”?
Quality dimensions
Profile your data (a small profiling sketch follows this list)
Context of data – is my data unique?
Is it feasible to use only quality data in model building?
Completeness, Accuracy, Coherence, Timeliness
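As one possible way to start profiling data against dimensions such as completeness, accuracy, and timeliness, here is a small sketch in Python using pandas. The column names, valid range, and age threshold are illustrative assumptions, not an agreed standard.

    # Minimal sketch: profiling a tabular dataset against a few of the quality
    # dimensions listed above (completeness, a crude range check as an accuracy
    # proxy, and timeliness). Column names and thresholds are placeholders.
    import pandas as pd

    def profile_quality(df, value_col, date_col, valid_range, max_age_days=365):
        completeness = 1.0 - df[value_col].isna().mean()        # share of non-missing values
        in_range = df[value_col].between(*valid_range).mean()   # crude accuracy proxy
        age_days = (pd.Timestamp.now() - pd.to_datetime(df[date_col])).dt.days
        timely = (age_days <= max_age_days).mean()               # share of recent records
        return {"completeness": completeness, "in_range": in_range, "timely": timely}

    # Illustrative usage
    df = pd.DataFrame({
        "ic50_nM": [12.0, None, 85000.0, 3.4],
        "measured_on": ["2019-01-10", "2018-06-01", "2015-03-20", "2019-09-30"],
    })
    print(profile_quality(df, "ic50_nM", "measured_on", valid_range=(0.0, 10000.0)))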
Quality Models
Retracted datasets problem
Continuous update of models
Compare the AI/ML model to a null model (a baseline comparison is sketched after this list)
Define success for the model-building effort
Build multiple models and compare on the same benchmark?
Does the hypothesis built into the model match that of the user?
Do other models reflect my users’ hypothesis?
Quality metadata for training
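To illustrate the null-model comparison above, here is a minimal sketch in Python using scikit-learn that evaluates a trained model and a majority-class baseline on the same held-out split. The dataset, model, and metric are illustrative choices, not a recommendation.

    # Minimal sketch: comparing a trained model against a null (majority-class)
    # baseline on the same held-out split.
    from sklearn.datasets import load_breast_cancer
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import balanced_accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    null_model = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # A model only "succeeds" here if it clearly beats the null model on the benchmark.
    print("null: ", balanced_accuracy_score(y_test, null_model.predict(X_test)))
    print("model:", balanced_accuracy_score(y_test, model.predict(X_test)))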
DataFAIRy Workshop
Context: DataFAIRy project for public bio-assay annotation
The challenge
Convert published assay protocol text into high quality FAIR data
On the order of 10^5–10^6 published open-access bioassay protocols exist today, excluding patents
Proposal: a collaboratively funded curation initiative
Curation model
NLP plus vetted human expert review, using public ontologies (BAO and others); a toy sketch of this step follows below
Paying partners access the data immediately; it is released to the public domain after an embargo period.
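As a very rough illustration of the proposed curation step, the sketch below matches protocol text against a tiny, made-up vocabulary of assay concepts and routes weak matches to human review. The labels, trigger terms, and threshold are placeholders, not actual BAO content or the project's pipeline.

    # Very rough sketch of the "NLP + human review" curation step: simple term
    # matching against a small hypothetical vocabulary, with anything ambiguous
    # routed to an expert review queue.
    import re

    VOCAB = {                     # hypothetical label -> trigger phrases
        "binding assay": ["binding", "displacement", "kd"],
        "cell viability assay": ["viability", "cytotoxicity", "mtt"],
        "enzyme activity assay": ["inhibition", "ic50", "enzymatic"],
    }

    def annotate(protocol_text):
        text = protocol_text.lower()
        hits = {label: sum(bool(re.search(rf"\b{re.escape(t)}\b", text)) for t in terms)
                for label, terms in VOCAB.items()}
        best = max(hits, key=hits.get)
        needs_review = hits[best] < 2     # weak evidence -> send to a human expert
        return {"suggested_label": best, "evidence": hits[best], "needs_review": needs_review}

    print(annotate("Compounds were tested for enzymatic inhibition; IC50 values were recorded."))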
Why is this a good idea?
Increases the volume of FAIR scientific data available to partners and, subsequently, to the global scientific community.
Enables more efficient searches, e.g. for the types of assays that have been used for a target or pathway of interest (a toy query is sketched after this list)
Facilitates more efficient bench science (assay conditions, tool compounds)
Facilitates creation of integrated datasets for predictive modelling (e.g., compound toxicity, selectivity, and target activity)
Reduces costs compared to each organization paying the full cost of high-quality curation
Enables automated data mining of assay information
Provides a real life example of richly annotated FAIR data generated through collaboration
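To illustrate the search use case above, here is a toy query in Python over a handful of made-up, FAIR-annotated assay records; the records and field names are invented for the example.

    # Illustrative sketch of the "more efficient searches" point: once assays are
    # annotated with FAIR metadata, finding every assay type used for a target of
    # interest becomes a simple query.
    import pandas as pd

    assays = pd.DataFrame([
        {"assay_id": "A1", "target": "EGFR", "assay_format": "biochemical", "readout": "IC50"},
        {"assay_id": "A2", "target": "EGFR", "assay_format": "cell-based",  "readout": "EC50"},
        {"assay_id": "A3", "target": "BRAF", "assay_format": "biochemical", "readout": "Ki"},
    ])

    # What types of assays have been used for EGFR?
    print(assays.loc[assays["target"] == "EGFR", ["assay_format", "readout"]].drop_duplicates())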
Current state:
Proof of concept (PoC) defined; fundraising for the PoC led by AstraZeneca and Roche
Seeking additional industry partners
Future extensions:
Massive assay protocol processing
Other data types beyond open-access assay protocols.
Common themes and main points that emerged in the brainstorming session:
Definition of Success
Assay reproducibility
Or the ability to combine results from two or more independent assays
What is the minimum set of metadata needed for reproducibility? (one possible record is sketched after this list)
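As one possible shape for such a minimum metadata set, here is a small sketch in Python that checks an assay record for a handful of required fields. The field list is an illustrative assumption, not an agreed standard.

    # Sketch of one possible "minimum metadata" record for an assay result and a
    # check that required fields are present before results are combined across
    # assays. The required fields are placeholders for discussion.
    REQUIRED_FIELDS = [
        "assay_format", "detection_technology", "target_identifier",
        "readout_units", "incubation_time", "protocol_reference",
    ]

    def missing_fields(record):
        """Return the required fields that are missing (empty list means it passes)."""
        return [f for f in REQUIRED_FIELDS if not record.get(f)]

    example = {
        "assay_format": "biochemical",
        "detection_technology": "fluorescence intensity",
        "target_identifier": "UniProt:P00533",
        "readout_units": "IC50 (nM)",
        "incubation_time": "60 min",
        "protocol_reference": "doi:10.xxxx/placeholder",
    }
    print(missing_fields(example))   # [] -> all required fields present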
Selection of Pilot Assays
Need a small but interesting benchmark
Or align assay choice with CROs (Abvance, Charles River, Millipore)
Or ADMET?
Minimal Viable Product (MVP) Definition
Do a competitive analysis and review existing systems: BART, BAO, MIABE, …, MIAME
What is new in the proposed system in comparison?
Prioritize robust assays
A pilot emphasizing the business case would be better than a technology demonstration
Define target audience
Retrospective or prospective?
We plan to incorporate these ideas into the DataFAIRy project plan and into the best practices paper.
Many thanks to all participants!