Lessons Learned

  1. Assay descriptions in the dedicated ChEMBL field vary in depth and completeness. For that reason we cannot rely on them when we conduct QC of our own annotations. On the other hand, this also means that our annotations are likely to deliver more value. In addition, activity data in ChEMBL cannot be assumed trustworthy
  2. Papers cited in ChEMBL may not contain suitable assay descriptions; instead, one may have to read multiple references and supplemental information, at a potentially very high cost for paper access
  3. Published academic papers contain errors in assay descriptions, and these errors propagate between papers. When assay panels are cited in the peer-reviewed literature, links to vendor assay panels are often dead because vendors go out of business or merge
  4. Commercial assay panel descriptions are easy to obtain and vendors do not object to using them for data extraction
  5. There are three sources for assay annotations: peer-reviewed papers (hardest to work with), commercial assay panels (easier to work with), and assay annotations already in PubChem in the form of plain text (easiest, but also the smallest group). Many academic papers use commercial assay panels. Based on this, and also on ease of access, commercial assay panels are the best source of information
  6. Using published papers to extract assay descriptions falls under the fair use doctrine and thus does not require paying copyright fees to publishers (beyond subscription or paper-access fees). This was confirmed by a legal opinion in 2020
  7. Academic journals vary in the quality of their published assay descriptions. Nature Chem Bio has the fewest ambiguous details, while ACS Med Chem Lett has the most. It would therefore make sense to work with publishers to establish a standard for disclosure of assay details (a minimal information model or good assay publishing practices). It could be published on the fairsharing.org site (which allows assignment of a permanent DOI). In general, do include publishers in the full-scale project
  8. Explore whether it would be possible to establish a requirement to deposit FAIR assay annotations in public databanks prior to publication, and to refer to these deposited annotations in the Methods sections of papers
  9. There was a need to add fields to the annotation model and to extend the BioAssay Ontology (BAO). Involve the BAO team in the full-scale project
  10. Create a public repository of obsolete assays (e.g. in PubChem)
  11. Maintain an audit trail
  12. Maintain “needs revision” status on individual annotation fields
  13. Version annotations
  14. Use an iterative QC process with multiple rounds and independent workers on each round
  15. Annotators must be experts
  16. Time spent annotating assays varies from 1 to 28 minutes per assay, with a mean of 5.8 minutes
  17. In annotation QC, the rate of significant changes is about 17.5% (10 of 57 annotations)
  18. In a large-scale project, one may have to iteratively refine the NLP model used for automatic annotation
  19. The cost of annotating one assay is on average $30, plus the cost of the volunteer QC team's time
  20. Introduce a “useless data” flag
  21. Primary assays are more interesting than secondary ones, because selectivity assays may be run with less rigor than the primary assay. Hence, focus on publications that report primary target assays, and for selectivity assays, prefer first and foremost those tested at multiple concentrations rather than a single one.
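
The record-keeping lessons above (audit trail, per-field "needs revision" status, versioned annotations, and a "useless data" flag; items 11-15 and 20) could be combined into a single annotation record. The sketch below is a minimal, hypothetical illustration of one way to do this; all class, field, and method names are assumptions, not an existing implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FieldEntry:
    """One annotation field with its own needs-revision status (item 12)."""
    value: str
    needs_revision: bool = False


@dataclass
class AssayAnnotation:
    """A versioned assay annotation with an audit trail (items 11, 13, 20)."""
    assay_id: str
    fields: dict          # field name -> FieldEntry
    version: int = 1
    useless_data: bool = False
    audit_trail: list = field(default_factory=list)

    def _log(self, editor: str, action: str, detail: str) -> None:
        # Every change is appended to the audit trail, never overwritten
        self.audit_trail.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "editor": editor,
            "action": action,
            "detail": detail,
        })

    def update_field(self, name: str, new_value: str, editor: str) -> None:
        old = self.fields.get(name)
        # Replacing the entry clears any pending needs-revision flag,
        # and every update bumps the annotation version
        self.fields[name] = FieldEntry(new_value)
        self.version += 1
        self._log(editor, "update_field",
                  f"{name}: {old.value if old else None!r} -> {new_value!r}")

    def flag_for_revision(self, name: str, editor: str, reason: str = "") -> None:
        self.fields[name].needs_revision = True
        self._log(editor, "flag_for_revision", f"{name}: {reason}")

    def mark_useless(self, editor: str, reason: str = "") -> None:
        self.useless_data = True
        self._log(editor, "mark_useless", reason)
```

Keeping the needs-revision flag on each field, rather than on the whole record, lets a QC reviewer send back a single ambiguous detail without invalidating the rest of the annotation; the monotonically growing audit trail supports the iterative multi-round QC process (item 14).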