
  • Assay descriptions in the corresponding ChEMBL field vary in depth and completeness. For that reason we cannot rely on them when conducting QC of our own annotations. On the other hand, this also means that our annotations are likely to deliver more value.
  • Papers cited in ChEMBL may not contain suitable assay descriptions; instead, one may have to read multiple references and their supplemental information, at a potentially very high cost for access to papers.
  • Published academic papers contain errors in assay descriptions, and these errors propagate between papers. When assay panels are cited in the peer-reviewed literature, links to vendor assay panels are often dead because vendors go out of business or merge.
  • Commercial assay panel descriptions are easy to obtain, and vendors do not object to their use for data extraction.
  • There are three sources of assay annotations: peer-reviewed papers (hardest to work with), commercial assay panels (easier to work with), and assay annotations already present in PubChem as plain text (easiest, but also the smallest group). Many academic papers use commercial assay panels. Based on this, and on ease of access, commercial assay panels are the best source of information.
  • Using published papers to extract assay descriptions falls under the fair use doctrine and thus does not require copyright fees to publishers (beyond subscription or paper access fees). This was confirmed in a legal opinion in 2020.
  • Academic journals vary in the quality of published assay descriptions. Nature Chem Bio has the fewest ambiguous details, while ACS Med Chem Lett has the most. It would therefore make sense to work with publishers to establish a standard for disclosure of assay details (a minimal information model or good assay publishing practices), which could be published on the fairsharing.org site. In general, include publishers in the full-scale project.
  • Maintain an audit trail
  • Maintain “needs revision” status on individual annotation fields
  • Version annotations
  • Use an iterative QC process with multiple rounds and independent workers in each round; a sketch of an annotation record that supports these practices follows this list.
  • Annotators must be experts
  • Time spent annotating assays varies from 1 to 28 minutes per assay, with a mean of 5.8 minutes.
  • In annotation QC, the rate of significant changes is about 10/57 ≈ 17.5%.
  • In a large-scale project, one may have to iteratively refine the NLP model used for automatic annotation.
  • Cost of annotation of one assay is on average:
  • Primary assays are more interesting than secondary ones (because selectivity assays may be run with less rigor than the primary assay). Hence, focus on publications that report primary target assays, and among selectivity assays, consider first and foremost those run at multiple concentrations rather than a single one.
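
The following is a minimal sketch (in Python, chosen only for illustration) of an annotation record that supports the audit trail, per-field "needs revision" status, versioning, and multi-round QC points above. All class names, field names, and example values are assumptions made for this sketch, not an existing schema in ChEMBL, PubChem, or our pipeline.

"""Sketch of an assay-annotation record with an audit trail,
per-field "needs revision" flags, and versioning (illustrative only)."""

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class AuditEntry:
    """One append-only record of who changed which field, and when."""
    timestamp: str
    annotator: str
    field_name: str
    old_value: str
    new_value: str


@dataclass
class AssayAnnotation:
    assay_id: str                          # e.g. a ChEMBL or PubChem assay identifier
    fields: Dict[str, str] = field(default_factory=dict)
    needs_revision: Dict[str, bool] = field(default_factory=dict)  # per-field QC flag
    version: int = 1
    audit_trail: List[AuditEntry] = field(default_factory=list)

    def update_field(self, annotator: str, field_name: str, new_value: str) -> None:
        """Record a change: append to the audit trail and bump the version."""
        old_value = self.fields.get(field_name, "")
        self.audit_trail.append(AuditEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            annotator=annotator,
            field_name=field_name,
            old_value=old_value,
            new_value=new_value,
        ))
        self.fields[field_name] = new_value
        self.needs_revision[field_name] = False   # a fresh edit clears the flag
        self.version += 1

    def flag_for_revision(self, field_name: str) -> None:
        """A QC reviewer flags a single field, not the whole record."""
        self.needs_revision[field_name] = True


# Example: one annotation pass followed by an independent QC round.
if __name__ == "__main__":
    ann = AssayAnnotation(assay_id="EXAMPLE_ASSAY_001")
    ann.update_field("annotator_1", "detection_technology", "TR-FRET")
    ann.flag_for_revision("detection_technology")  # independent reviewer disagrees
    ann.update_field("annotator_2", "detection_technology", "fluorescence intensity")
    print(ann.version, [e.field_name for e in ann.audit_trail])

In this sketch, every field edit appends to the audit trail and increments the record version, while QC reviewers flag individual fields rather than whole records, which matches the iterative, multi-round QC process described above.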
