NLP

Draft Title: NLP

Date: November 2021

Who to Contact: projectenquiries@pistoiaalliance.org

Project Idea Description:

Pharma companies apply NLP methods in the hope of automating work and generating insight. Although NLP algorithms have matured considerably in recent years, the practical value of most NLP pilots tends to be poor, and very few NLP-driven projects are seen through to production. The exceptions typically combine good metadata quality, large amounts of training data, business colleagues willing to verify results, and a serendipitous match between technical expertise and a suitable use case. Roche manufacturing, for example, has benchmarked the quality of many NLP methods and pipelines for concept extraction, synonym expansion and taxonomy generation. As a result, they have the internal expertise to know that taxonomy generation should be dropped, that synonym expansion is a mature approach worth keeping, and that concept extraction works with sufficient quality for certain pairings of document type and algorithm. This is the type of knowledge that could be valuable to share in a pre-competitive manner among Pistoia Alliance members. A simple database could hold, for each entry, a characterisation of the use case, a characterisation of the data, the pipelines and algorithms used, the quality criteria, the outcomes, and comments.
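To make the idea of a simple database more concrete, the following is a minimal sketch of what one shared record and a basic lookup might look like. The class name, field names, outcome labels and the `find_records` helper are all assumptions made for illustration; they mirror the fields listed above but are not an agreed schema, and the example entries are placeholders rather than real results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UseCaseRecord:
    """One shared entry. Field names are hypothetical, not an agreed schema."""
    use_case: str                # characterisation of the use case
    data_characterisation: str   # e.g. document type, metadata quality
    pipeline: List[str]          # pipelines and algorithms used
    quality_criteria: str        # how quality was judged
    outcome: str                 # assumed labels: "successful" / "unsuccessful"
    comments: str = ""


def find_records(records: List[UseCaseRecord],
                 algorithm: Optional[str] = None,
                 outcome: Optional[str] = None) -> List[UseCaseRecord]:
    """Return the records matching every filter that was supplied."""
    matches = []
    for record in records:
        if algorithm is not None and algorithm not in record.pipeline:
            continue
        if outcome is not None and record.outcome != outcome:
            continue
        matches.append(record)
    return matches


# Placeholder entries only; these do not describe any real project.
EXAMPLE_RECORDS = [
    UseCaseRecord(
        use_case="placeholder use case A",
        data_characterisation="placeholder document type",
        pipeline=["algorithm-x"],
        quality_criteria="placeholder criterion",
        outcome="successful",
    ),
    UseCaseRecord(
        use_case="placeholder use case B",
        data_characterisation="placeholder document type",
        pipeline=["algorithm-x", "algorithm-y"],
        quality_criteria="placeholder criterion",
        outcome="unsuccessful",
        comments="placeholder comment",
    ),
]
```

A member considering a new pilot could then call, for instance, `find_records(EXAMPLE_RECORDS, algorithm="algorithm-x", outcome="unsuccessful")` to see where a given approach has already been tried without success. Keeping unsuccessful entries in the same structure as successful ones is deliberate, since the proposal values both.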

To benefit from the opportunity NLP offers, every pharma company has tested NLP on its own records, often with the same libraries and platforms (such as spaCy). Yet only in rare cases do NLP pilots reach a quality sufficient for daily use by our colleagues. While NLP platforms advance every year, successful use cases typically have very specific applications and tend to require training and optimisation parameters that are difficult to find. In this initiative, participating members share all relevant successful AND unsuccessful NLP use cases from their companies. The goal is to gather a short list of parameters that describe each use case: the specific algorithm used, the training and data tried, etc. The value for participating members would be a reference database that helps narrow down potentially successful use case scenarios, leading to less experimentation and more successes.