Large Language Models in Biological R&D

We wish to explore the use of Large Language Models for biological research, using target discovery and validation as the initial use case. Target discovery was picked as a use case because it is a common process in all pharmaceutical R&D businesses that requires mining of large volumes of information. We plan to use prompt-tuned LLMs on a highly structured public data resource for the Retrieval-Augmented Generation (RAG) of plain English answers to the typical research questions asked in target discovery. Expected project outputs are a set of guidelines for the most advantageous use of LLMs in research and an open-source target discovery pipeline with prompt-tuned Large Language Models.

Problem Statement

Large Language Models (LLMs), exemplified by ChatGPT 4, have attracted a lot of attention recently. However, the best use cases for LLMs in the R&D setting are not well understood, and there is no consensus yet on what a realistic pre-competitive project in this space could be. As an initial use case, we propose to create an open-source system for target discovery based on public data and pre-trained LLMs. Target discovery is a common and critical task in drug discovery that typically requires complex data mining of an ever-increasing body of knowledge and placing proprietary research results into the context of public information.

Value Proposition and Expected Results

  • The proposed approach would allow natural language queries to be translated into structured queries, executed over standardized data sources (for instance, Open Targets), and converted into human-readable outputs.

  • The project does not require participating companies to disclose any of their proprietary data. However, they can mine their proprietary data by using private instances of the described pipeline.

  • One significant expected outcome is a set of lessons learned on best practices for deployment, prompt-tuning, and fine-tuning of LLMs for research purposes, and on the limits of their applicability. We will seek to publish these lessons learned for the benefit of the research community.

  • Another significant expected outcome is an open-source target discovery pipeline prototype.

  • Improved efficiency and accuracy in target discovery and validation.

  • Creation of a framework that can be used for other use cases:

    • A model of project execution for other pre-competitive core model work.

    • Additional prototypes for other common discovery tasks can be created if/when more suitable use cases are identified.

Alignment with the Pistoia Alliance Strategic Priorities

This project is part of the Artificial Intelligence at Scale strategic priority.

Project Scope

In scope:

Preparatory steps:

  • Define the most common research questions in target discovery and validation. Agree within the project team that these are indeed the core target discovery business questions, and rank them by vote according to perceived relative importance. If there are many such questions, pick the top ones and agree on exactly how many. The following paper can serve as a starting point for listing relevant competency questions: https://www.sciencedirect.com/science/article/pii/S1359644613001542 (Failure to identify business questions, or picking too many or too few, is a project risk)

  • Open Targets can serve as a publicly available standardized data source for this use case. Validate that Open Targets either has a ready-to-use Knowledge Graph (KG) implementation or can be converted into a KG at reasonable cost (this is a known project risk)

  • Select a Large Language Model engine from publicly accessible sources. (Failure to identify a suitable open LLM is a project risk)

Prompt-tuning procedure:

  • Retrieval Augmented Generation (RAG):

    • Ask a plain English question using a prompt-tuned version of one of the questions from the business questions collection

    • This question is converted into a structured query by an LLM (Failure to generate a proper query for a KG database system is a risk)

    • Execute this query over a structured controlled data source (e.g. Open Targets DB)

    • An LLM converts the raw output of the query into a human-readable answer

    • An expert compares this answer or answers with the expected one(s)

  • An accuracy metric is computed, and the inputs and outputs are saved in a database

  • Prompt-tune the opening plain text question to maximize the output quality

  • Repeat this tuning cycle for all business questions in the collection
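The tuning cycle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the LLM, the query executor, and the expert judge are modelled as plain callables, and the names run_once, tune, RunRecord, fake_llm, and fake_execute are hypothetical. The toy stand-ins at the bottom replace a real model and a real Open Targets query endpoint so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: an LLM is modelled as a text-in, text-out callable and the data
# source as a function from a query string to result rows. Both are stand-ins.
LLM = Callable[[str], str]
Executor = Callable[[str], List[dict]]
Judge = Callable[[str], bool]  # stand-in for the expert comparison step


@dataclass
class RunRecord:
    """One logged run: prompt, generated query, raw rows, answer, verdict."""
    prompt: str
    query: str
    rows: List[dict]
    answer: str
    correct: bool


def run_once(prompt: str, llm: LLM, execute: Executor, judge: Judge) -> RunRecord:
    # 1. The LLM converts the plain English question into a structured query.
    query = llm(f"Translate into a structured query:\n{prompt}")
    # 2. The query is executed over the structured data source.
    rows = execute(query)
    # 3. The LLM converts the raw rows into a human-readable answer.
    answer = llm(f"Question: {prompt}\nRows: {rows}\nAnswer in plain English.")
    # 4. The expert verdict is recorded next to all inputs and outputs.
    return RunRecord(prompt, query, rows, answer, judge(answer))


def accuracy(records: List[RunRecord]) -> float:
    """Fraction of runs judged correct (0.0 for an empty list)."""
    return sum(r.correct for r in records) / len(records) if records else 0.0


def tune(variants: Dict[str, List[str]], llm: LLM, execute: Executor,
         judges: Dict[str, Judge], log: List[RunRecord],
         repeats: int = 1) -> Dict[str, str]:
    """For each business question, keep the prompt variant with the best accuracy."""
    best: Dict[str, str] = {}
    for question_id, prompts in variants.items():
        scored = []
        for prompt in prompts:
            runs = [run_once(prompt, llm, execute, judges[question_id])
                    for _ in range(repeats)]
            log.extend(runs)  # stands in for saving inputs and outputs to a DB
            scored.append((accuracy(runs), prompt))
        best[question_id] = max(scored, key=lambda pair: pair[0])[1]
    return best


# Toy stand-ins so the sketch runs without any model or database.
def fake_llm(text: str) -> str:
    if text.startswith("Translate"):
        return "GOOD_QUERY" if "targets associated with" in text else "BAD_QUERY"
    return "EGFR is a candidate target." if "EGFR" in text else "No answer found."


def fake_execute(query: str) -> List[dict]:
    return [{"symbol": "EGFR"}] if query == "GOOD_QUERY" else []


log: List[RunRecord] = []
best = tune(
    {"q1": ["Genes for lung cancer?",
            "List targets associated with lung cancer."]},
    fake_llm, fake_execute, {"q1": lambda answer: "EGFR" in answer}, log,
)
```

In this toy run the second wording is kept for question q1 because only it yields a usable query. With a real, non-deterministic model, a repeats value greater than 1 would give a steadier accuracy estimate per prompt variant.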

Historic (obsolete) versions of the above steps:

  • Historic vision for the query (this is obsolete and is only preserved here for scope traceability):

    • Define a plain text data source for mining; one of the choices can be the entire set of paper abstracts indexed in PubMed plus perhaps the entire collection of open-source papers. (Failure to download this large volume of data can be a risk)

  • Historic vision for the QA (likely not necessary):

    • Either the same LLM or another one converts this answer to a Knowledge Graph. (KG generation from text is a source of risk)

    • This answer Knowledge Graph is compared to the KG of the original data source (such as Open Targets). This comparison must be local; in other words, irrelevant sections of the larger knowledge graph should not be considered. (Inability to find a ready-made KG comparison algorithm, or to code one from scratch, is a risk).
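For traceability only, here is a minimal sketch of what such a local comparison could look like. It rests on assumptions that are not part of the charter: both graphs are represented as sets of (subject, predicate, object) triples, "local" is taken to mean reference triples whose subject and object are both mentioned in the answer graph, and the score is precision, recall, and F1 over triples. The function names and the toy triples are hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def entities(triples: Set[Triple]) -> Set[str]:
    """All subjects and objects that occur in a set of triples."""
    return {node for s, _, o in triples for node in (s, o)}


def local_reference(reference: Set[Triple], answer: Set[Triple]) -> Set[Triple]:
    """Assumed locality rule: keep reference triples whose subject and object
    are both mentioned in the answer KG; everything else is ignored."""
    mentioned = entities(answer)
    return {(s, p, o) for s, p, o in reference if s in mentioned and o in mentioned}


def compare(reference: Set[Triple], answer: Set[Triple]) -> Dict[str, float]:
    """Score the answer KG against the local part of the reference KG."""
    local = local_reference(reference, answer)
    shared = answer & local
    precision = len(shared) / len(answer) if answer else 0.0
    recall = len(shared) / len(local) if local else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data for illustration; the last answer triple is deliberately wrong.
reference_kg = {
    ("EGFR", "associated_with", "lung cancer"),
    ("EGFR", "targeted_by", "gefitinib"),
    ("gefitinib", "treats", "lung cancer"),
    ("TP53", "associated_with", "breast cancer"),  # outside the local region
}
answer_kg = {
    ("EGFR", "associated_with", "lung cancer"),
    ("EGFR", "targeted_by", "gefitinib"),
    ("gefitinib", "treats", "breast cancer"),
}
scores = compare(reference_kg, answer_kg)
```

The TP53 triple does not lower the score because TP53 never appears in the answer graph, which is the intended effect of keeping the comparison local.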

Not in scope:

  • Proprietary data are not used to train any model and project participants are not asked to share proprietary data in any way. But proprietary data may be analyzed in the context of this set-up by the individual participants using private instances of the pipeline.

  • Training of a brand new LLM is out of scope. The plan is to only prompt-tune an existing LLM.

  • Fine-tuning of an existing LLM on a body of biomedical knowledge is generally out of scope, but may be considered as a project extension or option if sufficient quality of results cannot be achieved with prompt-tuning alone and if finances allow.

Project Phases and Milestones

Phase: Initiation

Milestones: Project charter

Deliverables:

  1. A list of candidate pre-competitive projects

  2. One or more projects selected by vote

  3. Project charter is drafted for the winning idea

  4. Raise minimal funds for the Elaboration phase

Est. Date: 12/11/23 (Complete)

Phase: Elaboration

Milestones:

  1. Development plan

  2. Cost estimates

Deliverables:

  1. Risks analysis – see Risk Registry below

  2. Technology analysis to address the identified risks

  3. Work Breakdown Structure (WBS)

  4. Cost estimates

  5. Time estimates

  6. Gantt Chart for Construction with additional iterations as needed and a work schedule

  7. Make feasibility decisions before committing to build

Est. Date: Q1 2024

Phase: Construction

Milestones:

  1. Target discovery pipeline

  2. Lessons learned published

Deliverables:

  1. Target discovery pipeline – detailed deliverables are not yet known

  2. Lessons learned recorded and published

Est. Date: Q3 2024

Phase: Transition

Milestones: Sustainability achieved

Deliverables:

  1. Place the prototype into maintenance mode or outsource for continuous development by another organization (e.g. non-profit)

  2. Plan extension work, if any

Est. Date: TBD

Risk Registry

Risks in green are resolved

Risks in yellow are in active research

Risks in white are general in nature

Description: Failure to identify business questions, or picking too many or too few

Mitigation: Draft appropriate business questions - DONE; but not all business questions can be answered with specific technologies, so must take this factor into account

Responsible Party: Lee Harland, John Wise, Bruce Press, Peter Revill

Description: Validate that Open Targets either has a ready to use Knowledge Graph implementation, or can be converted into a KG with reasonable cost

Responsible Party: The Hyve; Jordan Ramsdell; Robert Gill; Brian Evarts; Open Targets/EBI: Sebastian Lobentanzer, Ellen McDonagh

Description: Failure to identify a suitable LLM

Mitigation:

  • See this comparison

  • Recommend to focus on the Cypher query generation ability as the key risk (below)

  • Start with one open-source and one closed-source LLMs (say Mistral and GPT 4) and agree to explore others later, and meanwhile close this risk

Responsible Party: Jon Stevens, Etzard Stolte, Helena Deus; Brian Evarts; Wouter Franke, Matthijs van der Zee

Description: Failure to generate a proper query for a KG database system by an LLM

Mitigation: Technology research.

  • See refs 7, 8, 13, 14 below

  • BioCypher by EBI may have this capability already - needs evaluation

Responsible Party: The Hyve; Jordan Ramsdell; Robert Gill; Brian Evarts; Open Targets/EBI: Sebastian Lobentanzer, Ellen McDonagh

Description: Does Open Targets use an ontology?

Mitigation: Yes in general

Responsible Party: The Hyve

Description: Failure to download a large volume of data (all of the PubMed as a maximum) for the prompt-tuning of the LLM

Mitigation: CLOSED RISK. This may be unnecessary, TBD

Description: Failure to perform local KG comparison with calculation of a score

Mitigation: CLOSED RISK

Description: Failure to build a prototypical target discovery pipeline on the limited budget in case of mounting technical difficulties

Mitigation: Schedule the project in phases. Aim to answer known unknowns and to establish risk mitigation strategies early in this phase (“project elaboration”)

Description: Some proprietary LLMs may be censored, thus introducing uncontrollable bias in the answers that they produce

Mitigation:

  • DONE: Censorship may already be included in the performance scores, so this is taken care of in the comparison of the LLMs. However, there is team preference for open-source and uncensored LLMs

Responsible Party: Identified and resolved in the LLM sub-team

Project Stakeholders

Sponsors:

  • Lars Greiffenberg, Abbvie

  • Brian Martin, AstraZeneca

Project Participants:

Stakeholder mailing list in Google Groups: https://groups.google.com/a/pistoiaalliance.org/g/llm-project

Meetings

Every other week at 8 am PST (= 11 am EST = 4 pm London = 5 pm Berlin) starting on January 10th, 2024

Lessons Learned

  • The highest-risk item is the generation of a structured query (Cypher or SPARQL) from a plain English request. Some publications estimate a success rate of about 48% on the first attempt.

  • The structure of the database used for queries matters. LLMs can more easily produce meaningful structured queries for databases with a flat, simple structure.

  • A practically useful system requires filtering or secondary mining of the output in addition to natural language narration.

  • It is extremely important to implement a reliable named entity recognition system. The same acronym can refer to completely different entities, which can be differentiated either from the context (hard) or by asking clarifying questions. Synonyms must also be mapped. Without these measures, naïve queries in a RAG environment will fail.
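The last lesson can be illustrated with a small, dictionary-based resolution step placed in front of query generation. This is a sketch under assumptions, not a named entity recognition system: the synonym table, the identifiers (such as target:ERBB2), and the names resolve and Resolution are hypothetical placeholders, and a real pipeline would draw synonyms from a curated resource instead. An expected entity type stands in for "context"; when more than one candidate remains, the caller is expected to ask a clarifying question.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical synonym table: lower-cased mention -> candidate canonical IDs.
# The IDs are illustrative placeholders in a made-up "type:name" scheme.
SYNONYMS: Dict[str, List[str]] = {
    "her2": ["target:ERBB2"],
    "erbb2": ["target:ERBB2"],
    "multiple sclerosis": ["disease:multiple_sclerosis"],
    "mass spectrometry": ["method:mass_spectrometry"],
    # An ambiguous acronym maps to several candidates.
    "ms": ["disease:multiple_sclerosis", "method:mass_spectrometry"],
}


@dataclass
class Resolution:
    mention: str
    entity: Optional[str]   # set only when exactly one candidate remains
    candidates: List[str]   # more than one entry: ask a clarifying question


def resolve(mention: str, expected_type: Optional[str] = None) -> Resolution:
    """Map a mention to a canonical ID, optionally narrowed by an expected type."""
    candidates = SYNONYMS.get(mention.strip().lower(), [])
    if expected_type is not None:
        # Context, reduced here to an expected entity type, filters candidates.
        candidates = [c for c in candidates if c.startswith(expected_type + ":")]
    entity = candidates[0] if len(candidates) == 1 else None
    return Resolution(mention, entity, candidates)


synonym_hit = resolve("HER2")                    # synonym mapped to one ID
ambiguous = resolve("MS")                        # two candidates, no decision
disambiguated = resolve("MS", expected_type="disease")
unknown = resolve("XYZ123")                      # nothing in the table
```

Only a mention that resolves to a single entity would be passed on to query generation; an ambiguous or unknown mention is a signal to go back to the user before a structured query is produced.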

References

  1. https://www.sciencedirect.com/science/article/pii/S1359644613001542

  2. https://www.nature.com/articles/s41573-020-0087-3

  3. https://www.epam.com/about/newsroom/press-releases/2023/epam-launches-dial-a-unified-generative-ai-orchestration-platform

  4. https://epam-rail.com/open-source

  5. Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

  6. Chatbot Arena: https://chat.lmsys.org/?arena

  7. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning

    https://arxiv.org/abs/2310.01061

  8. Knowledge-Consistent Dialogue Generation with Language Models and Knowledge Graphs

    https://openreview.net/forum?id=WhWlYzUTJfP

  9. BioChatter Benchmark Results: https://biochatter.org/benchmark-results/#biochatter-query-generation

  10. MTEB Benchmark (embeddings): https://huggingface.co/spaces/mteb/leaderboard

  11. Lora-Land and Lorax: https://predibase.com/lora-land

  12. A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases. Summary: queries over a KG with GPT 4 are much more accurate than queries over a SQL database with GPT 4. https://arxiv.org/abs/2311.07509

  13. https://towardsdatascience.com/evaluating-llms-in-cypher-statement-generation-c570884089b3

  14. https://medium.com/neo4j/enhancing-the-accuracy-of-rag-applications-with-knowledge-graphs-ad5e2ffab663

  15. linkedlifedata.com

  16. Kazu - Biomedical NLP Framework: https://github.com/AstraZeneca/KAZU

  17. https://github.com/f/awesome-chatgpt-prompts/tree/main

  18. Zhou, L., Schellaert, W., Martínez-Plumed, F. et al. Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024). https://doi.org/10.1038/s41586-024-07930-y

  19. Karthik Soman, Peter W Rose, John H Morris, Rabia E Akbas, Brett Smith, Braian Peetoom, Catalina Villouta-Reyes, Gabriel Cerono, Yongmei Shi, Angela Rizk-Jackson, Sharat Israni, Charlotte A Nelson, Sui Huang, Sergio E Baranzini, Biomedical knowledge graph-optimized prompt generation for large language models, Bioinformatics, Volume 40, Issue 9, September 2024, btae560, https://doi.org/10.1093/bioinformatics/btae560

  20. https://www.promptingguide.ai/