We wish to explore the use of Large Language Models (LLMs) for biological research, with target discovery and validation as the initial use case. Target discovery was chosen because it is a process common to all pharmaceutical R&D organizations and requires mining large volumes of information. We plan to apply prompt-tuned LLMs to a highly structured public data resource, using Retrieval-Augmented Generation (RAG) to produce plain-English answers to the research questions typically asked in target discovery. The expected project outputs are a set of guidelines for the most advantageous use of LLMs in research and an open-source target discovery pipeline built on prompt-tuned LLMs.

Problem Statement

Large Language Models (LLMs), exemplified by ChatGPT 4, have attracted a lot of attention recently. However, the best use cases for LLMs in the R&D setting are not well understood, and there is no consensus yet on what a realistic pre-competitive project in this space could be. As an initial use case, we propose to create an open-source system for target discovery based on public data and pre-trained LLMs. Target discovery is a common and critical task in drug discovery that typically requires complex data mining of an ever-increasing body of knowledge and placing proprietary research results into the context of public information.

Value Proposition and Expected Results

Alignment with the Pistoia Alliance Strategic Priorities

This project is part of the Artificial Intelligence at Scale strategic priority.

Project Scope

In scope:

Preparatory steps:

Prompt-tuning procedure:

Retrieval-Augmented Generation (RAG):
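
For illustration only, the retrieval and prompt-assembly steps of RAG could be sketched as below. This is a minimal sketch and not the project's design: the toy triples, the names `FACTS`, `retrieve` and `build_prompt`, and the keyword-overlap ranking are all assumptions made for this example. The call to an LLM is deliberately left out.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Toy, made-up facts standing in for a structured public data resource.
# The identifiers are placeholders, not real biomedical associations.
FACTS: List[Triple] = [
    ("GENE_A", "associated_with", "DISEASE_X"),
    ("GENE_B", "associated_with", "DISEASE_Y"),
    ("GENE_A", "expressed_in", "TISSUE_1"),
]


def retrieve(question: str, facts: List[Triple], k: int = 2) -> List[Triple]:
    """Return up to k facts that share at least one term with the question."""
    words = set(question.upper().replace("?", " ").split())
    # Score each fact by the number of its terms found in the question.
    scored = [(sum(term.upper() in words for term in fact), fact) for fact in facts]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in hits[:k]]


def build_prompt(question: str, context: List[Triple]) -> str:
    """Assemble a prompt that grounds the answer in the retrieved facts."""
    lines = [f"- {s} {p} {o}" for s, p, o in context]
    return (
        "Answer using only these facts:\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )


if __name__ == "__main__":
    question = "Which targets are linked to DISEASE_X?"
    # The resulting prompt would then be sent to an LLM (not shown here).
    print(build_prompt(question, retrieve(question, FACTS)))
```

In a real pipeline the keyword match would presumably be replaced by a query against the chosen knowledge resource, and the assembled prompt would be passed to the selected LLM.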

Not in scope:

Fine-tuning an existing LLM on a body of biomedical knowledge is generally out of scope. It may be considered as a project extension or option if prompt-tuning alone cannot achieve sufficient quality of results and if funding allows.

Project Phases and Milestones

Phase: Initiation
Milestones: Project charter
Deliverables:
  1. A list of candidate pre-competitive projects
  2. One or more projects selected by vote
  3. Project charter drafted for the winning idea
  4. Minimal funds raised for the Elaboration phase
Est. date: 12/11/23 (Complete)

Phase: Elaboration
Milestones:
  1. Development plan
  2. Cost estimates
Deliverables:
  1. Risk analysis (see Risk Registry below)
  2. Technology analysis to address the identified risks
  3. Work Breakdown Structure (WBS)
  4. Cost estimates
  5. Time estimates
  6. Gantt chart for Construction, with additional iterations as needed, and a work schedule
  7. Feasibility decisions made before committing to build
Est. date: Q1 2024

Phase: Construction
Milestones:
  1. Target discovery pipeline
  2. Lessons learned published
Deliverables:
  1. Target discovery pipeline (detailed deliverables are not yet known)
  2. Lessons learned recorded and published
Est. date: TBD

Phase: Transition
Milestones: Sustainability achieved
Deliverables:
  1. Place the prototype into maintenance mode or outsource it for continuous development by another organization (e.g. a non-profit)
  2. Plan extension work, if any
Est. date: TBD

Risk Registry

Description: Failure to identify business questions, or picking too many or too few
Mitigation: Establish a consensus on the minimal number of business questions

Description: Validate that Open Targets either has a ready-to-use Knowledge Graph (KG) implementation or can be converted into a KG at reasonable cost
Mitigation:
  1. Technology research: review Open Targets
  2. Review preliminary work done at AbbVie
  3. If no KG is available, estimate the conversion process
  4. If estimates indicate infeasibility, this may become a gap

Description: Failure to identify a suitable open LLM
Mitigation: This is not yet known and represents a gap

Description: Failure to download a large volume of data (at most, all of PubMed) for the prompt-tuning of the LLM
Mitigation: This is not yet known and represents a gap

Description: Failure to perform KG generation from text by an LLM
Mitigation:
  1. Technology research
  2. If no ready-to-use technology exists, estimate bespoke development (tuning an existing LLM for this purpose)

Description: Failure to perform local KG comparison with calculation of a score
Mitigation:
  1. Technology research
  2. If no ready-to-use technology exists, estimate bespoke development
  3. If estimates indicate infeasibility, this may become a gap

Description: Failure to generate a proper query for a KG database system by an LLM
Mitigation: Technology research. Code generation by LLMs is a common task, so this risk may be seen as low

Description: Failure to build a prototypical target discovery pipeline on the limited budget in case of mounting technical difficulties
Mitigation: Schedule the project in phases. Aim to answer known unknowns and to establish risk mitigation strategies early, during the Elaboration phase

Project Stakeholders

Sponsors:

Stakeholder mailing list in Google Groups: https://groups.google.com/a/pistoiaalliance.org/g/llm-project