...
Define the most common research questions in target discovery and validation. Establish agreement within the project team that these are indeed the core target-discovery business questions, and rank-order them by vote according to perceived relative importance. If there are many such questions, pick the top ones, and agree in advance on exactly how many. This paper can serve as a starting point for listing relevant competency questions: https://www.sciencedirect.com/science/article/pii/S1359644613001542 (Failure to identify business questions, or picking too many or too few, is a project risk)
Open Targets can serve as a publicly available, standardized data source for this use case. Validate that Open Targets either has a ready-to-use Knowledge Graph implementation or can be converted into a KG at reasonable cost (this is a known project risk); a query sketch is shown below.
Select a Large Language Model engine from publicly accessible sources. (Failure to identify a suitable open LLM is a project risk)
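To illustrate the kind of standardized, programmatic access Open Targets already provides, here is a minimal sketch that queries the public Open Targets Platform GraphQL endpoint for target-disease associations. The endpoint URL, the GraphQL field names, and the example Ensembl gene ID are assumptions based on the current public API and may change between releases; if the data are instead loaded into a graph database, the same role would be played by a Cypher or SPARQL query.

```python
# Minimal sketch: fetch target-disease associations from the public
# Open Targets Platform GraphQL API (assumed endpoint; check the current docs).
import json
import urllib.request

OPEN_TARGETS_URL = "https://api.platform.opentargets.org/api/v4/graphql"

QUERY = """
query targetAssociations($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    approvedSymbol
    associatedDiseases {
      rows { disease { id name } score }
    }
  }
}
"""

def disease_associations(ensembl_id: str) -> list[dict]:
    """Return disease associations for a target as plain dicts."""
    payload = json.dumps({"query": QUERY, "variables": {"ensemblId": ensembl_id}}).encode()
    req = urllib.request.Request(
        OPEN_TARGETS_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    target = data["data"]["target"]
    return [
        {"target": target["approvedSymbol"],
         "disease": row["disease"]["name"],
         "score": row["score"]}
        for row in target["associatedDiseases"]["rows"]
    ]

if __name__ == "__main__":
    # BRAF as an illustrative example (Ensembl ID ENSG00000157764).
    for assoc in disease_associations("ENSG00000157764")[:5]:
        print(assoc)
```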
Prompt-tuning procedure:
Retrieval Augmented Generation (RAG):
Ask plain English question using prompt-tuned version of one of the questions from the business questions collection
This question is converted into a structured query by an LLM (Failure to generate a proper query for a KG database system is a risk)
Execute this query over a structured controlled data source (e.g. Open Targets DB)
Convert raw output of the query into a human-readable answer using an LLM
An expert compares this answer or answers with the expected one(s)
An accuracy metric is computed, and inputs and outputs are saved in a database
Prompt-tune the opening plain text question to maximize the output quality
Repeat this tuning cycle for all business questions in the collection (a minimal sketch of the loop follows these steps)
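The sketch below makes the tuning cycle concrete. It assumes the LLM is available as a simple callable, the structured query is executed by a `run_query` helper (for example, wrapping the Open Targets API call sketched earlier), scoring is a crude substring match standing in for the expert comparison and the agreed accuracy metric, and runs are logged to a local SQLite file. None of these choices are project decisions; they are placeholders for illustration.

```python
# Sketch of the prompt-tuning loop described above (illustrative only).
import sqlite3

def answer_question(question, prompt_template, llm, run_query):
    """One pass of the pipeline: question -> structured query -> raw result -> answer."""
    structured_query = llm(prompt_template.format(question=question))  # LLM writes the KG query
    raw_result = run_query(structured_query)                           # execute against the controlled source
    return llm(f"Rewrite this query result as a concise, human-readable answer:\n{raw_result}")

def tune_prompt(question, expected_answer, candidate_templates, llm, run_query,
                db_path="tuning_runs.sqlite"):
    """Try each candidate prompt template, log every run, return the best (template, score)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS runs "
                "(question TEXT, template TEXT, answer TEXT, score REAL)")
    best_template, best_score = None, -1.0
    for template in candidate_templates:
        answer = answer_question(question, template, llm, run_query)
        # Placeholder metric: the real pipeline would use the expert comparison
        # and whatever accuracy metric the project team agrees on.
        score = 1.0 if expected_answer.lower() in answer.lower() else 0.0
        con.execute("INSERT INTO runs VALUES (?, ?, ?, ?)",
                    (question, template, answer, score))
        if score > best_score:
            best_template, best_score = template, score
    con.commit()
    con.close()
    return best_template, best_score
```

Repeating `tune_prompt` over every question in the agreed collection corresponds to the last step of the list above.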
Historic (obsolete) versions of the above steps:
Historic vision for the query (this is obsolete and is only preserved here for scope traceability):
Define a plain-text data source for mining; one choice could be the entire set of paper abstracts indexed in PubMed, plus perhaps the entire collection of open-access papers. (Failure to download this large volume of data can be a risk)
Ask a plain English question from the collection of business questions identified above
An LLM uses the question and the data source to produce a human-readable answer
Historic vision for the QA (likely not necessary):
Either the same or another LLM converts this answer to a Knowledge Graph. (KG generation from text is a source of risk)
This answer Knowledge Graph is compared to the KG of the original data source (such as Open Targets). This comparison must be local; in other words, irrelevant sections of the larger knowledge graph should not be considered. (The ability to find a ready-made KG comparison algorithm, or to code one from scratch, is a risk)
An accuracy metric is computed + inputs and outputs saved in some DB
Prompt-tune the opening plain text question to maximize the output quality
Repeat this tuning cycle for all business questions in the collection
Use these optimized prompts in RAG, below
Retrieval Augmented Generation (RAG):
Ask plain English question using prompt-tuned version of one of the questions from the business questions collection
- This question is converted into a structured query by an LLM (Failure to generate a proper query for a KG database system is a risk)
Execute this query over a structured controlled data source (e.g. Open Targets DB).
Convert raw output of the query into a human-readable answer using an LLM
- Quality assurance: it is desirable to produce a few instances of questions with known answers, ideally authored by human experts, and then use these question-answer pairs as sanity checks of the RAG pipeline (a minimal sketch follows below)
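A minimal sketch of such a sanity check, assuming the expert-authored question-answer pairs are kept in a small JSON file and the RAG pipeline is exposed as a single callable (for example, the `answer_question` sketch above with its other arguments bound). The gold-set format and the substring pass/fail criterion are illustrative assumptions, not agreed project choices.

```python
# Sanity-check sketch: run expert-curated question/answer pairs through the
# RAG pipeline and report how many answers contain the expected key fact.
import json

def run_sanity_checks(gold_path: str, pipeline) -> tuple[int, int]:
    """`pipeline` is any callable mapping a question string to an answer string."""
    with open(gold_path) as fh:
        gold_pairs = json.load(fh)          # [{"question": ..., "expected": ...}, ...]
    passed = 0
    for pair in gold_pairs:
        answer = pipeline(pair["question"])
        if pair["expected"].lower() in answer.lower():   # naive pass/fail check
            passed += 1
        else:
            print(f"FAILED: {pair['question']!r} -> {answer!r}")
    return passed, len(gold_pairs)
```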
Not in scope:
Proprietary data are not used to train any model, and project participants are not asked to share proprietary data in any way. Proprietary data may, however, be analyzed within this set-up by individual participants using private instances of the pipeline.
Training of a brand new LLM is out of scope. The plan is to only prompt-tune an existing LLM.
...
Phase | Milestones | Deliverables | Est Date |
Initiation | Project charter | | 12/11/23 (Complete) |
Elaboration | | | Q1 2024 |
Construction | | | Q3 2024 (TBD) |
Transition | Sustainability achieved | | TBD |
Risk Registry
Risks in green are resolved
Risks in yellow are in active research
Risks in white are general in nature
Description | Mitigation | Responsible Party |
Failure to identify business questions, or picking too many or too few | Draft appropriate business questions - DONE; however, not all business questions can be answered with specific technologies, so this factor must be taken into account | Lee Harland, John Wise, Bruce Press, Peter Revill |
| | The Hyve; Open Targets/EBI: Jon Stevens, Etzard Stolte, Helena Deus; Brian Evarts; Wouter Franke, Matthijs van der Zee |
Failure to generate a proper query for a KG database system by an LLM | Technology research. Yes in general | The Hyve; Open Targets/EBI: Sebastian Lobentanzer, Ellen McDonagh |
Failure to download a large volume of data (all of PubMed as a maximum) for the prompt-tuning of the LLM | CLOSED RISK: this may be unnecessary (TBD) | |
Failure to perform local KG comparison with calculation of a score | CLOSED RISK | |
Failure to build a prototypical target discovery pipeline on a limited budget in case of mounting technical difficulties | Schedule the project in phases; aim to answer known unknowns and to establish risk mitigation strategies early, in the elaboration phase | |
Some proprietary LLMs may be censored, thus introducing uncontrollable bias in the answers that they produce | Identified and resolved in the LLM sub-team | |
...