...
Define the most common research questions in target discovery and validation. Establish agreement within the project team that these are indeed the core target discovery business questions, and rank-order them by vote according to perceived relative importance. If there are many such questions, pick the top ones, and agree on exactly how many. This paper can be used as a starting point for a list of relevant competency questions: https://www.sciencedirect.com/science/article/pii/S1359644613001542 (Failure to identify business questions, or picking too many or too few, was a project risk)
Open Targets can serve as a publicly available, standardized data source for this use case. Validate that Open Targets either has a ready-to-use Knowledge Graph implementation or can be converted into a KG at reasonable cost (this was a known project risk; data availability for Open Targets as a KG in BioCypher has been established, and this risk is now closed)
Select a Large Language Model engine from publicly accessible sources. (Failure to identify a suitable open LLM was a project risk. The available LLMs were analyzed, and this risk is now closed)
Prompt-tuning procedure:
Retrieval-Augmented Generation (RAG):
Ask a plain-English question, using a prompt-tuned version of one of the questions from the business question collection
This question is converted into a structured query by an LLM; a minimal sketch of this step is given after this list (failure to generate a proper query for a KG database system is a known risk)
Execute this query over a structured, controlled data source (e.g., the Open Targets database)
Convert the raw output of the query into a human-readable answer (by an LLM or by other means)
An expert compares this answer or answers with the expected one(s)
An accuracy metric is computed, and the inputs and outputs are saved in a database (see the second sketch after this list)
Prompt-tune the original plain-text question to maximize output quality
Experiment with different modes of using an LLM (such as LLM agents, query templates, or different representations of the data source schema) to maximize output quality; the query-template mode is sketched after this list as well
Repeat this tuning cycle for all business questions in the collection
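The sketch below illustrates the question-to-query and query-execution steps under stated assumptions: the BioCypher-built Open Targets KG is assumed to be loaded into a local Neo4j instance and queried with Cypher (a SPARQL endpoint would follow the same pattern), with illustrative labels Gene and Disease and an ASSOCIATED_WITH relationship, and the LLM is assumed to be exposed through an OpenAI-compatible endpoint such as a local open-source model behind vLLM or Ollama. All names, connection details, and the schema summary are placeholders, not part of the agreed project design.

# Minimal sketch of the "plain-English question -> structured query -> answer" loop.
# Hypothetical setup: Neo4j holds the BioCypher-built Open Targets KG; the LLM is served
# through an OpenAI-compatible endpoint. Labels, model name, and credentials are placeholders.

from neo4j import GraphDatabase
from openai import OpenAI

LLM = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # local OpenAI-compatible server
NEO4J = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SCHEMA_HINT = (
    "Node labels: Gene(symbol), Disease(name). "
    "Relationship: (Gene)-[:ASSOCIATED_WITH {score}]->(Disease)."
)  # hypothetical one-line schema summary handed to the LLM

def text_to_cypher(question: str) -> str:
    """Ask the LLM to translate a business question into a single Cypher query."""
    prompt = (
        f"Knowledge graph schema:\n{SCHEMA_HINT}\n\n"
        f"Write one Cypher query (no commentary, no code fences) answering: {question}"
    )
    reply = LLM.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return reply.choices[0].message.content.strip().strip("`")  # assumes the model returns bare Cypher

def run_query(cypher: str) -> list[dict]:
    """Execute the generated Cypher against the Open Targets KG."""
    with NEO4J.session() as session:
        return [record.data() for record in session.run(cypher)]

def verbalise(question: str, rows: list[dict]) -> str:
    """Turn the raw query result back into a human-readable answer."""
    reply = LLM.chat.completions.create(
        model="local-model",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nQuery result: {rows}\nAnswer in one short paragraph.",
        }],
    )
    return reply.choices[0].message.content

if __name__ == "__main__":
    question = "Which genes are most strongly associated with asthma?"
    cypher = text_to_cypher(question)
    print(verbalise(question, run_query(cypher)))

Keeping query generation, execution, and verbalisation as separate functions makes it straightforward to swap any one of them (for example, templates instead of free-form generation) during the tuning cycle.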
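A second sketch covers the evaluation step: comparing the generated answer with the expert's expected answer, computing a simple score, and saving the inputs and outputs in a database. The difflib-based similarity below is only a stand-in for whatever accuracy metric the team agrees on, and SQLite is used purely for illustration.

# Minimal evaluation sketch: score a generated answer against the expert's expected answer
# and log every run locally. The metric and storage choices are placeholders.

import sqlite3
from difflib import SequenceMatcher

DB = sqlite3.connect("prompt_tuning_runs.db")
DB.execute(
    "CREATE TABLE IF NOT EXISTS runs ("
    "question TEXT, cypher TEXT, answer TEXT, expected TEXT, score REAL)"
)

def score_answer(answer: str, expected: str) -> float:
    """Crude lexical similarity in [0, 1]; a stand-in for the project's accuracy metric."""
    return SequenceMatcher(None, answer.lower(), expected.lower()).ratio()

def log_run(question: str, cypher: str, answer: str, expected: str) -> float:
    """Store one prompt-tuning iteration and return its score."""
    score = score_answer(answer, expected)
    DB.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
        (question, cypher, answer, expected, score),
    )
    DB.commit()
    return score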
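A third sketch illustrates the query-template mode mentioned in the list: instead of free-form query generation, the LLM only fills the parameters of a vetted Cypher template, which reduces the risk of malformed queries. The template, labels, and parameter names are again illustrative only.

# Query-template mode: the Cypher stays fixed and reviewed; the LLM (or simpler parsing)
# only supplies the parameters. Reuses the hypothetical schema and NEO4J driver above.

TEMPLATE = (
    "MATCH (g:Gene)-[a:ASSOCIATED_WITH]->(d:Disease) "
    "WHERE toLower(d.name) = toLower($disease) "
    "RETURN g.symbol AS gene, a.score AS score "
    "ORDER BY score DESC LIMIT $top_n"
)

def fill_template(disease: str, top_n: int = 10) -> tuple[str, dict]:
    """Return the fixed template plus parameters; in the full pipeline the LLM would
    extract `disease` and `top_n` from the plain-English question."""
    return TEMPLATE, {"disease": disease, "top_n": top_n}

# Usage with the driver from the first sketch:
#   cypher, params = fill_template("asthma")
#   with NEO4J.session() as session:
#       rows = [record.data() for record in session.run(cypher, params)]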
...
Description | Mitigation | Responsible Party |
Failure to identify business questions, or picking too many or too few | Draft appropriate business questions - DONE; however, not all business questions can be answered with the specific technologies under evaluation, so this factor must be taken into account | Lee Harland, John Wise, Bruce Press, Peter Revill |
Failure to obtain a ready-to-use Knowledge Graph implementation of Open Targets, or to convert Open Targets into a KG at reasonable cost | CLOSED RISK: data availability for Open Targets as a KG in BioCypher has been established | The Hyve; Open Targets/EBI: Jon Stevens, Etzard Stolte, Helena Deus; Brian Evarts; Wouter Franke, Matthijs van der Zee |
Failure to generate a proper query for a KG database system by an LLM | Technology research; yes, in general this is feasible | The Hyve; Open Targets/EBI: Sebastian Lobentanzer, Ellen McDonagh |
Failure to download a large volume of data (all of PubMed as a maximum) for the prompt-tuning of the LLM | CLOSED RISK: this is not necessary | |
Failure to perform local KG comparison with calculation of a score | CLOSED RISK: this is not necessary; the output can be compared manually | |
Failure to build a prototypical target discovery pipeline on the limited budget in case of mounting technical difficulties | CLOSED: the project is scheduled in phases, aiming to answer known unknowns and to establish risk mitigation strategies early | It is not yet known whether a product will be built; for now the scope is focused on the technology analysis for the POC |
Some proprietary LLMs may be censored, thus introducing uncontrollable bias in the answers they produce | CLOSED: censorship effects may already be reflected in the performance scores, so this is accounted for in the comparison of the LLMs; however, the team prefers open-source and uncensored LLMs | Identified and resolved in the LLM sub-team |
...
2024.02.07 Recording (Passcode: L58@v7Dg) Slides | Slides from the talk by Sebastian Lobentanzer
2024.03.20 Recording (Passcode: LZ!jZT4z) Slides | Architecture diagram in Draw.io | Architecture diagram PNG file
2024.04.17 Recording (Passcode: Yn2!5qJK) Slides | Slides from the talk by Jon Stevens
2024.08.07 Recording (Passcode: %.1&ukfM) Slides | Includes a talk by Peter Dorr: SPARQL query code generation with LLMs
2024.09.04 Recording (Passcode: t3?B*?CX) Slides | Includes a talk by Oleg Stroganov on agents controlling the actions of LLMs | Slides from the talk by Oleg Stroganov
2024.11.12 Email communication: Slides from the report by Oleg Stroganov
2024.11.20 Recording (Passcode: EE!C54u#) Slides | Slides by Oleg Stroganov with an update
2024.12.04 Recording (Passcode:E.?p#b$9) Slides | Slides by Oleg Stroganov with an update
2024.12.18 Recording (Passcode: $O9uxYXy) Slides | Slides by Oleg Stroganov with an update
...