...

Define the most common research questions in target discovery and validation. Establish an agreement between the project team that these are indeed the core target discovery business questions, and rank order them by vote by perceived relative importance. If such questions are many, pick the top ones. Establish an agreement on how many exactly. One can use this paper as a starting point for listing of relevant competency questions: https://www.sciencedirect.com/science/article/pii/S1359644613001542 (Failure to identify business questions , or picking too many or too few is was a project risk)
Open Targets can serve as a publicly available standardized data source for this use case. Validate that Open Targets either has a ready to use Knowledge Graph implementation, or can be converted into a KG with reasonable cost (this is known project riskwas a project risk; we established data availability for the Open Targets as a KG in BioCypher, and this risk is now closed)
Select a Large Language Model engine from publicly accessible sources. (Failure to identify a suitable open LLM is was a project risk. The available LLMs were analyzed and this risk is now closed)

Prompt-tuning procedure:

Retrieval Augmented Generation (RAG):
- Ask plain English question using prompt-tuned version of one of the questions from the business questions collection
- This question is converted into a structured query by an LLM (Failure to generate a proper query for a KG database system is a risk)
- Execute this query over a structured controlled data source (e.g. Open Targets DB)
- Convert raw output of the query into human-readable input (by an LLM or by other means)
- An expert compares this answer or answers with the expected one(s)
An accuracy metric is computed + inputs and outputs saved in some DB
Prompt-tune the opening plain text question to maximize the output quality
Experiment with different modes of using an LLM (such as LLM agents, or query templates, or different representations of the data source schema) to maximize the output quality
Repeat this tuning cycle for all business questions in the collection

Historic (obsolete) versions of the above steps:

Historic vision for the query (this is obsolete and is only preserved here for scope traceability):
- Define a plain text data source for mining; one of the choices can be the entire set of paper abstracts indexed in PubMed plus perhaps the entire collection of open-source papers. (Failure to download this large volume of data can be a risk)
Ask a plain English question from the collection of business questions identified above
An LLM uses the question and the data source to produce a human-readable answer
Historic vision for the QA (likely not necessary):
- Either same or some other LLM converts this answer to a Knowledge Graph. (KG generation from text is a source of risk)
- This answer Knowledge Graph is compared to the KG of the original data source (such as Open Targets). This comparison must be local; in other words, irrelevant sections of the larger knowledge graph should not be considered. (Ability to find a ready KG comparison algorithm or to code it fresh is a risk).
An accuracy metric is computed + inputs and outputs saved in some DB
Prompt-tune the opening plain text question to maximize the output quality
Repeat this tuning cycle for all business questions in the collection
Use these optimized prompts in RAG, below

Retrieval Augmented Generation (RAG):

Ask plain English question using prompt-tuned version of one of the questions from the business questions collection
This question is converted into a structured query by an LLM (Failure to generate a proper query for a KG database system is a risk)
Execute this query over a structured controlled data source (e.g. Open Targets DB)
Convert raw output of the query into human-readable input by an LLM
Quality assurance: it is desirable to produce a few instances of questions with known answers, ideally, by human experts, and then use these question-answer pairs as sanity checks of the RAG pipeline

Not in scope:

Proprietary data are not used to train any model and project participants are not asked to share proprietary data in any way. But proprietary data may be analyzed in the context of this set-up by the individual participants using private instances of the pipeline.
Training of a brand new LLM is out of scope. The plan is to only prompt-tune an existing LLM.

...

Phase	Milestones	Deliverables	Est Date
Initiation	Project charter	A list of candidate pre-competitive projects One or more projects selected by vote Project charter is drafted for the winning idea Raise minimal funds for the Elaboration phase	12/11/23 (Complete)
Elaboration	Development plan Cost estimates	Risks analysis – see Risk Registry below Technology analysis to address the identified risks Work Breakdown Structure (WBS) Cost estimates Time estimates Gantt Chart for Construction with additional iterations as needed and a work schedule Make feasibility decisions before committing to build	Q1-Q3 2024 (Complete)
Construction Target discovery pipeline (POC)	Learn best ways to use LLMs for NL queries Lessons learned published	Results of experiments Lessons learned recorded and published	Q4 2024 (Almost complete in December 2024)
Construction (PROD)	Target discovery pipeline	Target discovery pipeline – detailed deliverables are not yet known Lessons learned recorded and published to be specified by research project sponsors	TBD
Transition	Sustainability achieved	Place the prototype into maintenance mode or outsource for continuous development by another organization (e.g. non-profit) Plan extension work, if any	TBD

Risk Registry

Risks in green are resolved

Risks in yellow are in active research

Risks in white are general in nature

BioCypher
Can we also use Open Targets native API https://platform.opentargets.org/api OR https://platform-docs.opentargets.org/data-access/google-bigquery
Need to record selection criteria and rank the technologies on these criteria
Need to produce an architecture diagram

Technology research, feature and cost analysis, and selection

Technology research

If no ready-to-use technology exists, estimate bespoke development

If estimates indicate infeasibility, this may become a gap

in this phase (“project elaboration”)

Description	Mitigation	Responsible Party
Failure to identify business questions, or picking too many or too few	Draft appropriate business questions - DONE; but not all business questions can be answered with specific technologies, so must take this factor into account	Lee Harland, John Wise, Bruce Press, Peter Revill
Validate that Open Targets either has a ready to use Knowledge Graph implementation, or can be converted into a KG with reasonable cost	Two competing, not fully compatible platforms: Neo4j/Cypher and GraphDB/SPARQL BioCypher (Neo4j stack) Metaphacts (SPARQL stack) Sponsors prefer the Neo4j OT partly imported into BioCypher and will be imported into Metaphacts soon Requirements and technology mapping to these reqs Architecture diagram linkedlifedata.com	The Hyve Jordan Ramsdell Robert Gill Brian Evarts Open Targets/EBI: Sebastian Lobentanzer Ellen McDonagh
Failure to identify a suitable LLM		See this comparison Recommend to focus on the Cypher query generation ability as the key risk (below) Start with one open-source and one closed-source LLMs (say Mistral and GPT 4) and agree to explore others later, and meanwhile close this risk	Jon Stevens, Etzard Stolte, Helena Deus; Brian Evarts; Wouter Franke, Matthijs van der Zee	Does Open Targets use an ontology?	Perhaps The Hyve team has a ready answer	Failure to download a large volume of data (all of the PubMed as a maximum) for the prompt-tuning of the LLM	TBD	Failure to perform KG generation from text by an LLM	Technology research If no ready-to-use technology exists, estimate bespoke development (tuning an existing LLM for this purpose)	Failure to perform local KG comparison with calculation of a score
Failure to generate a proper query for a KG database system by an LLM	Technology research. See refs 7, 8, 13, 14 below BioCypher by EBI may have this capability already - needs evaluation	The Hyve Jordan Ramsdell Robert Gill Brian Evarts Open Targets/EBI: Sebastian Lobentanzer Ellen McDonagh
Does Open Targets use an ontology?	Yes in general	The Hyve
Failure to download a large volume of data (all of the PubMed as a maximum) for the prompt-tuning of the LLM	CLOSED RISK This is not necessary
Failure to perform local KG comparison with calculation of a score	CLOSED RISK This is not necessary - we can compare the output manually
Failure to build a prototypical target discovery pipeline on the limited budget in case of mounting technical difficulties	CLOSED: Schedule the project in phases. Aim to answer known unknowns and to establish risk mitigation strategies early	.	It is not yet known whether a product will be built. For now the scope is focused on the technology analysis for the POC
Some proprietary LLMs may be censored, thus introducing uncontrollable bias in the answers that they produce	DONE: Strong preference for an CLOSED. Censorship may already be included in the performance scores, so this is taken care of in the comparison of the LLMs. However, there is team preference for open-source , and uncensored LLMs	Identified and resolved in the LLM sub-team

...

Stakeholder mailing list in Google Groups: https://groups.google.com/a/pistoiaalliance.org/g/llm-projectMS Teams: Large Language Models | General | Microsoft Teams

Meetings

Every other week at 8 am PST (= 11 am EST = 4 pm London = 5 pm Berlin) starting on January 10th, 2024

2024.01.10 Recording (Passcode: P#69H5dm) Slides
2024.01.24 Recording (Passcode: t!O5T38a) Slides
2024.02.07 Recording (Passcode: L58@v7Dg) Slides | Slides from the talk by Sebastian Lobentanzer
2024.02.21 Recording (Passcode: B59W3wT+) Slides
2024.03.13 Recording (Passcode: f8B#zunH) Slides
2024.03.20 Recording (Passcode: LZ!jZT4z) Slides | Architecture diagram in Draw.io | Architecture diagram PNG file PNG file
2024.04.03 Recording (Passcode: mSH4#u2%) Slides
2024.04.17 Recording (Passcode: Yn2!5qJK) Slides | Slides from the talk by Jon Stevens
2024.05.01 Recording (Passcode: 54MvxsP#) Slides
2024.05.15 Recording (Passcode: rU#y91m@) Slides
2024.05.29 Recording (Passcode: c3df=mWx) Slides
2024.07.09 Recording (Passcode: LY=QRI9H) Slides
2024.07.24 Recording (Passcode: G36*B=Qv) Slides
2024.08.07 Recording (Passcode: %.1&ukfM) Slides | Includes a talk by Peter Dorr: SPARQL query code generation with LLMs
2024.09.04 Recording (Passcode: t3?B*?CX) Slides | Includes a talk by Oleg Stroganov on agents controlling the actions of LLMs | Slides from the talk by Oleg Stroganov
2024.09.18 Recording (Passcode: #m5#8$V1) Slides
2024.10.02 Recording (Passcode: j2nT#H3. ) Slides
2024.10.16 Recording (Passcode: z&W8bGWL) Slides | Slides by Oleg Stroganov with an update
2024.10.30 Recording (Passcode: 2wJVC=?r) Slides | 2024.10.26 Rancho Bioscience update
2024.11.06 Recording (Passcode: @A4H&P1D) Slides | Slides by Oleg Stroganov with an update
2024.11.19 Email communication: Slides from the report by Oleg Stroganov
2024.11.20 Recording (Passcode: EE!C54u#) Slides
2024.12.04 Recording (Passcode:E.?p#b$9) Slides | Slides by Oleg Stroganov with an update
2024.12.18 Recording (Passcode: $O9uxYXy) Slides | Slides by Oleg Stroganov with an update

Github

https://github.com/PistoiaAlliance/LLM

Final Report

2024.12.31 Slides | Instructions for files upload into a Neo4j instance

Lessons Learned

Modern LLMs have sufficient knowledge of biology embedded in them to be able to answer almost any question we (humans) can think about. This is a source of problems: hallucinations are indistinguishable from true answers; we cannot fully test the innate ability of the LLMs to translate the natural language questions into structured queries (unless we obscure the terms with synonyms unknown to the LLM).
The highest risk item is generation of the structured query (Cyphrer or SPARQL) from a plain English request. Some publications estimate success rate of about 48% on the first attempt.
The structure of the database used for queries matters. LLMs can easier produce meaningful structured queries for databases with flat, simple structure.
The form of the prompt matters. LLMs can easier produce meaningful answers from prompts that resemble a story, rather than a dry question, even if the details of the story are irrelevant to the main question asked.
Practically useful system requires filtering or secondary mining of output in addition to natural language narration.
It is extremely important to implement a reliable named entity recognition system. The same acronym can refer to completely different entities, which can be differentiated either from the context (hard) or by asking clarifying questions. Must also map synonyms. Without these measures naïve queries in a RAG environment will fail.
LLMs may produce different structured queries starting from the same natural language question. These queries may be semantically and structurally correct, but may include assumptions on the limit of the number of items to return, or order, or lack of these. These variations are not deterministic. As a result on different execution rounds the same natural language may result in different answers. It is necessary to explicitly formulate the limits, order restrictions, and other parameters when asking the question, or to determine the user’s intentions in a conversation with a chain of thought. A question related to this topic, is whether specifics in the implementation of usual RAG models with a vector database may introduce implicit restrictions on what data is explored by the LLM and what data is not, and thus artificially limit the answers. This may be happening without the user knowing the restrictions (and perhaps even without the system’s authors knowing that they introduced such restrictions embedded in the specifics of the system architecture).
Need for an API standard.
There is no good biological test-set for LLM evaluation
Existing test sets are saturated
Background knowledge contaminates the results
Frontier models have ~100% biological background knowledge, which makes evaluation of cypher query generation difficult as these models can bypass cypher queries generation and hallucinate correct results
Models need an independent way for entity resolution as KG may not have specific synonyms provided by user
Small models (mistral, llama 13B etc) underperform; even their adapters / fine tuned versions trained on cypher generation fail on OT KG
Providing automatically-generated graph schema doesn't really help for OT KG
Graph Schema matters
OT KG has non-trivial schema that makes LLMs very confused
Template-based strategies achieve 100% performance in current evaluation, but require hard-coded templates for information retrieval
Agent-based strategies achieve 83-98% performance, and do not require prior knowledge to work with KGs
DSPy framework could be used to optimize prompts and increase success rate
Future directions:
Development of a better test-set that could be used to improve LLM ability to generate cypher queries for biological knowledge
Further improvement of performance of agent-based approach by using long-term history (generation of templates)
Introducing variation into KG schema as a parameter for testing. Using KGs with strict schema and KGs that are automatically extracted from literature to test how reliable strategies are
Introducing entity resolution to improve information retrieval

References

https://www.sciencedirect.com/science/article/pii/S1359644613001542
https://www.nature.com/articles/s41573-020-0087-3
https://www.epam.com/about/newsroom/press-releases/2023/epam-launches-dial-a-unified-generative-ai-orchestration-platform
https://epam-rail.com/open-source
Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Chatbot Arena: https://chat.lmsys.org/?arena
Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning
https://arxiv.org/abs/2310.01061
Knowledge-Consistent Dialogue Generation with Language Models and Knowledge Graphs
https://openreview.net/forum?id=WhWlYzUTJfP&source=post_page-----97a4cf96eb69--------------------------------
BioChatter Benchmark Results: https://biochatter.org/benchmark-results/#biochatter-query-generation
MBET Benchmark (embeddings) https://huggingface.co/spaces/mteb/leaderboard
Lora-Land and Lorax: https://predibase.com/lora-land
A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases. Summary: queries over a KG with GPT 4 are much more accurate than queries over a SQL database with GPT 4. https://arxiv.org/abs/2311.07509
https://towardsdatascience.com/evaluating-llms-in-cypher-statement-generation-c570884089b3
https://medium.com/neo4j/enhancing-the-accuracy-of-rag-applications-with-knowledge-graphs-ad5e2ffab663
linkedlifedata.com
Kazu - Biomedical NLP Framework: https://github.com/AstraZeneca/KAZU
https://github.com/f/awesome-chatgpt-prompts/tree/main
Zhou, L., Schellaert, W., Martínez-Plumed, F. et al. Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024). https://doi.org/10.1038/s41586-024-07930-y
Karthik Soman, Peter W Rose, John H Morris, Rabia E Akbas, Brett Smith, Braian Peetoom, Catalina Villouta-Reyes, Gabriel Cerono, Yongmei Shi, Angela Rizk-Jackson, Sharat Israni, Charlotte A Nelson, Sui Huang, Sergio E Baranzini, Biomedical knowledge graph-optimized prompt generation for large language models, Bioinformatics, Volume 40, Issue 9, September 2024, btae560, https://doi.org/10.1093/bioinformatics/btae560
https://www.promptingguide.ai/
References on Named Entity Recognition in biological sciences: Pubmed
Incremental Knowledge Graphs Constructor Using Large Language Models

Version	Old Version 16	New Version 70
Changes made by	Vladimir Makarov	Vladimir Makarov
Saved on	Mar 22, 2024	Jan 02, 2025

Versions Compared

Key

Not in scope:

Risk Registry

Meetings

Github

Final Report

Lessons Learned

References

Content Comparison

Versions Compared

Key

Not in scope:

Risk Registry

Meetings

Github

Final Report

Lessons Learned

References