Benchmarks for LLMs in Biological R&D
We wish to explore the use of Large Language Models (LLMs) for biological research, and to create benchmarks and tools that objectively assess the ability of LLMs to interpret scientific questions and to mine scientific databases using natural language.
Problem Statement
The pharma/biotech industry is actively experimenting with Natural Language to Query Language translation (NL2QL), “AI co-scientist”, and Scientific Chat applications. Most recently, the Pistoia Alliance completed an investigation into the best strategies for using LLMs for data mining in natural language. One key finding of that study is the lack of appropriate benchmarks for assessing all steps in the NL data mining process. The lack of appropriate test sets complicates tool development for NL data mining. The proposed project aims to close this gap. The Pistoia Alliance will serve as a neutral party in organizing benchmark development and maintenance.
Project objectives:
Find or construct benchmarks that enable assessment of the true performance of LLM natural language query assistants at each of the four stages of the NL data mining process:
Understanding the question
Recognition of named entities, synonyms, and disambiguation of terms
Building the structured query
Assessment of the overall answer quality
Measure and assure the quality of the identified or newly created benchmarks
Report the findings and lessons learned
Value Proposition and Expected Results
We expect project stakeholders to receive the following benefits:
High-level:
Pharma company users can make better tool selection decisions based on the objective evaluation of technologies
Technology vendors can better plan product improvements
Understanding of the best practices and quality standards for the creation of benchmarks
This benefit extends beyond this specific use case
A process for creation of community-supported benchmarks at the Pistoia Alliance
This benefit is also broadly applicable
Enhance the best practices in Natural Language data mining
The overall quality and speed of drug discovery R&D may be improved
Technical:
Understanding of the current state of quality assessment and benchmarking for AI applications used in Natural Language data mining
A set of benchmarks that cover steps in the Natural Language data mining process that currently do not have appropriate benchmarks
Publications
Alignment with the Pistoia Alliance Strategic Priorities
This project is part of the Artificial Intelligence at Scale strategic priority.
Project Scope
In scope:
Review of the already existing or proposed benchmarks for the four steps in the NL data mining process. Although we reasonably believe that this is a scientific gap, learning from earlier attempts should be instructive
White paper describing the problem space and the proposed solution (optionally more than one white paper, or an academic paper)
A set of scientific benchmarks for each of the four listed steps in NL data mining. Each of these should contain suitable test sets, statistical evaluation metrics and cutoffs, an assessment against benchmark quality criteria, and recommendations for updates
A plan for long-term maintenance and evolution of the proposed benchmarks
Not in scope:
Training of custom Large Language Models
Development of novel NL data mining software or “AI co-scientist” systems
Benchmarks for other use cases beyond Natural Language data mining
Development of federated or secure compute benchmark execution systems
Change of Scope:
Instead of re-using test questions from the old “static” benchmarks, create a “dynamic” benchmark that can be updated automatically with fresh questions based on recent data. Ideally, the tests run on it should also be evaluated automatically by fault detection agents.
Success Measures:
Success is defined as the achievement of the specific aims:
Existing benchmarks for each of the four steps in the NL query life cycle are reviewed, quality assessed, and catalogued [DONE]
Gaps in the existing benchmarks are identified [DONE]
Benchmarks for these gaps are proposed, constructed, and assessed against the published quality criteria [IN PROGRESS]
A set of POC experiments where public LLMs are assessed against the proposed new benchmarks
Final report published for the benefit of the PA AI Community. Final report shall include long-term benchmark evolution and maintenance plan [IN PROGRESS]
Code and data for any newly proposed benchmarks deposited to Github
Project Phases and Milestones
RUP Phase | Milestones | Deliverables | Est Date |
Initiation | Project charter | | 10/30/2025 (DONE) |
Elaboration | Account of the existing benchmarks and gaps not covered by them | n/a | 12/31/2025 (DONE LATE) |
Construction | Proposal for the benchmarks that are needed | Interim report to the Steering Committee, optional white paper | 3/31/2026 (DONE) |
Construction | Assessment of some (which exactly TBD) AI models against the proposed benchmarks. Changed scope: proposal for a “live” benchmark | | Aug 31, 2026 (delayed 2 months due to scope change; funds and resources are OK) |
Transition | Plan for the long-term maintenance of the benchmarks | | Dec 31, 2026 |
Risk Registry
Risks in green are resolved
Risks in yellow are in active research
Risks in white are general in nature
Description | Mitigation | Resolution OR Responsible Party |
Inability to create benchmarks of high quality that inherently reduces the value of such benchmarks | | Project manager |
All benchmarks that we seek already exist, the problem is solved, and therefore our project is redundant | Extensive literature review as the first activity. It will indicate which of the four steps in the NLQ work cycle are lacking proper benchmarks. Even if all parts of the work cycle have benchmarks, we can document them in one place, which by itself has value. | Our literature review indicates that this is not so, and that old benchmarks even for established processes like Named Entity Recognition are deficient. |
Very quick technology development may make test cases and lessons learned from them obsolete - HIGH probability risk | | The team is thinking about a dynamic set of benchmarks similar to CASP but with a faster update cycle, and with automated question generation. This minimizes the risk of obsolescence. |
Lack of data for benchmarks | If there is a lack of data for a specific benchmark topic (e.g. a specific disease area), find another topic that has more suitable/accessible data. | The team proposed using ClinicalTrials.gov and Open Targets as data sources. These data sources are quite rich and are frequently updated. Therefore this risk is resolved. |
Inability to maintain the resulting benchmark product after the active phase of the project is complete - HIGH probability risk | | Project manager |
Project Stakeholders
Sponsors:
Lars Greiffenberg, AbbVie
Raul Rodriguez-Esteban, Roche
TBD, Genentech
TBD, Merck
Project Participants:
Stakeholder mailing list in MS Teams: Large Language Models | General | Microsoft Teams
Meetings
2026.05.28 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.05.14 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.04.30 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.04.16 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.04.02 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.03.19 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.03.05 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.02.19 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.02.05 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.01.22 Recording (only open to members on the LLM MS Team) Slides: Summary:
2026.01.08 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.12.18 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.12.05 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.11.18 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.11.04 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.10.27 Recording (only open to members on the LLM MS Team) Slides: Summary:
Github
https://github.com/PistoiaAlliance/LLM
2026.03.17 Interim Status Report
2026.06.01 Interim Status Report
The team evaluated the existing benchmarks and found that they are not fit for purpose. The greatest deficiency is that most of the examples we saw are "static", with infrequent updates. This carries the risk that the benchmark material is used for LLM training, after which the test results can no longer be trusted. It also means that the benchmark material is dated relative to bleeding-edge science, and hence of lesser value to practicing scientists. Another problem is that most of the existing benchmarks address the recall of information from LLM memory, not the ability of AI systems to interpret NL questions, plan the query strategy, and extract the relevant data. The latter is critical for the correct functioning of the emerging "AI co-scientist" systems.
We also learned that our scenario, the querying of structured data sources, is considered hard by the builders of LLM benchmarks, and that the current performance of LLMs in this scenario is poor. This calls into question the fitness for purpose of the entire "AI co-scientist" class of software.
The team decided that the way forward is to build a "dynamic" benchmark that draws on the latest data from selected sources (thus limiting the risk that top LLMs have learned from this data by the time of testing) and generates questions automatically.
This is a scope change relative to the initial vision for the project. The timeline shifted to later dates, but there is no risk to the project itself.
These conclusions were recorded in part in the March interim report (attached), and we are also planning a post (white paper) on the PA web site. I may draft it this week. The report in the PPT file is quite technical; I will make the white paper easier to read for those not deeply involved in these topics.
The team considered limiting the effort to just recording the requirements and the lessons learned, but since we have saved a lot of funding we may continue and work on a prototype.
As practical steps towards prototyping, we:
Selected Open Targets as the example data source.
Additional non-public data sources from technology vendors, such as the Clarivate database or CiteAb, may also be considered. This is being discussed right now.
Established an understanding with the Open Targets leadership on the time stamping of the data elements in the database (recall that we want to pick the latest ones).
Realized that for any structured data source, only a limited number of high-level query types is possible. These query types can be enumerated and then used as templates for the automated generation of specific test questions. This can be easily scripted, at which point a prototype benchmark generator based on Open Targets can be used to test LLMs and agentic "co-scientist" systems.
We are exploring the possibility of using pre-release Open Targets data, which would make the testing even stricter. Since Open Targets only shares such data with a select group of its pharmaceutical industry partners, the testing itself may have to take place inside the firewall of the respective company that has access, but the results may be published.
We are exploring the possibility of automated test result evaluation and agentic fault detection. This is optional but highly desired; we have only just started looking into the relevant literature.
Established an understanding with the potential academic partners at the BioASQ organization. The proposal is to publish our benchmarking questions as a separate challenge for their periodic testing events. This may also serve as a mechanism for the future sustainability of this work.
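The template idea from the practical steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Association record, its field names, the two query templates, and the placeholder entity names are all hypothetical and do not reflect the actual Open Targets schema or API. The sketch keeps only records time-stamped after a cutoff date, and asks only about what is new since that date, so that the expected answers can be derived from the fresh records alone.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Association:
    """Hypothetical time-stamped record; NOT the real Open Targets schema."""
    target: str
    disease: str
    added_on: date


# Hypothetical high-level query types, written as question templates.
TEMPLATES = {
    "new_targets_for_disease":
        "Which targets were newly associated with {disease} after {cutoff}?",
    "new_diseases_for_target":
        "Which diseases were newly associated with {target} after {cutoff}?",
}


def generate_questions(records, cutoff):
    """Return (question, expected_answers) pairs built from records newer than cutoff."""
    fresh = [r for r in records if r.added_on > cutoff]

    # Group the fresh records both ways so each template can be filled in.
    by_disease, by_target = {}, {}
    for r in fresh:
        by_disease.setdefault(r.disease, set()).add(r.target)
        by_target.setdefault(r.target, set()).add(r.disease)

    questions = []
    for disease in sorted(by_disease):
        text = TEMPLATES["new_targets_for_disease"].format(
            disease=disease, cutoff=cutoff.isoformat())
        questions.append((text, sorted(by_disease[disease])))
    for target in sorted(by_target):
        text = TEMPLATES["new_diseases_for_target"].format(
            target=target, cutoff=cutoff.isoformat())
        questions.append((text, sorted(by_target[target])))
    return questions


if __name__ == "__main__":
    # Placeholder data only; names and dates are invented for illustration.
    demo = [
        Association("TARGET_A", "DISEASE_X", date(2026, 5, 1)),
        Association("TARGET_B", "DISEASE_X", date(2026, 5, 2)),
        Association("TARGET_C", "DISEASE_Y", date(2025, 1, 1)),
    ]
    for question, answers in generate_questions(demo, date(2026, 1, 1)):
        print(question, "->", answers)
```

Because the expected answers are computed from the same records that fill the templates, regenerating the question set after each data release needs no manual curation. A real generator would need many more query types and a check that each generated question has an unambiguous answer.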
As a conclusion from the above, we realized that the next challenge for the AI co-scientist systems is benchmarking their problem-solving and planning logic. To the best of my knowledge, none of the recently published AI co-scientist systems comes with a fair competitive assessment of its ability in research planning.
Final Report
To be added
Lessons Learned
For benchmarking specifically
Training data contamination is a major risk. Keep benchmark questions private initially to prevent LLMs from training on them; use a public/private split
Existing biomedical NER/linking benchmarks are inadequate (small, leaky, outdated ontologies, miss long tail)
Pharma interest centers on novel/rare entities & combinations, not saturated topics. Hence the desire for dynamic, frequently updated benchmarks whose generation is automated
Hallucinations: LLMs may infer non-existent info from sources (high-order hallucination)
Agentic vs. Single LLMs: Tests should apply to both; agentic systems reduce errors via verification but should face the same benchmarks
General-purpose benchmark performance does not predict domain-specific reliability
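One possible way to implement the public/private split mentioned above is a deterministic, hash-based assignment, so that a question always lands in the same partition regardless of when or where the split is computed. This is a minimal sketch, not the project's chosen method: the function name, the salt, the default fraction, and the question identifiers are all assumptions made for illustration.

```python
import hashlib


def assign_split(question_id: str,
                 private_fraction: float = 0.8,
                 salt: str = "benchmark-v1") -> str:
    """Deterministically assign a question to the 'private' or 'public' set.

    The salt and the default fraction are illustrative assumptions.
    """
    digest = hashlib.sha256(f"{salt}:{question_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "private" if bucket < private_fraction else "public"


if __name__ == "__main__":
    for qid in ("q-0001", "q-0002", "q-0003"):  # hypothetical identifiers
        print(qid, assign_split(qid))
```

Keeping the salt itself private prevents outsiders from predicting which questions will later be released, and changing the salt for each benchmark release reshuffles the partition.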
General
Custom-trained domain-specific LLMs frequently under-deliver vs. cost (e.g., Roche/Genentech experience shared by Etzard)
Hybrid Approach for NLQ systems: Use LLMs for strategic planning (e.g., choosing resources), but rely on predefined APIs for queries to avoid syntax issues
Performance of even the best frontier LLMs is variable, thus they are not interchangeable (observed also in our Phase 1 experiments)
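The hybrid approach listed above can be illustrated with a small dispatcher: the LLM only proposes which predefined query to run and with which arguments, and the system accepts the proposal only if it names a query from a fixed registry. This is a sketch under assumptions, not a description of any existing system: the two query functions, their parameters, and the JSON plan format are hypothetical, and the functions return a description of the call instead of contacting a real API.

```python
import json


def targets_for_disease(disease_id: str) -> dict:
    """Hypothetical predefined query; a real one would call a vetted API client."""
    return {"endpoint": "targets_for_disease", "params": {"disease_id": disease_id}}


def trials_for_condition(condition: str) -> dict:
    """Hypothetical predefined query, stubbed in the same way."""
    return {"endpoint": "trials_for_condition", "params": {"condition": condition}}


# Only queries listed here can ever be executed.
REGISTRY = {f.__name__: f for f in (targets_for_disease, trials_for_condition)}


def execute_plan(plan_json: str) -> dict:
    """Run an LLM-proposed plan only if it names a registered query."""
    plan = json.loads(plan_json)
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    name = plan.get("query")
    if name not in REGISTRY:
        raise ValueError(f"unknown query type: {name!r}")
    # A wrong or missing argument name raises TypeError from the call itself.
    return REGISTRY[name](**plan.get("args", {}))


if __name__ == "__main__":
    # Stand-in for a plan produced by an LLM.
    plan = '{"query": "targets_for_disease", "args": {"disease_id": "DISEASE_X"}}'
    print(execute_plan(plan))
```

Since the LLM never writes query syntax, syntax errors cannot occur by construction; the remaining failure modes (wrong query choice, wrong arguments) are the ones a benchmark for the planning step would need to measure.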
References
NER Benchmarks
BC5CDR (BioCreative V): Focuses on chemical and disease mentions, used for assessing disease-drug relations
JNLPBA (Joint Workshop on Natural Language Processing in Biomedicine and Applications): A standard benchmark for protein, DNA, RNA, cell line, and cell type entities.
NCBI-disease: Focused on disease name recognition and normalization.
BC4CHEMD: Dataset for chemical entity recognition.
BELB (Biomedical Entity Linking Benchmark): A recent, comprehensive framework for standardizing the evaluation of biomedical entity linking across 11 corpora: https://arxiv.org/abs/2410.05046
https://pmc.ncbi.nlm.nih.gov/articles/PMC9931203/ Possibly relevant
i2b2 (Shared Tasks): Frequently used for Clinical NER, particularly in de-identification tasks – not currently available, and may be of limited value to our use case
Database Retrieval Benchmark
DbQA2 from LabBench2: https://drive.google.com/file/d/1BV5UtmBRdpbQoz9jC1AuUF8WUTRQMqK_/view
Papers
https://drive.google.com/file/d/1BV5UtmBRdpbQoz9jC1AuUF8WUTRQMqK_/view (company site: https://lab-bench.ai/ ). The use case most relevant to us is called DbQA2 (see paper text). I note that the authors refer to the same contamination problem that we identified: "...access to more specific and esoteric information within each [database] is measured in order to more specifically measure true data access and avoid training knowledge contaminating results." I wonder whether we can review it in detail and try to re-use any of the questions from this benchmark. Is there anyone who'd volunteer for this task?
https://arxiv.org/pdf/2512.15567 Less relevant use cases, but a similar approach to testing LLMs in scenarios that require action, not just recall. They did not use the NL database query use case. The provided use cases, however, may be of interest to other team members as they are quite common in drug discovery.
Placebo Bench LLM hallucination benchmark: https://www.blueguardrails.com/en/blog/placebo-bench-an-llm-hallucination-benchmark-for-pharma and https://huggingface.co/datasets/blue-guardrails/PlaceboBench
A set of benchmarks for testing of ADMET property prediction methods developed by Therapeutics Data Commons: https://tdcommons.ai/benchmark/admet_group/overview/ . Plus a critical review of it: https://www.biorxiv.org/content/10.64898/2026.02.26.708193v1. We can derive some best practices from this publication.
AI Co-scientist Systems
Ghareeb, A.E., Chang, B., Mitchener, L. et al. A multi-agent system for automating scientific discovery. Nature (2026). https://doi.org/10.1038/s41586-026-10652-y
Aygün, E., Belyaeva, A., Comanici, G. et al. An AI system to help scientists write expert-level empirical software. Nature (2026). https://doi.org/10.1038/s41586-026-10658-6
Gottweis, J., Weng, WH., Daryin, A. et al. Accelerating scientific discovery with Co-Scientist. Nature (2026). https://doi.org/10.1038/s41586-026-10644-y