Benchmarks for LLMs in Biological R&D

We wish to explore the use of Large Language Models (LLMs) for biological research, and to create benchmarks and tools that objectively assess how well LLMs interpret scientific questions and mine scientific databases from natural-language queries.

Problem Statement

The pharma/biotech industry is actively experimenting with Natural Language to Query Language translation (NL2QL), “AI co-scientist”, and Scientific Chat applications. Most recently, the Pistoia Alliance completed an investigation into the best strategies for using LLMs for data mining in natural language. One key finding of that study is the lack of appropriate benchmarks for assessing all steps of the NL data mining process. The lack of appropriate test sets complicates tool development for NL data mining. The proposed project aims to close this gap. The Pistoia Alliance will serve as a neutral party in organizing benchmark development and maintenance.

Project objectives:

  • Find or construct benchmarks that enable assessment of the true performance of LLM natural language query assistants on each of the four steps in the NL data mining process:

    • Understanding the question

    • Recognition of named entities, synonyms, and disambiguation of terms

    • Building the structured query

    • Assessment of the overall answer quality

  • Measure and assure the quality of the identified or newly created benchmarks

  • Report the findings and lessons learned

Value Proposition and Expected Results

We expect that the project stakeholders will receive these benefits from it:

High-level:

  • Pharma company users can make better tool selection decisions based on the objective evaluation of technologies

  • Technology vendors can better plan product improvements

  • Understanding of the best practices and quality standards for the creation of benchmarks

    • This benefit extends beyond this specific use case

  • A process for creation of community-supported benchmarks at the Pistoia Alliance

    • This benefit is also broadly applicable

  • Enhance the best practices in Natural Language data mining

  • The overall quality and speed of drug discovery R&D may be improved

Technical:

  • Understanding of the current state of quality assessment and benchmarking for AI applications used in Natural Language data mining

  • A set of benchmarks covering the steps of the Natural Language data mining process that currently lack appropriate benchmarks

  • Publications

Alignment with the Pistoia Alliance Strategic Priorities

This project is part of the Artificial Intelligence at Scale strategic priority.

Project Scope

In scope:

  • Review of the already existing or proposed benchmarks for the four steps in the NL data mining process. Although we reasonably believe that this is a scientific gap, learning from earlier attempts should be instructive

  • White paper describing the problem space and the proposed solution (optionally more than one white paper, or an academic paper)

  • A set of scientific benchmarks for each of the four listed steps in the NL data mining process. Each of these should contain suitable test sets, statistical evaluation metrics and cutoffs, an assessment against benchmark quality criteria, and recommendations for updates

  • A plan for long-term maintenance and evolution of the proposed benchmarks

Not in scope:

  • Training of custom Large Language Models

  • Development of novel NL data mining software or “AI co-scientist” systems

  • Benchmarks for other use cases beyond Natural Language data mining

  • Development of federated or secure compute benchmark execution systems

Change of Scope:

Instead of re-using test questions from the old “static” benchmarks, create a “dynamic” benchmark that can be updated automatically with fresh questions based on recent data. Ideally, tests run against it should also be evaluated automatically by fault detection agents.

Success Measures:

Success is defined as the achievement of the specific aims:

  1. Existing benchmarks for each of the four steps in the NL query life cycle are reviewed, quality assessed, and catalogued [DONE]

  2. Gaps in the existing benchmarks are identified [DONE]

  3. Benchmarks for these gaps are proposed, constructed, and assessed against the published quality criteria [IN PROGRESS]

  4. A set of POC experiments where public LLMs are assessed against the proposed new benchmarks

  5. Final report published for the benefit of the PA AI Community. Final report shall include long-term benchmark evolution and maintenance plan [IN PROGRESS]

  6. Code and data for any newly proposed benchmarks deposited on GitHub

Project Phases and Milestones

| RUP Phase | Milestones | Deliverables | Est Date |
| --- | --- | --- | --- |
| Initiation | Project charter | 1. Activity charter; 2. Activity plan | 10/30/2025 (DONE) |
| Elaboration | Account of the existing benchmarks and gaps not covered by them | n/a | 12/31/2025 (DONE LATE) |
| Construction | Proposal for the benchmarks that are needed | Interim report to the Steering Committee, optional white paper | 3/31/2026 (DONE) |
| Construction | Assessment of some (which exactly TBD) AI models against the proposed benchmarks. Changed scope: proposal for a “live” benchmark | 1. White paper; 2. Prototype benchmark and question generator based on the Open Targets data | Aug 31, 2026 (delayed 2 months due to scope change; funds and resources are OK) |
| Transition | Plan for the long-term maintenance of the benchmarks | 1. Code and data package in GitHub; 2. Final report; 3. Long-term sustainability agreement(s) | Dec 31, 2026 |

Risk Registry

Risks in green are resolved

Risks in yellow are in active research

Risks in white are general in nature

| Description | Mitigation | Resolution OR Responsible Party |
| --- | --- | --- |
| Inability to create benchmarks of high quality, which inherently reduces the value of such benchmarks | 1. Early in the project, learn and implement the best practices, e.g. https://hai.stanford.edu/policy/what-makes-a-good-ai-benchmark . An assessment of the benchmarks that we identify at the literature exploration stage against Stanford (or similar) quality criteria may be good content for the future white paper and a good learning experience for the team. 2. Documentation consistent with #1. 3. Iterative development of benchmarks with quality assessment sessions between rounds. | Project manager |
| All benchmarks that we seek already exist, the problem is solved, and therefore our project is redundant | Extensive literature review as the first activity. It will indicate which of the four steps in the NLQ work cycle lack proper benchmarks. Even if all parts of the work cycle have benchmarks, we can document them in one place, which by itself has value. | Our literature review indicates that this is not so, and that old benchmarks, even for established processes like Named Entity Recognition, are deficient. |
| Very quick technology development may make test cases and lessons learned from them obsolete - HIGH probability risk | 1. Maintain an ongoing literature review activity. 2. Iterative development of benchmarks. | The team is thinking about a dynamic set of benchmarks similar to CASP but with a faster update cycle, and with automated question generation. This minimizes the risk of obsolescence. |
| Lack of data for benchmarks | If there is a lack of data for a specific benchmark topic (e.g. a specific disease area), find another topic that has more suitable/accessible data. | The team proposed using ClinicalTrials.gov and Open Targets as data sources. These data sources are quite rich and are frequently updated. Therefore this risk is resolved. |
| Inability to maintain the resulting benchmark product after the active phase of the project is complete - HIGH probability risk | 1. Explore the ability and desire of project sponsors to extend funds for maintenance. 2. Plan to transition the benchmarks to another organization (such as a professional society, a science publisher, or an academic group) that may be better prepared to maintain them long term. The team is exploring collaboration opportunities with BioASQ, which may be willing to create a test challenge category in its NLP challenges collection that reflects our use case. If this collaboration is successful, the sustainability risk will be resolved. 3. Absorb the risk and leave the benchmarks published but unmaintained at the completion of the project. | Project manager |

Project Stakeholders

Sponsors:

  • Lars Greiffenberg, Abbvie

  • Raul Rodriguez-Esteban, Roche

  • TBD, Genentech

  • TBD, Merck

Project Participants:

Stakeholder mailing list in MS Teams: Large Language Models | General | Microsoft Teams

Meetings

  • 2026.05.28 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.05.14 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.04.30 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.04.16 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.04.02 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.03.19 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.03.05 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.02.19 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.02.05 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.01.22 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2026.01.08 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.12.18 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.12.05 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.11.18 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.11.04 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.10.27 Recording (only open to members on the LLM MS Team) Slides: Summary:

GitHub

https://github.com/PistoiaAlliance/LLM

2026.03.17 Interim Status Report

2026.06.01 Interim Status Report

  • The team evaluated the existing benchmarks and found that they are not fit for purpose. The greatest deficiency is that most of the examples we saw are "static", with infrequent updates. This carries the risk that the benchmark material is used for LLM training, after which the test results can no longer be trusted. It also means that the benchmark material is dated relative to bleeding-edge science, and hence of lesser value to practicing scientists. Another problem is that most existing benchmarks address the recall of information from LLM memory, not the ability of AI systems to interpret NL questions, plan the query strategy, and extract the relevant data. The latter is critical for the correct functioning of the emerging "AI co-scientist" systems.

  • We also learned that our scenario with the querying of structured data sources is considered hard by the builders of LLM benchmarks, and that the current performance of LLMs in this scenario is poor. This makes one question the fitness-for-purpose of the entire "AI co-scientist" class of software.

  • The team decided that the way forward is to work on a "dynamic" benchmark that includes the latest data from select sources (thus limiting the risk that top LLMs have learned from this data by the time of testing) and automated generation of questions.

  • This is a scope change relative to the initial vision for the project. The timeline shifted to later dates, but there is no risk to the project itself.

  • These conclusions were recorded in part in the March interim report (attached), and we are also planning a post (white paper) on the PA web site. I may draft it this week. The report in the PPT file is quite technical; I will make the white paper easier to read for those not deeply involved in these topics.

  • The team considered limiting the effort to just recording the requirements and the lessons learned, but since we have saved a lot of funding we may continue and work on a prototype.

  • As practical steps towards prototyping, we:

    • Selected Open Targets as the example data source.

    • Additional non-public data sources from technology vendors, such as the Clarivate database or CiteAb, may also be considered. This is being discussed right now.

    • Established understanding with the Open Targets leadership on the time stamping of the data elements in the database (recall that we want to pick the latest ones).

    • Realized that for any structured data source, only a limited number of high-level query types is possible. These query types can be enumerated and then used as templates for the automated generation of specific test questions. This can be easily scripted, at which point a prototype benchmark generator based on Open Targets can be used to test LLMs and agentic "co-scientist" systems.

    • We are exploring the possibility of using pre-release Open Targets data, which would make the testing even stricter. Since Open Targets only shares such data with a select club of its pharmaceutical industry partners, the testing itself may have to take place inside the firewall of the respective company that has access, but the results may be published.

    • We are exploring the possibility of automated test result evaluation and agentic fault detection. This is optional but highly desired; we have only just started looking into the relevant literature.

    • Established understanding with the potential academic partners at the BioASQ organization. The proposal is to publish our benchmarking questions as a separate challenge for their periodic testing events. This may also serve as a mechanism for the future sustainability of this work.

  • As a conclusion from the above, we realized that the next challenge for the AI co-scientist systems is benchmarking their problem-solving and planning logic. To the best of my knowledge, none of the recently published AI co-scientist systems comes with a fair competitive assessment of its ability in research planning.
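The template idea from the practical steps above (enumerate the high-level query types, then fill them with recent records) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Association` record layout, the `first_seen` timestamp field, the two templates, and all names are hypothetical and do not reflect the actual Open Targets schema or the team's prototype. The placeholder identifiers such as `TARGET_A` are not real data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Association:
    """One timestamped target-disease record (hypothetical schema)."""
    target: str
    disease: str
    first_seen: date  # assumed timestamp of when the record first appeared


@dataclass(frozen=True)
class BenchmarkQuestion:
    template_id: str
    question: str
    gold_answer: frozenset  # expected set of entities


# Enumerated high-level query types, written as fill-in templates (illustrative).
TEMPLATES = {
    "new_targets_for_disease":
        "Which targets were first associated with {disease} after {cutoff}?",
    "new_diseases_for_target":
        "Which diseases were first associated with {target} after {cutoff}?",
}


def generate_questions(records, cutoff):
    """Build questions only from records newer than the cutoff date."""
    fresh = [r for r in records if r.first_seen > cutoff]

    # Group the fresh records so each question has a complete gold answer set.
    targets_by_disease = {}
    diseases_by_target = {}
    for r in fresh:
        targets_by_disease.setdefault(r.disease, set()).add(r.target)
        diseases_by_target.setdefault(r.target, set()).add(r.disease)

    questions = []
    for disease, targets in sorted(targets_by_disease.items()):
        text = TEMPLATES["new_targets_for_disease"].format(
            disease=disease, cutoff=cutoff.isoformat())
        questions.append(
            BenchmarkQuestion("new_targets_for_disease", text, frozenset(targets)))
    for target, diseases in sorted(diseases_by_target.items()):
        text = TEMPLATES["new_diseases_for_target"].format(
            target=target, cutoff=cutoff.isoformat())
        questions.append(
            BenchmarkQuestion("new_diseases_for_target", text, frozenset(diseases)))
    return questions


# Placeholder records; the third one predates the cutoff and is ignored.
example_records = [
    Association("TARGET_A", "DISEASE_X", date(2026, 5, 1)),
    Association("TARGET_B", "DISEASE_X", date(2026, 5, 15)),
    Association("TARGET_A", "DISEASE_Y", date(2025, 1, 10)),
]
for q in generate_questions(example_records, date(2026, 1, 1)):
    print(q.template_id, "|", q.question, "|", sorted(q.gold_answer))
```

Because each question carries its gold answer set, the same structure could later feed an automated scoring step; how answers would actually be scored is left open here.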

Final Report

  • To be added

Lessons Learned

For benchmarking specifically

  1. Training data contamination is a major risk. Keep benchmark questions private initially to prevent LLMs from training on them; use a public/private split

  2. Existing biomedical NER/linking benchmarks are inadequate (small, leaky, outdated ontologies, miss long tail)

  3. Pharma interest centers on novel/rare entities and combinations, not saturated topics. Hence the desire for dynamic, frequently updated benchmarks whose updates are automated

  4. Hallucinations: LLMs may infer non-existent information from sources (high-order hallucination)

  5. Agentic vs. single LLMs: tests should apply to both; agentic systems reduce errors via verification but face the same benchmarks

  6. General-purpose benchmark performance does not predict domain-specific reliability
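One possible way to implement the public/private split mentioned in the first lesson is a deterministic, hash-based assignment, so that a question always lands in the same partition no matter when or where the split is computed. The sketch below is an assumption for illustration only: the function name, the salt, and the 80% private fraction are hypothetical choices, not decisions made by the team.

```python
import hashlib


def assign_split(question_id: str,
                 private_fraction: float = 0.8,
                 salt: str = "benchmark-v1") -> str:
    """Deterministically assign a question to the 'private' or 'public' set.

    The salt and the default fraction are illustrative assumptions.
    """
    digest = hashlib.sha256(f"{salt}:{question_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "private" if bucket < private_fraction else "public"


# Hypothetical question identifiers.
for qid in ["q-0001", "q-0002", "q-0003"]:
    print(qid, assign_split(qid))
```

Changing the salt produces a fresh partition, which could be useful if the private set is ever suspected of having leaked.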

General

  1. Custom-trained domain-specific LLMs frequently under-deliver relative to cost (e.g., Roche/Genentech experience shared by Etzard)

  2. Hybrid approach for NLQ systems: use LLMs for strategic planning (e.g., choosing resources), but rely on predefined APIs for queries to avoid syntax issues

  3. Performance of even the best frontier LLMs is variable, so they are not interchangeable (also observed in our Phase 1 experiments)
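The hybrid approach in the second lesson can be illustrated with a small dispatcher: the LLM only emits a plan naming one of a fixed set of query functions, and the plan is validated before anything runs, so the model never writes raw query syntax. Everything below is a hypothetical sketch. The tool names, the JSON plan format, and the stub functions (which return a description string instead of calling a real API) are assumptions, and no actual LLM call is made.

```python
import json


# Predefined query functions (stubs standing in for real API wrappers).
def targets_for_disease(disease_id: str) -> str:
    return f"would query targets for disease {disease_id}"


def trials_for_drug(drug_name: str) -> str:
    return f"would query clinical trials for drug {drug_name}"


# Registry of allowed tools and the exact argument names each one requires.
REGISTRY = {
    "targets_for_disease": (targets_for_disease, {"disease_id"}),
    "trials_for_drug": (trials_for_drug, {"drug_name"}),
}


def execute_plan(plan_json: str) -> str:
    """Validate an LLM-produced plan and dispatch it to a predefined function."""
    plan = json.loads(plan_json)
    tool = plan.get("tool")
    if tool not in REGISTRY:
        raise ValueError(f"unknown tool: {tool!r}")
    func, required = REGISTRY[tool]
    args = plan.get("args", {})
    if set(args) != required:
        raise ValueError(f"{tool} expects arguments {sorted(required)}")
    return func(**args)


# A plan as an LLM might return it (hand-written here for illustration).
plan = '{"tool": "targets_for_disease", "args": {"disease_id": "DISEASE_X"}}'
print(execute_plan(plan))
```

A rejected plan (unknown tool or wrong arguments) raises an error rather than reaching a data source, which is also a natural place to log faults for later analysis.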

References

NER Benchmarks

  • BC5CDR (BioCreative V): Focuses on chemical and disease mentions, used for assessing disease-drug relations

  • JNLPBA (Joint Workshop on Natural Language Processing in Biomedicine and Applications): A standard benchmark for protein, DNA, RNA, cell line, and cell type entities.

  • NCBI-disease: Focused on disease name recognition and normalization.

  • BC4CHEMD: Dataset for chemical entity recognition.

  • BELB (Biomedical Entity Linking Benchmark): A recent, comprehensive framework for standardizing the evaluation of biomedical entity linking across 11 corpora: https://arxiv.org/abs/2410.05046

  • https://pmc.ncbi.nlm.nih.gov/articles/PMC9931203/ (possibly relevant)

  • i2b2 (Shared Tasks): Frequently used for Clinical NER, particularly in de-identification tasks – not currently available, and may be of limited value to our use case

  • https://bioasq.org/

Database Retrieval Benchmark

Papers

  1. https://drive.google.com/file/d/1BV5UtmBRdpbQoz9jC1AuUF8WUTRQMqK_/view   (company site: https://lab-bench.ai/ ). The use case most relevant to us is called DbQA2 (see paper text). I note that the authors refer to the same contamination problem that we identified: "...access to more specific and esoteric information within each [database] is measured in order to more specifically measure true data access and avoid training knowledge contaminating results." I wonder whether we can review it in detail and try to re-use any of the questions from this benchmark. Is there anyone who'd volunteer for this task?

  2. https://arxiv.org/pdf/2512.15567 Less relevant use cases, but a similar approach of testing LLMs in scenarios that require action, not just recall. They did not use the NL database query use case. The provided use cases, however, may be of interest to other team members as they are quite common in drug discovery.

  3. Placebo Bench LLM hallucination benchmark: https://www.blueguardrails.com/en/blog/placebo-bench-an-llm-hallucination-benchmark-for-pharma and https://huggingface.co/datasets/blue-guardrails/PlaceboBench

  4. A set of benchmarks for testing of ADMET property prediction methods developed by Therapeutics Data Commons: https://tdcommons.ai/benchmark/admet_group/overview/ . Plus a critical review of it: https://www.biorxiv.org/content/10.64898/2026.02.26.708193v1. We can derive some best practices from this publication.

AI Co-scientist Systems