Benchmarks for LLMs in Biological R&D
We wish to explore the use of Large Language Models (LLMs) in biological research and to create benchmarks and tools for objectively assessing how well LLMs interpret scientific questions and mine scientific databases in natural language.
Problem Statement
The pharma/biotech industry is actively experimenting with Natural Language to Query Language translation (NL2QL), “AI co-scientist”, and Scientific Chat applications. Most recently, the Pistoia Alliance completed an investigation into the best strategies for using LLMs to mine data in natural language. A key finding of that study is the lack of appropriate benchmarks for assessing each step of the NL data mining process. Without suitable test sets, developing and comparing NL data mining tools is difficult. The proposed project aims to close this gap, with the Pistoia Alliance serving as a neutral party organizing benchmark development and maintenance.
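To make the task concrete, here is a minimal sketch of the kind of translation an NL2QL assistant performs. The question, the SQL, and the schema (hypothetical drugs and drug_targets tables) are invented for illustration and are not artifacts of this project.

# Hypothetical NL2QL example; the schema below is invented for illustration.
question = "Which approved drugs inhibit EGFR?"

# One structured query an NL2QL assistant might produce:
expected_sql = """
SELECT d.name
FROM drugs AS d
JOIN drug_targets AS t ON t.drug_id = d.id
WHERE t.gene_symbol = 'EGFR'
  AND t.action = 'inhibitor'
  AND d.approval_status = 'approved';
"""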
Project objectives:
Find or construct benchmarks that enable assessment of the true performance of LLM natural language query assistants at each of the four stages of the NL data mining process (a sketch of a test item spanning all four stages follows this list):
Understanding the question
Recognition of named entities, synonyms, and disambiguation of terms
Building the structured query
Assessment of the overall answer quality
Measure and assure the quality of the identified or newly created benchmarks
Report the findings and lessons learned
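A minimal sketch of how a single test item might span all four stages. The field names, the structure, and the example values are hypothetical; the project has not yet defined a benchmark format.

from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """A hypothetical test case covering the four NL data mining stages."""
    # Stage 1: understanding the question
    question: str                                                # natural language input
    intent: str                                                  # gold interpretation label
    # Stage 2: entity recognition, synonyms, disambiguation
    gold_entities: dict[str, str] = field(default_factory=dict)  # mention -> canonical ID
    # Stage 3: building the structured query
    gold_query: str = ""                                         # reference SQL/SPARQL
    # Stage 4: overall answer quality
    gold_answer: list[str] = field(default_factory=list)         # reference result set

item = BenchmarkItem(
    question="Which approved drugs inhibit EGFR?",
    intent="drugs_by_target",
    gold_entities={"EGFR": "HGNC:3236"},     # HGNC:3236 is the HGNC ID for EGFR
    gold_query="SELECT ...",                 # elided; see the NL2QL sketch above
    gold_answer=["gefitinib", "erlotinib"],  # illustrative, not exhaustive
)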
Value Proposition and Expected Results
We expect the project to deliver the following benefits to its stakeholders:
High-level:
Pharma company users can make better tool selection decisions based on the objective evaluation of technologies
Technology vendors can better plan product improvements
Understanding of the best practices and quality standards for the creation of benchmarks
This benefit extends beyond this specific use case
A process for creating community-supported benchmarks at the Pistoia Alliance
This benefit is also broadly applicable
Enhanced best practices in Natural Language data mining
The overall quality and speed of drug discovery R&D may be improved
Technical:
Understanding of the current state of quality assessment and benchmarking for AI applications used in Natural Language data mining
A set of benchmarks that cover steps in the Natural Language data mining process that currently do not have appropriate benchmarks
Publications
Alignment with the Pistoia Alliance Strategic Priorities
This project is part of the Artificial Intelligence at Scale strategic priority.
Project Scope
In scope:
Review of existing or proposed benchmarks for the four steps of the NL data mining process. Although we have good reason to believe this is a scientific gap, learning from earlier attempts should be instructive
White paper describing the problem space and the proposed solution (optionally more than one white paper, or an academic paper)
A set of scientific benchmarks, one for each of the four listed steps of NL data mining. Each benchmark should include suitable test sets, statistical evaluation metrics and cutoffs, an assessment against benchmark quality criteria, and recommendations for updates (a sketch of metric-and-cutoff scoring follows this list)
A plan for long-term maintenance and evolution of the proposed benchmarks
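As an illustration of “statistical evaluation metrics and cutoffs”, here is a minimal sketch of set-based precision/recall/F1 scoring with a pass/fail threshold. The 0.8 cutoff and the example identifiers are arbitrary placeholders, not project decisions.

def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based scoring, e.g. for the entity recognition step."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)  # true positives
    precision = tp / len(predicted)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

CUTOFF = 0.8  # placeholder threshold; real cutoffs would be set per benchmark
p, r, f1 = precision_recall_f1({"EGFR", "ERBB2"}, {"EGFR"})
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} pass={f1 >= CUTOFF}")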
Not in scope:
Training of custom Large Language Models
Development of novel NL data mining software or “AI co-scientist” systems
Benchmarks for other use cases beyond Natural Language data mining
Development of federated or secure compute benchmark execution systems
Success Measures:
Success is defined as the achievement of the specific aims:
Existing benchmarks for each of the four steps in the NL query life cycle are reviewed, quality-assessed, and catalogued
Gaps in the existing benchmarks are identified
Benchmarks for these gaps are proposed, constructed, and assessed against the published quality criteria
A set of proof-of-concept (POC) experiments in which public LLMs are assessed against the proposed new benchmarks (a sketch of such an evaluation loop follows this list)
Final report published for the benefit of the Pistoia Alliance AI Community, including the long-term benchmark evolution and maintenance plan
Code and data for any newly proposed benchmarks deposited to GitHub
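A minimal sketch of what such a POC evaluation loop could look like for the query building step, judging generated queries by whether they return the same results as the gold query. Here ask_model and run_query are hypothetical stand-ins for an LLM API wrapper and a read-only database client; neither is a real interface from this project.

def execution_accuracy(benchmark, ask_model, run_query) -> float:
    """Fraction of items where the model's query returns the gold result set.

    `benchmark` yields (question, gold_query) pairs; `ask_model` and
    `run_query` are caller-supplied callables (hypothetical here).
    """
    hits = total = 0
    for question, gold_query in benchmark:
        total += 1
        generated = ask_model(question)  # NL -> structured query
        try:
            # Compare result sets rather than query text, since different
            # queries can be semantically equivalent.
            if set(run_query(generated)) == set(run_query(gold_query)):
                hits += 1
        except Exception:
            pass  # a query that fails to execute counts as a miss
    return hits / total if total else 0.0

Comparing execution results rather than query strings follows the practice of public NL2SQL benchmarks such as Spider, where textually different queries can be equally correct.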
Project Phases and Milestones
RUP Phase | Milestones | Deliverables | Est Date |
Initiation | Project charter | | 10/30/2025 (DONE) |
Elaboration | Account of the existing benchmarks and the gaps not covered by them | White paper 1 | 12/31/2025 |
Construction | Proposal for the benchmarks that are needed | White paper 2 (the two white papers may be published separately or combined into one) | 3/31/2026 |
Construction | Assessment of selected AI models (exact set TBD) against the proposed benchmarks | | 6/30/2026 |
Transition | Plan for the long-term maintenance of the benchmarks | Final report | 7/1/2026 |
Risk Registry
Risks in green are resolved
Risks in yellow are under active investigation
Risks in white are general in nature
Description | Mitigation | Responsible Party |
Inability to create benchmarks of sufficient quality, which would inherently reduce their value | | |
All benchmarks that we seek already exist, the problem is solved, and the project is therefore redundant | Extensive literature review as the first activity; it will indicate which of the four steps in the NLQ work cycle lack proper benchmarks. Even if all parts of the work cycle have benchmarks, documenting them in one place has value in itself. | |
Rapid technology development may make test cases, and the lessons learned from them, obsolete (HIGH probability) | | |
Lack of data for benchmarks | If data for a specific benchmark topic (e.g., a particular disease area) is lacking, choose another topic with more suitable/accessible data. | |
Inability to maintain the resulting benchmark product after the active phase of the project is complete (HIGH probability) | | |
Project Stakeholders
Sponsors:
Lars Greiffenberg, Abbvie
Raul Rodriguez-Esteban, Roche
TBD, Genentech
Project Participants:
Stakeholder mailing list in MS Teams: Large Language Models | General | Microsoft Teams
Meetings
2026.01.08 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.12.18 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.12.05 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.11.18 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.11.04 Recording (only open to members on the LLM MS Team) Slides: Summary:
2025.10.27 Recording (only open to members on the LLM MS Team) Slides: Summary:
GitHub
https://github.com/PistoiaAlliance/LLM
Final Report
To be added
Lessons Learned
To be added