Benchmarks for LLMs in Biological R&D

We wish to explore the use of Large Language Models (LLMs) for biological research and to create benchmarks and tools for objectively assessing how well LLMs interpret scientific questions and mine scientific databases in natural language.

Problem Statement

The pharma/biotech industry is actively experimenting with Natural Language to Query Language translation (NL2QL), “AI co-scientist”, and Scientific Chat applications. Most recently, the Pistoia Alliance completed an investigation into the best strategies for using LLMs for natural-language data mining. One key finding of that study is the lack of appropriate benchmarks for assessing the individual steps of the NL data mining process. Without suitable test sets, tools for NL data mining are difficult to develop, evaluate, and compare. The proposed project aims to close this gap, with the Pistoia Alliance serving as a neutral party to organize benchmark development and maintenance.

Project objectives:

  • Find or construct benchmarks that enable assessment of the true performance of LLM natural-language query assistants at each of the four stages of the NL data mining process (a sketch of a possible per-stage test record follows this list):

    • Understanding the question

    • Recognition of named entities, synonyms, and disambiguation of terms

    • Building the structured query

    • Assessment of the overall answer quality

  • Measure and assure the quality of the identified or newly created benchmarks

  • Report the findings and lessons learned
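
To make the four stages concrete, below is a minimal sketch of what a single benchmark record might look like, assuming each test case carries gold annotations for every stage. The field names, the example identifiers, and the choice of Python dataclasses are illustrative assumptions, not a committed schema.

```python
# Hypothetical schema for one NL-data-mining benchmark record; every field
# name here is an assumption for illustration, not a project decision.
from dataclasses import dataclass, field


@dataclass
class NLMiningTestCase:
    # Stage 1: understanding the question
    question: str            # e.g. "Which approved melanoma drugs inhibit BRAF?"
    intent: str              # gold interpretation, e.g. "drug-target lookup"
    # Stage 2: named entities, synonyms, and disambiguation
    entities: dict[str, str] = field(default_factory=dict)
    # maps surface form to a canonical ID, e.g. {"melanoma": "MONDO:0005105"}
    # Stage 3: building the structured query
    gold_query: str = ""     # reference SQL/SPARQL against the target database
    # Stage 4: overall answer quality
    reference_answer: str = ""  # gold answer (text or serialized result set)
```

A per-stage record of this kind would let the same test case feed all four benchmarks, so failures can be attributed to a specific stage rather than to the pipeline as a whole.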

Value Proposition and Expected Results

We expect the project to deliver the following benefits to its stakeholders:

High-level:

  • Pharma company users can make better tool selection decisions based on the objective evaluation of technologies

  • Technology vendors can better plan product improvements

  • Understanding of the best practices and quality standards for the creation of benchmarks

    • This benefit extends beyond this specific use case

  • A process for creation of community-supported benchmarks at the Pistoia Alliance

    • This benefit is also broadly applicable

  • Enhanced best practices in Natural Language data mining

  • The overall quality and speed of drug discovery R&D may be improved

Technical:

  • Understanding of the current state of quality assessment and benchmarking for AI applications used in Natural Language data mining

  • A set of new benchmarks covering the steps of the Natural Language data mining process that currently lack appropriate ones

  • Publications

Alignment with the Pistoia Alliance Strategic Priorities

This project is part of the Artificial Intelligence at Scale strategic priority.

Project Scope

In scope:

  • Review of existing or proposed benchmarks for the four steps of the NL data mining process. Although we have good reason to believe this is a scientific gap, learning from earlier attempts should be instructive

  • White paper describing the problem space and the proposed solution (optionally more than one white paper, or an academic paper)

  • A set of scientific benchmarks for each of the four listed steps of NL data mining. Each should contain suitable test sets, statistical evaluation metrics and cutoffs (see the illustrative sketch after this list), an assessment against benchmark quality criteria, and recommendations for updates

  • A plan for long-term maintenance and evolution of the proposed benchmarks
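
As one illustration of “statistical evaluation metrics and cutoffs”, the sketch below scores the entity-recognition step with set-based precision, recall, and F1 over canonical identifiers and applies a pass threshold. The 0.80 cutoff and the example identifiers are assumptions made for the illustration, not project-agreed values.

```python
# A hedged sketch of a metric-plus-cutoff check for the entity-recognition
# step: compare predicted canonical IDs to the gold set and gate on F1.
def entity_f1(gold: set[str], predicted: set[str]) -> float:
    """Set-based F1 over canonical entity identifiers."""
    if not gold and not predicted:
        return 1.0  # trivially perfect on an empty case
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


PASS_CUTOFF = 0.80  # hypothetical acceptance threshold, to be agreed by the project

gold = {"MONDO:0005105", "HGNC:1097"}       # illustrative gold identifiers
predicted = {"MONDO:0005105", "HGNC:5"}     # one correct hit, one wrong ID
score = entity_f1(gold, predicted)
print(f"F1 = {score:.2f}; passes cutoff: {score >= PASS_CUTOFF}")
```

Analogous set-based or execution-based scores, each with its own cutoff, could gate the other three stages.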

Not in scope:

  • Training of custom Large Language Models

  • Development of novel NL data mining software or “AI co-scientist” systems

  • Benchmarks for other use cases beyond Natural Language data mining

  • Development of federated or secure compute benchmark execution systems

Success Measures:

Success is defined as the achievement of these specific aims:

  1. Existing benchmarks for each of the four steps in the NL query life cycle are reviewed, quality-assessed, and catalogued

  2. Gaps in the existing benchmarks are identified

  3. Benchmarks for these gaps are proposed, constructed, and assessed against the published quality criteria

  4. A set of POC experiments is completed in which public LLMs are assessed against the proposed new benchmarks (a sketch of such a run follows this list)

  5. The final report is published for the benefit of the PA AI Community and includes a long-term benchmark evolution and maintenance plan

  6. Code and data for any newly proposed benchmarks are deposited to GitHub
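
For the POC experiments, an evaluation harness could look roughly like the sketch below: each benchmark question is sent to a model, and the generated query is scored by execution accuracy against the gold query. The `ask_llm` placeholder, the SQLite backing store, and the `NLMiningTestCase` records from the earlier sketch are all assumptions; the project has not yet selected models or databases.

```python
# Sketch of a POC evaluation loop for the query-building benchmark, assuming
# a SQLite copy of the target data and test cases shaped like the earlier
# NLMiningTestCase sketch. ask_llm is a stand-in for whichever public model
# API the project eventually selects.
import sqlite3


def ask_llm(question: str) -> str:
    """Translate a natural-language question into SQL (placeholder)."""
    raise NotImplementedError("plug in the chosen public LLM here")


def execution_match(db_path: str, gold_sql: str, generated_sql: str) -> bool:
    """Execution accuracy: queries are equivalent if they return the same rows."""
    with sqlite3.connect(db_path) as conn:
        gold_rows = set(conn.execute(gold_sql).fetchall())
        try:
            generated_rows = set(conn.execute(generated_sql).fetchall())
        except sqlite3.Error:
            return False  # an unexecutable query counts as a failure
    return gold_rows == generated_rows


def run_benchmark(cases, db_path: str) -> float:
    """Return the fraction of questions answered with an execution-correct query."""
    hits = sum(
        execution_match(db_path, case.gold_query, ask_llm(case.question))
        for case in cases
    )
    return hits / len(cases)
```

Comparing result sets rather than query strings avoids penalizing generated queries that are syntactically different from the gold query but semantically equivalent.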

Project Phases and Milestones

| RUP Phase | Milestones | Deliverables | Est Date |
| --- | --- | --- | --- |
| Initiation | Project charter | 1. Activity charter; 2. Activity plan | 10/30/2025 (DONE) |
| Elaboration | Account of the existing benchmarks and the gaps they do not cover | White paper 1 | 12/31/2025 |
| Construction | Proposal for the benchmarks that are needed | White paper 2 (the two white papers may be published separately or combined into one) | 3/31/2026 |
| Construction | Assessment of selected AI models (exact selection TBD) against the proposed benchmarks | 1. Code and data package in GitHub; 2. Final report | 6/30/2026 |
| Transition | Plan for the long-term maintenance of the benchmarks | Final report | 7/1/2026 |

Risk Registry

Risks in green are resolved

Risks in yellow are in active research

Risks in white are general in nature

| Description | Mitigation | Responsible Party |
| --- | --- | --- |
| Inability to create high-quality benchmarks, which would inherently reduce their value | 1. Early in the project, learn and implement best practices, e.g. https://hai.stanford.edu/policy/what-makes-a-good-ai-benchmark ; assessing the benchmarks identified during literature exploration against the Stanford (or similar) quality criteria could provide good content for a future white paper and a good learning experience for the team. 2. Iterative development of benchmarks, with quality assessment sessions between rounds. | |
| All the benchmarks we seek already exist, the problem is solved, and our project is therefore redundant | Extensive literature review as the first activity; it will indicate which of the four steps in the NLQ work cycle lack proper benchmarks. Even if all parts of the work cycle are covered, documenting the benchmarks in one place has value in itself. | |
| Rapid technology development may make test cases, and the lessons learned from them, obsolete (HIGH probability) | 1. Maintain an ongoing literature review activity. 2. Iterative development of benchmarks. | |
| Lack of data for benchmarks | If data for a specific benchmark topic (e.g. a particular disease area) is lacking, choose another topic with more suitable/accessible data. | |
| Inability to maintain the resulting benchmark product after the active phase of the project is complete (HIGH probability) | 1. Explore the ability and willingness of project sponsors to extend funding for maintenance. 2. Plan to transition the benchmarks to another organization (such as a professional society, a science publisher, or an academic group) better prepared to maintain them long term. 3. Absorb the risk and leave the benchmarks published but unmaintained after project completion. | |

Project Stakeholders

Sponsors:

  • Lars Greiffenberg, Abbvie

  • Raul Rodriguez-Esteban, Roche

  • TBD, Genentech

Project Participants:

Stakeholder mailing list in MS Teams: Large Language Models | General | Microsoft Teams

Meetings

  • 2026.01.08 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.12.18 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.12.05 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.11.18 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.11.04 Recording (only open to members on the LLM MS Team) Slides: Summary:

  • 2025.10.27 Recording (only open to members on the LLM MS Team) Slides: Summary:

Github

https://github.com/PistoiaAlliance/LLM

Final Report

  • To be added

Lessons Learned

  • To be added