Query generation with LLM

Team: The Hyve; Jordan Ramsdell, Robert Gill, Brian Evans + Open Targets/EBI team: Sebastian Lobentanzer, Ellen McDonagh

Link to the working doc: https://docs.google.com/document/d/18vPi23prPnrBOX3xAQInilqC6C4ssy2neOl05DDSH1Y/edit?usp=sharing

 

2024.04.12 Sub-team meeting

 

2024.04.19 Sub-team meeting

Recording: https://pistoiaalliance-org.zoom.us/rec/share/hdZ_7adOrM9TmAxX2H3Byist0IETyMvMulABcJYk6PZR2TMnHsuZeJWtzPBtwXfB.q3UOy_kaD6Zmb0Pt

Passcode: PMik3&s%

 

Main objective: set up a testing environment to systematically evaluate and improve the ability of LLMs to generate Cypher queries

  • We will initially test GPT-4 and Mistral (picked by the LLM selection team)

  • How is Cypher query generation testing done by the BioCypher team?

  • Propose a set of English questions that we will use (limited by the current OT contents in BioCypher). Confirm with The Hyve that the questions in our list currently flagged as feasible are indeed feasible

    • Do we need to write “ideal” Cypher queries for these questions? Yes, and make sure we understand the questions asked; this is a line item in the RFP

    • VM to forward the questions in column M to experts

  • Comment from Etzard: in his system, Elasticsearch is used across documents to bypass failing SPARQL queries generated by the LLM. Is this conceptually similar to the method proposed by AbbVie? VM shared the recording of Jon’s talk.

  • In general, it is good for us to collect these and similar hacks that force LLMs to produce better queries

  • Action item for the team (all members): please think about any additional requirements that we need to include in the RFP

  • Note: if payments (beyond reimbursement of minimal cloud expenses) are needed, we must run a competitive RFP/RFQ
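A minimal sketch of what such a testing environment could look like, assuming each English question is paired with a hand-written "ideal" Cypher query and that a generated query passes when it returns the same rows as the ideal one. The names QuestionCase, evaluate, generate and run_query are hypothetical: generate stands in for a call to the LLM under test and run_query for whatever executes Cypher against the graph. Neither is an existing BioCypher or Open Targets API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Row = Tuple  # one result row; must be hashable so rows can be put in a set


@dataclass
class QuestionCase:
    question: str          # English competency question
    reference_cypher: str  # hand-written "ideal" Cypher query


@dataclass
class Outcome:
    question: str
    generated_cypher: str
    passed: bool
    error: str = ""


def evaluate(
    cases: Iterable[QuestionCase],
    generate: Callable[[str], str],              # hypothetical: question -> Cypher text
    run_query: Callable[[str], Iterable[Row]],   # hypothetical: Cypher text -> rows
) -> List[Outcome]:
    """Run every case and compare generated results with reference results."""
    outcomes = []
    for case in cases:
        # A broken reference query is a test-suite bug, so let it raise.
        expected = set(run_query(case.reference_cypher))
        generated = generate(case.question)
        try:
            # Set comparison ignores row order and duplicate rows.
            actual = set(run_query(generated))
            outcomes.append(Outcome(case.question, generated, actual == expected))
        except Exception as exc:  # syntax errors, unknown labels, timeouts, ...
            outcomes.append(Outcome(case.question, generated, False, str(exc)))
    return outcomes


def accuracy(outcomes: List[Outcome]) -> float:
    """Fraction of questions whose generated query matched the reference."""
    return sum(o.passed for o in outcomes) / len(outcomes) if outcomes else 0.0
```

Comparing result sets rather than query text means two differently written but equivalent queries both pass. The recorded error messages can be reviewed when deciding which improvement techniques to try.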

 

2024.05.03 Sub-team meeting

Recording:

https://pistoiaalliance-org.zoom.us/rec/share/Kk5LKB30E9wJLt5K9x0gdyJOPkAyfgnnOX2Aroob27CWbvEaYM4Tzid5Vv6cYwfp.VXBa1-hMCZ3rv3XJ

Passcode: jm+.s6T.

There were internet connectivity failures during the call, so the recording may be imperfect.

  • A testing environment to systematically evaluate and improve the ability of LLMs to generate Cypher queries requires vendor support, and hiring a vendor requires an RFP. Vladimir informed the team about the upcoming RFP and its process.

  • We will need to create "correct" answers to the scientific competency questions that we plan to use in the POC. ZS colleagues volunteered to perform this service. Details will be decided next week in a call between Vladimir and Bruce Press. This is FYI only, no action needed.

  • Participants observed that not all questions contained in our scientific competency question list can be answered based only on the information in Open Targets. For example, any questions that refer to clinical trials may not have complete answers based on the Open Targets contents alone. These issues will not have an effect on the POC project, however. In the future we may have to ask the project funders whether they would like to invest in data improvements or not.

  • We will have to brainstorm techniques for improving LLM performance in writing Cypher queries (or queries in any other structured query language).

    • A registry of proposed techniques, containing a high-level or pseudocode description of each: https://docs.google.com/document/d/18vPi23prPnrBOX3xAQInilqC6C4ssy2neOl05DDSH1Y/edit?usp=sharing

    • Peter Dorr shared DOIs to papers that describe other techniques in this field - already captured in the brainstorming document

    • Peter Dorr agreed to organize a talk by his organization in one of our main team (Wednesday) meetings, where the methods developed by his company can be shared. Exact date TBD
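As an illustration of the kind of entry the registry could hold, here is a minimal sketch of one simple technique in the spirit of the search fallback Etzard described on 2024.04.19. It retries a failed query with the database error fed back to the model, then falls back to plain text search. The retry-with-feedback step is an assumption added for illustration and was not discussed in the meeting. All names (answer_with_fallback, generate, run_query, text_search) are hypothetical placeholders, not an existing API.

```python
from typing import Any, Callable, List, Tuple


def answer_with_fallback(
    question: str,
    generate: Callable[[str, str], str],       # hypothetical: (question, feedback) -> query
    run_query: Callable[[str], List[Any]],     # hypothetical: executes the query, may raise
    text_search: Callable[[str], List[Any]],   # hypothetical: document search fallback
    max_attempts: int = 2,
) -> Tuple[str, List[Any]]:
    """Return ("graph", rows) if a generated query runs, else ("search", hits)."""
    feedback = ""  # empty on the first attempt
    for _ in range(max_attempts):
        query = generate(question, feedback)
        try:
            return "graph", run_query(query)
        except Exception as exc:
            # Pass the failing query and the error text back to the model.
            feedback = f"Previous query failed: {query!r} ({exc})"
    # Every attempt failed: bypass query generation entirely.
    return "search", text_search(question)
```

The returned label tells the caller which path produced the answer, so the share of questions that needed the fallback can be tracked.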

 

2024.05.10 Sub-team meeting

Recording: https://pistoiaalliance-org.zoom.us/rec/share/lO0lWPyLwAbmSNSwhq-zRAzplmt4gWQ6Mk7nEVf7xBPj9vqlkQaGN_AvrLj_-Wfo.sVYItwdj42Y1L0lJ

Passcode: W^rVQ6W=

  • We agreed to focus on making edits and additions to the list of methods in https://docs.google.com/document/d/18vPi23prPnrBOX3xAQInilqC6C4ssy2neOl05DDSH1Y/edit?usp=sharing

  • Action items:

    • For all team members: review the list of methods and add new ones, or note errors and omissions in the existing descriptions. PLEASE LOOK INTO THIS

    • VM to ask EPAM colleagues about the location of the code that selects the query template in method #1 - DONE, ALAS, CODE NOT AVAILABLE

    • VM to confirm with Jon Stevens that the pseudocode in method #3 is accurate - DONE, YES IT IS

    • Brian Evarts has connections/colleagues who work on similar technologies and will copy the links

  • We agreed to skip the call on May 17th (VM will cancel it) and meet again on May 24th

 

2024.05.24 Sub-team meeting

Recording: https://pistoiaalliance-org.zoom.us/rec/share/ifOiIKlsKGzOA_8LdYV61EZQT4Q5yUaaOEFnjv7i4d4v-FnMov14tWWuck6LRrIp.iwYxlHCH0exbA31r?startTime=1716559386000

Passcode: ZA!z3S5t

2024.06.21 Sub-team meeting

Agenda:

  1. Last chance for any Q&A about the RFP

  2. Look at the specific business questions that we will use for testing, and the “true” answers for them: https://docs.google.com/document/d/1_WvkgveIxUYb8rS_5wBq4VLL48CTNphq/edit?usp=drive_link&ouid=111803761008578493760&rtpof=true&sd=true (this work was done by the ZS team - many thanks!)

Recording:

https://pistoiaalliance-org.zoom.us/rec/share/j1s8mE9Ytq4aEkMHoYroEKfcx0ThzeArtvbkcmMrB9yqCoJBHMltBgP-hsjLpgWz.-pEWfzVBloiYxqbA?startTime=1718978573000
Passcode: P0@LM7E!