
Reference papers: https://www.sciencedirect.com/science/article/pii/S1359644613001542 and https://www.nature.com/articles/s41573-020-0087-3 list many possible competency questions. We need to refine that list into 20 to 100 questions (no more; ideally fewer) that can be answered with knowledge graphs and LLM RAG.

Sub-Team: Lee Harland, John Wise, Bruce Press

Link to the working Google doc: https://docs.google.com/document/d/17-fwEYe1BKiGzZ4rzV7oKEJ-pWGrbN9WV1r-2INwTRI/edit?usp=sharing

Link to the spreadsheet with questions: https://docs.google.com/spreadsheets/d/16arTCfdguNGdl916ZdbRkaS7SxDa9x3I/edit?usp=sharing&ouid=111803761008578493760&rtpof=true&sd=true

2024.02.12 sub-team call:

  • Recording:

  • This sub-team is done.

  • The questions are defined. Some can be readily answered with the Open Targets KG. "Good enough" questions are acceptable; not everything has to be answerable right away.

  • Next, take one or more of the easy-to-address ("green") questions and feed them to the KG, as a proof of concept (POC) for technical and procedural feasibility.

    • No need to invest in the harder questions now, and no need to expand the Open Targets KG.

    • Success criterion #1: compare the RAG answer with the opinion of a human scientist who is an expert on the topic of the question.

    • Success criterion #2: compare the RAG answer (obtained by asking the LLM a plain-language question) with the KG-derived answer produced by an expert data scientist.

  • Strategy: at the next large team call, ask the team to validate the approach of encapsulating Open Targets KG questions as separate modules for a RAG system, each with a tuned prompt, one per question. But suppose the POC succeeds: then what? We need a better, more exciting vision of success beyond the POC.

    • Creating many open-source APIs to public data sources is not exciting in itself. Perhaps define a standard for such APIs?

    • The need for such an API standard is one lesson learned.

    • This is valuable for KG software vendors: it would ease the "wiring in" of additional data, including proprietary data sources.

    • What is the volatility of the data sets we eventually want to use? Rapidly developing data sources may need continuous updates to accommodate ongoing changes in their structure. This argues for a data standard for rapidly evolving sources, covering both data increments (the easier case) and entirely new data dimensions and variables (the harder case). Could a data source present itself to a RAG query system so that data updates can be automated?

    • Possible new risk: will LLMs be confused by similar data types coming from a multitude of sources?

    • An ecosystem of providers, vendors, and experts is needed, not just OpenAI ChatGPT with Medline in it.
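The "one module per competency question" idea from the strategy bullet above can be sketched as follows. This is a minimal illustration only: the KG lookup and LLM client are stubs, and all names (QuestionModule, mock_kg_query, the example fact) are hypothetical, not part of the Open Targets API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QuestionModule:
    """One RAG module per competency question: a tuned prompt plus a KG query."""
    name: str
    prompt_template: str                   # tuned prompt with {facts} and {question} slots
    kg_query: Callable[[str], list]        # returns KG facts relevant to the question

    def answer(self, question: str, llm: Callable[[str], str]) -> str:
        facts = self.kg_query(question)
        prompt = self.prompt_template.format(
            question=question, facts="\n".join(facts))
        return llm(prompt)


# Stub KG retrieval and LLM so the sketch runs end to end (illustrative data).
def mock_kg_query(question: str) -> list:
    return ["EGFR is associated with non-small cell lung carcinoma"]


def mock_llm(prompt: str) -> str:
    return "Grounded answer based on:\n" + prompt


module = QuestionModule(
    name="target-disease association",
    prompt_template=("Using only these knowledge-graph facts:\n{facts}\n"
                     "Answer the question: {question}"),
    kg_query=mock_kg_query,
)

print(module.answer("Which diseases are associated with EGFR?", mock_llm))
```

Keeping the prompt and the KG query together in one module is what makes per-question tuning (and later comparison against an expert's KG-derived answer, per success criterion #2) straightforward.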
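The question of whether a data source could "present itself" to a RAG query system might look like the sketch below: the source publishes a small manifest describing its schema and update cadence, and the consumer distinguishes the easy case (row-level increments) from the hard case (new dimensions or variables). The manifest fields and endpoint are assumptions for illustration, not an existing standard.

```python
import json

# Hypothetical self-description published by a data source.
manifest = {
    "source": "example-genetics-db",            # illustrative source name
    "version": "2024-02-01",
    "update_cadence": "weekly",
    "entities": {
        "target": {"id_field": "ensembl_id", "fields": ["symbol", "tractability"]},
        "disease": {"id_field": "efo_id", "fields": ["name", "therapeutic_area"]},
    },
    # Easier case: incremental row-level updates fetched from an endpoint.
    "increments": {"endpoint": "/changes?since={version}", "format": "jsonl"},
}


def schema_changed(old: dict, new: dict) -> bool:
    """Flag the harder case: new entities or variables, not just new rows."""
    if old["entities"].keys() != new["entities"].keys():
        return True
    return any(old["entities"][k]["fields"] != new["entities"][k]["fields"]
               for k in old["entities"])


# Simulate a source that added a new variable to an existing entity.
updated = json.loads(json.dumps(manifest))      # deep copy via JSON round-trip
updated["entities"]["target"]["fields"].append("safety_flags")
print(schema_changed(manifest, updated))        # a new variable appeared
```

A RAG system could poll such manifests to decide automatically whether a routine increment fetch suffices or whether human review of a structural change is needed.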
