...

  • Modern LLMs have enough biological knowledge embedded in them to answer almost any question we (humans) can think of. This is a source of problems: hallucinations are indistinguishable from true answers, and we cannot fully test the LLM's innate ability to translate natural-language questions into structured queries (unless we obscure the terms with synonyms unknown to the LLM).

  • The highest-risk item is generation of the structured query (Cypher or SPARQL) from a plain-English request. Some publications estimate a success rate of about 48% on the first attempt (a minimal text-to-Cypher sketch appears after this list).

  • The structure of the database used for queries matters. LLMs produce meaningful structured queries more easily for databases with a flat, simple structure.

  • The form of the prompt matters. LLMs produce meaningful answers more easily from prompts that resemble a story rather than a dry question, even if the details of the story are irrelevant to the main question being asked.

  • A practically useful system requires filtering or secondary mining of the output, in addition to natural-language narration.

  • It is extremely important to implement a reliable named entity recognition system. The same acronym can refer to completely different entities, which can be differentiated either from context (hard) or by asking clarifying questions. The system must also map synonyms. Without these measures, naïve queries in a RAG environment will fail (see the entity-normalization sketch after this list).

  • LLMs may produce different structured queries from the same natural-language question. These queries may be semantically and structurally correct, but they may include assumptions about the limit on the number of items to return or about ordering, or lack these clauses altogether. The variations are not deterministic, so different execution rounds of the same natural-language question may produce different answers. It is necessary to formulate the limits, ordering restrictions, and other parameters explicitly when asking the question, or to determine the user's intentions in a conversation with chain-of-thought reasoning (see the query-constraint sketch after this list). A related question is whether implementation specifics of typical RAG systems with a vector database may introduce implicit restrictions on which data the LLM explores and which it does not, and thus artificially limit the answers. This may happen without the user knowing about the restrictions (and perhaps even without the system's authors realizing that such restrictions are embedded in the specifics of the system architecture).

  • Need for an API standard.

  • There is no good biological test set for LLM evaluation
    ◦ Existing test sets are saturated
    ◦ Background knowledge contaminates the results: frontier models have ~100% biological background knowledge, which makes evaluation of Cypher query generation difficult because these models can bypass query generation altogether and produce correct-looking answers from memorized knowledge
    ◦ Models need an independent mechanism for entity resolution, since the KG may not contain the specific synonyms provided by the user
    ◦ Small models (Mistral, Llama 13B, etc.) underperform; even their adapters / fine-tuned versions trained on Cypher generation fail on the OT KG
    ◦ Providing an automatically generated graph schema does not noticeably help for the OT KG
    ◦ Graph schema matters: the OT KG has a non-trivial schema that confuses LLMs

  • Template-based strategies achieve 100% performance in the current evaluation, but require hard-coded templates for information retrieval (see the template sketch after this list).

  • Agent-based strategies achieve 83-98% performance and do not require prior knowledge to work with KGs.

  • The DSPy framework could be used to optimize prompts and increase the success rate (see the DSPy sketch after this list).

  • Future directions:
    ◦ Development of a better test set that could be used to improve LLMs' ability to generate Cypher queries over biological knowledge
    ◦ Further improvement of the agent-based approach by using long-term history (generation of templates)
    ◦ Introducing variation of the KG schema as a testing parameter: using KGs with a strict schema as well as KGs automatically extracted from the literature, to test how robust the strategies are
    ◦ Introducing entity resolution to improve information retrieval
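
The sketches below illustrate several of the points above; they are minimal, hedged examples rather than the evaluated implementation. The first one shows the text-to-Cypher step: it assumes the OpenAI Python client, and the model name, prompt wording, and toy two-relationship schema are placeholders (not the actual Open Targets schema).

```python
# Minimal text-to-Cypher sketch (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder schema; the real OT KG schema is far richer.
GRAPH_SCHEMA = """
(:Target {symbol})-[:ASSOCIATED_WITH {score}]->(:Disease {name})
(:Drug {name})-[:ACTS_ON]->(:Target)
"""

def question_to_cypher(question: str) -> str:
    """Ask the model to translate a natural-language question into Cypher."""
    prompt = (
        "Translate the question into a Cypher query for the schema below.\n"
        f"Schema:\n{GRAPH_SCHEMA}\n"
        "Return only the Cypher statement, with an explicit ORDER BY and LIMIT.\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # reduce run-to-run variation in the generated query
    )
    return response.choices[0].message.content.strip()

print(question_to_cypher("Which drugs act on targets associated with asthma?"))
```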
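
Entity-normalization sketch, for the named entity recognition point: the lookup tables below are hypothetical examples, and a real system would rely on a dedicated NER/ontology service (e.g. Kazu, listed in the references) plus context to resolve ambiguous acronyms.

```python
# Hypothetical lookup tables for illustration only.
SYNONYMS = {
    "aspirin": "acetylsalicylic acid",
    "tp53": "TP53",
}
AMBIGUOUS_ACRONYMS = {
    # The same acronym can point to completely different entities.
    "cat": ["catalase (gene CAT)", "computed axial tomography (procedure)"],
}

def normalize(term: str) -> str:
    """Map a user-supplied term to a canonical KG entity name."""
    key = term.strip().lower()
    if key in AMBIGUOUS_ACRONYMS:
        # Do not guess: resolve from context or ask the user a clarifying question.
        options = "; ".join(AMBIGUOUS_ACRONYMS[key])
        raise ValueError(f"'{term}' is ambiguous, please clarify: {options}")
    return SYNONYMS.get(key, term)

print(normalize("Aspirin"))  # -> acetylsalicylic acid
# normalize("CAT") raises, prompting a clarifying question instead of a silent guess.
```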
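
Query-constraint sketch, for the point about non-deterministic limits and ordering: a naive post-check that makes an implicit LIMIT explicit. The regex check is purely illustrative; a production system would parse the query or, better, state the limit and ordering requirements in the question or prompt itself.

```python
import re

def enforce_limit(cypher: str, default_limit: int = 25) -> str:
    """Append an explicit LIMIT when the generated Cypher leaves it unspecified."""
    if re.search(r"\bLIMIT\s+\d+", cypher, flags=re.IGNORECASE) is None:
        cypher = f"{cypher.rstrip().rstrip(';')} LIMIT {default_limit}"
    return cypher

generated = (
    "MATCH (t:Target)-[a:ASSOCIATED_WITH]->(d:Disease {name: 'asthma'}) "
    "RETURN t.symbol ORDER BY a.score DESC"
)
print(enforce_limit(generated))  # same query with 'LIMIT 25' appended
```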
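
Template sketch, for the template-based strategy: each supported question type maps to a hard-coded, parameterized Cypher template. The template text and the "targets_for_disease" identifier are illustrative, not the actual templates used in the evaluation.

```python
# Hard-coded Cypher templates keyed by question type (toy schema).
TEMPLATES = {
    "targets_for_disease": (
        "MATCH (t:Target)-[a:ASSOCIATED_WITH]->(d:Disease {name: $disease}) "
        "RETURN t.symbol AS target, a.score AS score "
        "ORDER BY a.score DESC LIMIT $limit"
    ),
}

def build_query(template_id: str, **params):
    """Return a (query, parameters) pair ready for a graph-database driver."""
    return TEMPLATES[template_id], params

query, params = build_query("targets_for_disease", disease="asthma", limit=10)
print(query)
print(params)
# With the official neo4j driver this would run as: session.run(query, **params)
```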
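
DSPy sketch, for the prompt-optimization point: it assumes the DSPy 2.5+ API (dspy.LM, typed signatures); the model name, the toy schema, and the cypher_exact_match metric are placeholders.

```python
import dspy

# Placeholder model; assumes DSPy >= 2.5 style LM configuration.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class TextToCypher(dspy.Signature):
    """Translate a natural-language biology question into a single Cypher query."""
    question: str = dspy.InputField()
    graph_schema: str = dspy.InputField(desc="node labels and relationship types")
    cypher: str = dspy.OutputField(desc="one executable Cypher statement")

generate = dspy.ChainOfThought(TextToCypher)
prediction = generate(
    question="Which drugs act on targets associated with asthma?",
    graph_schema="(:Drug)-[:ACTS_ON]->(:Target)-[:ASSOCIATED_WITH]->(:Disease)",
)
print(prediction.cypher)

# Prompt optimization (sketch): given a small train set of question/Cypher pairs
# and a metric with the usual DSPy signature, e.g. a hypothetical
# cypher_exact_match(example, prediction, trace=None), an optimizer can compile
# a better-performing prompt for the same signature:
# optimizer = dspy.BootstrapFewShot(metric=cypher_exact_match)
# generate_optimized = optimizer.compile(generate, trainset=trainset)
```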

References

  1. https://www.sciencedirect.com/science/article/pii/S1359644613001542

  2. https://www.nature.com/articles/s41573-020-0087-3

  3. https://www.epam.com/about/newsroom/press-releases/2023/epam-launches-dial-a-unified-generative-ai-orchestration-platform

  4. https://epam-rail.com/open-source

  5. Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

  6. Chatbot Arena: https://chat.lmsys.org/?arena

  7. Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning

    https://arxiv.org/abs/2310.01061

  8. Knowledge-Consistent Dialogue Generation with Language Models and Knowledge Graphs

    https://openreview.net/forum?id=WhWlYzUTJfP

  9. BioChatter Benchmark Results: https://biochatter.org/benchmark-results/#biochatter-query-generation

  10. MTEB Benchmark (embeddings): https://huggingface.co/spaces/mteb/leaderboard

  11. LoRA Land and LoRAX: https://predibase.com/lora-land

  12. A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases. Summary: queries over a KG with GPT-4 are much more accurate than queries over a SQL database with GPT-4. https://arxiv.org/abs/2311.07509

  13. https://towardsdatascience.com/evaluating-llms-in-cypher-statement-generation-c570884089b3

  14. https://medium.com/neo4j/enhancing-the-accuracy-of-rag-applications-with-knowledge-graphs-ad5e2ffab663

  15. linkedlifedata.com

  16. Kazu - Biomedical NLP Framework: https://github.com/AstraZeneca/KAZU

  17. https://github.com/f/awesome-chatgpt-prompts/tree/main

  18. Zhou, L., Schellaert, W., Martínez-Plumed, F. et al. Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024). https://doi.org/10.1038/s41586-024-07930-y

  19. Karthik Soman, Peter W Rose, John H Morris, Rabia E Akbas, Brett Smith, Braian Peetoom, Catalina Villouta-Reyes, Gabriel Cerono, Yongmei Shi, Angela Rizk-Jackson, Sharat Israni, Charlotte A Nelson, Sui Huang, Sergio E Baranzini, Biomedical knowledge graph-optimized prompt generation for large language models, Bioinformatics, Volume 40, Issue 9, September 2024, btae560, https://doi.org/10.1093/bioinformatics/btae560

  20. https://www.promptingguide.ai/

  21. References on Named Entity Recognition in biological sciences: PubMed

  22. Incremental Knowledge Graphs Constructor Using Large Language Models