...
https://www.sciencedirect.com/science/article/pii/S1359644613001542
Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Chatbot Arena: https://chat.lmsys.org/?arena
Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning
Knowledge-Consistent Dialogue Generation with Language Models and Knowledge Graphs
BioChatter Benchmark Results: https://biochatter.org/benchmark-results/#biochatter-query-generation
MTEB Benchmark (embeddings): https://huggingface.co/spaces/mteb/leaderboard
LoRA Land and LoRAX: https://predibase.com/lora-land
A Benchmark to Understand the Role of Knowledge Graphs on Large Language Model's Accuracy for Question Answering on Enterprise SQL Databases. Summary: GPT-4 answers questions far more accurately when they are posed over a knowledge graph representation of an enterprise SQL database (54% accuracy) than when they are posed directly over the SQL database with zero-shot prompts (16%). https://arxiv.org/abs/2311.07509
https://towardsdatascience.com/evaluating-llms-in-cypher-statement-generation-c570884089b3
Kazu - Biomedical NLP Framework: https://github.com/AstraZeneca/KAZU
Zhou, L., Schellaert, W., Martínez-Plumed, F. et al. Larger and more instructable language models become less reliable. Nature 634, 61–68 (2024). https://doi.org/10.1038/s41586-024-07930-y