LLM Selection
This is the subpage for the LLM Selection sub-team: Jon Stevens, Etzard Stolte, Helena Deus, Brian Evarts, Wouter Franke, Matthijs van der Zee.
Notes from the January 24th general PM call:
08:40:27 From Brian Evarts (CPT) to Everyone:
Has anyone tried QLORA or other Quantization techniques for fine tuning?
08:42:05 From stevejs to Everyone:
@Brian we had a QLORA fine-tuned llama2 model that we fine-tuned to increase the sequence length. Quality was OK, but we haven’t used it in production because the model was pretty beefy and we need more infra to increase the speed of the model
Notes from the January 26th small team call:
Recording: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: 4=UyzhM$
Transcript: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: 4=UyzhM$
Private brainstorming document is at: LLM Selection.docx
List of candidate LLMs with evaluation criteria: LLMs.xlsx
Notes from February 1st small team call:
Recording: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: vW*uB7^2
Transcript: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: vW*uB7^2
The main action item is to add information to the list of candidate LLMs: LLMs.xlsx
Notes from February 15th small team call:
Recording: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: @6UEvs7^
Transcript: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: @6UEvs7^
Warning: BioCypher may not be W3C compliant and needs discussion in the large team before adoption, or we should consider alternatives. So far this is the most important question.
This team cannot make progress until we make the decision about BioCypher
Focus on smaller, cheaper models first? Pick a handful of models, at various size points, look up performance on general benchmarks
What is the task → that dictates the choice of the benchmarks
Verify that BioChatter has benchmarks for writing Cypher queries
How important is each benchmark? Perhaps create a linear model that combines multiple scores into a single score
Helena: This benchmark answers the question “what are the best embeddings” across a variety of tasks: https://huggingface.co/spaces/mteb/leaderboard
Convert into a weekly call at the same time on Thursdays for the next six weeks
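The idea above of combining multiple benchmark scores into a single score could be sketched as a weighted average. This is only an illustration: the benchmark names, weights, model names and score values below are hypothetical placeholders, not figures from LLMs.xlsx, and scores are assumed to be already scaled to the range 0 to 1.

```python
from typing import Dict


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of benchmark scores (each assumed scaled to [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Linear combination, normalized so the result stays in [0, 1].
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical weights: how important each benchmark is for our task.
weights = {"query_generation": 0.5, "text_answers": 0.3, "general_benchmark": 0.2}

# Hypothetical candidate models and scores, for illustration only.
candidates = {
    "model_a": {"query_generation": 0.8, "text_answers": 0.6, "general_benchmark": 0.5},
    "model_b": {"query_generation": 0.4, "text_answers": 0.9, "general_benchmark": 0.9},
}

# Rank candidates from best to worst combined score.
ranking = sorted(
    candidates,
    key=lambda model: combined_score(candidates[model], weights),
    reverse=True,
)
print(ranking)
```

Changing the weights changes the ranking, which is why the choice of task (and therefore of benchmarks and their importance) has to come first.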
Notes from February 22nd small team call:
Recording: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: *CRmXi.2
Transcript: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: *CRmXi.2
See notes in: LLM Selection.docx
Notes from March 7th small team call:
All models have been assigned. Please complete the details for the models assigned to you in this spreadsheet: LLMs.xlsx
Recording: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: onhG57v%
Transcript: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: onhG57v%
See notes in: LLM Selection.docx
Notes from March 14th small team call:
This call was short and not recorded
The remaining items in the LLM comparison table are costs for the Llama models (Brian to look up) and the performance figures on BioCypher (here we are dependent on Sebastian and may have to wait)
There is an expectation, based on team members' work experiences on other projects, that fine-tuning of open-source models may be heavily dependent on use case and may not be cost-effective
In that case GPT4 would win
Notes from March 21st small team call:
Recording: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: 8FhD=wtj
Transcript: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: 8FhD=wtj
Focus on assigning relative weights. It seems that the most important categories are accuracy (on the dimensions of generating queries and writing plain text answers based on structured input), which in turn requires awareness of the biological terminology; then whether the model is open-source or not; and finally the cost. The other factors are seen as collinear with these.
Homework: please review the spreadsheet and suggest values for the weights
Homework: action item for Brian: please add information in your columns in the spreadsheet [DONE]
New risk identified: some proprietary LLMs, such as ChatGPT, are censored by their authors. This means that when answering scientific questions they may introduce bias that we cannot control. This is a strong argument in favor of uncensored, open-source LLMs.
Based on today's discussion, we have to take back last week's statement that, all else being equal, ChatGPT 4 would win.
Notes from March 28th small team call:
Recording: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: uwwr&H5A
Transcript: Video Conferencing, Web Conferencing, Webinars, Screen Sharing Passcode: uwwr&H5A
Prompt size may be important, and we increased its weight in the comparison table
Preferred architecture would allow for swapping of LLMs
Censorship is most likely already included in the performance scores - this thought discounts the censorship risk
Given that not all scores are available, we may end up having to do our own evaluation
Consider hosting platforms for open-source models (Amazon Bedrock) instead of renting servers at AWS
Preference for hosted models with pay-per-token
Add this dimension to the spreadsheet - ACTION for Jon Stevens
Review rankings - ACTION for Brian and Etzard
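The preference noted above for an architecture that allows swapping of LLMs could be sketched as a small interface that the rest of the pipeline depends on. This is a minimal sketch under stated assumptions: the names LLMClient, EchoLLM and answer_question are hypothetical, and the stand-in backend does not call any real model or hosting platform.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical interface: anything with complete() can be plugged in."""

    def complete(self, prompt: str) -> str:
        ...


class EchoLLM:
    """Stand-in backend for illustration; a real one would call a hosted model."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        # Echo the prompt back, tagged with the backend name.
        return f"[{self.name}] {prompt}"


def answer_question(llm: LLMClient, question: str) -> str:
    """Pipeline code depends only on the interface, not on a specific model."""
    return llm.complete(f"Question: {question}")


# Swapping the LLM means passing a different object; the pipeline is unchanged.
print(answer_question(EchoLLM("backend_one"), "Which genes are linked to asthma?"))
print(answer_question(EchoLLM("backend_two"), "Which genes are linked to asthma?"))
```

With this shape, a proprietary model and a hosted open-source model would each get their own small wrapper class, and the comparison between them could be rerun without touching the rest of the code.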
Notes from the April 11th small team call:
This call was not recorded, but slides with extensive notes and a file with code captured from a Jupyter notebook (VM) are available:
Vladimir shared observations on LLM behavior in generating Cypher queries, and in answering questions in English based on structured input, all corroborated by Jon and Brian (and by Rob via email earlier)
The highest risk step is Cypher code generation
Agreed to delegate the LLM testing to the BioCypher team, and meanwhile pick two LLMs for POC (GPT4 and Mistral)
Officially close this work stream, because we gathered all information we could and now need to learn more by doing - and actually prototype a POC
The new team will be composed of the members of Thursday (LLM choice) and Friday (Open Targets and architecture) sub-teams, and will meet on Fridays
The matter of Cypher query generation from plain English questions is discussed here: Query generation with LLM