Galileo, a leading machine learning (ML) company for unstructured data, released a Hallucination Index developed by its research arm, Galileo Labs, to help users of today’s leading LLMs determine which model is least likely to hallucinate for their intended application. The findings can be viewed HERE.
“2023 has been the year of LLMs. While everyone from individual developers to Fortune 50 enterprises has been learning how to wrangle this novel new technology, two things are clear: first, LLMs are not one size fits all and second, hallucinations remain one of the greatest hurdles to LLM adoption,” said Atindriyo Sanyal, Galileo’s co-founder and CTO. “To help builders identify which LLMs to use for their applications, Galileo Labs created a ranking of the most popular LLMs based on their propensity to hallucinate using our proprietary hallucination evaluation metrics, Correctness and Context Adherence. We hope this effort sheds light on LLMs and helps teams pick the perfect LLM for their use case.”
While businesses of all sizes are building LLM-based applications, these efforts are being hindered by hallucinations that pose significant challenges in generating accurate and reliable responses. With hallucinations, AI generates information that appears realistic at first glance yet is ultimately incorrect or disconnected from the context.
To help teams get a handle on hallucinations and identify the LLM that best suits their needs, Galileo Labs developed a Hallucination Index that takes 11 LLMs from OpenAI (GPT-4-0613, GPT-3.5-turbo-1106, GPT-3.5-turbo-0613 and GPT-3.5-turbo-instruct), Meta (Llama-2-70b, Llama-2-13b and Llama-2-7b), TII UAE (Falcon-40b-instruct), Mosaic ML (MPT-7b-instruct), Mistral.ai (Mistral-7b-instruct) and Hugging Face (Zephyr-7b-beta) and evaluates each LLM’s likelihood to hallucinate in common generative AI task types.
Key insights include:
- Question & Answer without Retrieval-Augmented Generation (RAG): OpenAI’s GPT-4 emerges as the top performer with a Correctness Score of 0.77, showing the least likelihood of hallucination in the Question & Answer without RAG task and underscoring its dominance in general knowledge applications.
Among open source models, Meta’s Llama-2-70b leads (Correctness Score = 0.65), while other models like Meta’s Llama-2-7b-chat and Mosaic ML’s MPT-7b-instruct showed a higher propensity for hallucinations in similar tasks with Correctness Scores of 0.52 and 0.40 respectively.
The Index recommends GPT-4-0613 for reliable and accurate AI performance in this task type.
- Question & Answer with RAG: OpenAI’s GPT-4-0613 excelled as the top contender with a Context Adherence score of 0.76, while the more cost-effective and faster GPT-3.5-turbo-0613 and -1106 models matched its performance closely with Context Adherence scores of 0.75 and 0.74 respectively.
Surprisingly, Hugging Face’s Zephyr-7b (Context Adherence Score = 0.71), an open source model, surpassed Meta’s much larger Llama-2-70b (Context Adherence Score = 0.68), challenging the notion that bigger models are inherently superior.
However, TII UAE’s Falcon-40b (Context Adherence Score = 0.60) and Mosaic ML’s MPT-7b (Context Adherence Score = 0.58) lagged for this task.
The Index recommends GPT-3.5-turbo-0613 for this task type.
- Long-form Text Generation: OpenAI’s GPT-4-0613 again emerged as a top performer (Correctness Score = 0.83), showing the least tendency to hallucinate, while GPT-3.5-turbo models (1106 and 0613) matched its prowess with Correctness Scores of 0.82 and 0.81 respectively, offering potential cost savings and enhanced performance.
Remarkably, Meta’s open source Llama-2-70b-chat rivaled GPT-4’s capabilities (Correctness Score = 0.82), presenting an efficient alternative for this task. Conversely, TII UAE’s Falcon-40b (Correctness Score = 0.65) and Mosaic ML’s MPT-7b (Correctness Score = 0.53) trailed behind in effectiveness.
The Index recommends Llama-2-70b-chat for an optimal balance of cost and performance in Long-form Text Generation.
- OpenAI Comes Out on Top: When it comes to hallucinations, OpenAI’s models outperformed their peers across all task types. This comes at a cost, however, as OpenAI’s API-based pricing model can quickly drive up the costs associated with building a Generative AI product.
- Open Source Cost Savings Opportunities: Within OpenAI’s model offerings, organizations can reduce spend by opting for lower-cost versions of its models, such as GPT-3.5-turbo. The biggest cost savings, however, come from going with open source models.
- For Long-form Text Generation task types, Meta’s open source Llama-2-13b-chat model is a worthy alternative to OpenAI’s models.
- For Question & Answer with RAG task types, users can confidently try the nimble but powerful Zephyr model from Hugging Face instead of OpenAI. The inference cost of Zephyr is 10 times lower than that of GPT-3.5 Turbo.
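For readers who want to compare the figures side by side, the scores quoted in the insights above can be collected into a small lookup. The following Python sketch uses only the numbers stated in this release; the dictionary layout and the helper names `rank` and `gap` are illustrative assumptions and are not part of Galileo’s tooling. Models for which no score is quoted here are omitted.

```python
# Scores as quoted in this release (higher is better).
# Task keys and helper names are illustrative, not Galileo's own.
SCORES = {
    "qa_without_rag": {  # Correctness
        "GPT-4-0613": 0.77,
        "Llama-2-70b": 0.65,
        "Llama-2-7b-chat": 0.52,
        "MPT-7b-instruct": 0.40,
    },
    "qa_with_rag": {  # Context Adherence
        "GPT-4-0613": 0.76,
        "GPT-3.5-turbo-0613": 0.75,
        "GPT-3.5-turbo-1106": 0.74,
        "Zephyr-7b": 0.71,
        "Llama-2-70b": 0.68,
        "Falcon-40b": 0.60,
        "MPT-7b": 0.58,
    },
    "long_form_text": {  # Correctness
        "GPT-4-0613": 0.83,
        "GPT-3.5-turbo-1106": 0.82,
        "Llama-2-70b-chat": 0.82,
        "GPT-3.5-turbo-0613": 0.81,
        "Falcon-40b": 0.65,
        "MPT-7b": 0.53,
    },
}


def rank(task: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs for a task, best score first."""
    return sorted(SCORES[task].items(), key=lambda item: item[1], reverse=True)


def gap(task: str, model_a: str, model_b: str) -> float:
    """Return how far model_a's score sits above model_b's for a task."""
    return SCORES[task][model_a] - SCORES[task][model_b]


if __name__ == "__main__":
    for task in SCORES:
        print(task)
        for model, score in rank(task):
            print(f"  {model}: {score:.2f}")
```

A ranking like this only reflects hallucination scores; the Index’s recommendations also weigh cost, which is why the recommended model is not always the top scorer.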
Supporting these analyses are Galileo’s proprietary evaluation metrics, Correctness and Context Adherence. These metrics are powered by ChainPoll, a hallucination detection methodology developed by Galileo Labs. During the creation of the Index, Galileo’s evaluation metrics were shown to detect hallucinations with 87% accuracy, giving teams a reliable way to automatically detect hallucination risk and saving them the time and cost typically spent on manual evaluation.
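This release does not detail how ChainPoll works. As a rough illustration only, and assuming the general approach of asking an LLM judge the same question several times and aggregating its yes/no verdicts, the polling step might be sketched as follows. The `judge` callable, the function name and the default of five polls are hypothetical; this is not Galileo’s implementation, and how the resulting fraction maps onto the Correctness and Context Adherence scores is not specified here.

```python
from typing import Callable

# Hypothetical judge signature: takes (prompt, response) and returns True
# when it believes the response contains a hallucination. In practice this
# would wrap a call to an LLM; here it is left abstract.
Judge = Callable[[str, str], bool]


def poll_hallucination_rate(
    judge: Judge, prompt: str, response: str, n_polls: int = 5
) -> float:
    """Ask the judge n_polls times and return the fraction of 'yes' votes."""
    if n_polls <= 0:
        raise ValueError("n_polls must be positive")
    votes = sum(1 for _ in range(n_polls) if judge(prompt, response))
    return votes / n_polls
```

Polling several times smooths out the variability of any single judgment, at the price of additional judge calls per evaluated response.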
By helping teams catch errors of stale knowledge, wrong knowledge, logical fallacies and mathematical errors, Galileo hopes to help organizations find the perfect LLM for their use case, move from sandbox to production and more quickly deploy reliable and trustworthy AI.
Additional Resources:
- Read the paper “ChainPoll: A high efficacy method for LLM hallucination detection”: https://arxiv.org/abs/2310.18344
- Read the Hallucination Index blog: https://www.rungalileo.io/blog/hallucination-index