Measuring AI Hallucinations

Generative AI (GenAI) holds enormous potential across a wide range of use cases in clinical development, but its tendency to hallucinate must be carefully managed to ensure safe deployment, particularly in regulated industries. Where large language models (LLMs) handle medical data or drive patient-related decisions, the consequences of AI hallucinations can be serious.

There are several ways to mitigate the impact of AI hallucinations, but before they can be mitigated, they must first be measured. In this article, we take a closer look at how AI hallucinations can be measured and benchmarked.

What is an AI hallucination?

In the context of AI, a “hallucination” refers to a phenomenon where a GenAI model generates output that is incorrect, nonsensical, or not grounded in reality in response to a user query. For example, an LLM may confidently report the weather in a city that does not exist, or provide fabricated references in an academic context.

There are many reasons an LLM may hallucinate. Most commonly, hallucinations occur because LLMs are static: their knowledge is frozen at training time, so they lack up-to-date information. They also lack domain-specific knowledge, since they do not have access to private business data.

The different types of AI hallucinations

There are three different kinds of AI hallucinations.

The first is an input-conflicting hallucination, where the LLM generates an answer that conflicts with the query the user has input. If a user asks for a dinner recipe, for example, and the LLM generates a breakfast recipe, this would be classified as an input-conflicting hallucination.

The second is a context-conflicting hallucination, in which part of the LLM’s answer does not make sense in the context of the part of the answer that preceded it. For example, if the LLM generates a chicken salad recipe in response to a request for a dinner recipe and then follows it with a tip on how to properly sear a steak, that is a context-conflicting hallucination.

The third is a fact-conflicting hallucination, in which the LLM delivers an answer that contains information that is simply not true, presented as fact. 

A GenAI output may contain more than one kind of hallucination in a single answer, but all three types have one thing in common: the hallucinated content is stated as fact, and may be taken as such without additional verification.

Measuring AI hallucinations

There are many ways to measure AI hallucinations, but two of the most effective are the FActScore method and the Med-HALT method. 

FActScore

FActScore, or fine-grained atomic evaluation of factual precision in long-form text generation, is used to analyze the accuracy of the various facts stated in an LLM output. 

The idea is that long-form text consists of many individual pieces of information, each of which can be either true or false; a single sentence may contain several. By breaking the LLM output down into these atomic units of “fact” and then checking each one for accuracy, FActScore can rate the LLM for factual precision. ChatGPT, for example, has a FActScore of 66.7%, whereas Stable LM has a FActScore of only 10%.
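
To make the idea concrete, here is a minimal sketch of the FActScore calculation in Python. The atomic facts and the “verified” knowledge source below are toy placeholders; in the actual FActScore pipeline, fact extraction and verification are performed with an LLM and a retrieval corpus.

```python
# Minimal sketch of the FActScore idea: factual precision over atomic facts.
# In the real FActScore pipeline, atomic facts are extracted with an LLM and
# verified against a retrieval corpus; here both steps are stubbed with toy data.

def factscore(atomic_facts: list[str], is_supported) -> float:
    """Fraction of atomic facts judged as supported by the knowledge source."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)

# Hypothetical atomic facts extracted from one LLM answer.
facts = [
    "Aspirin is a nonsteroidal anti-inflammatory drug.",
    "Aspirin was first synthesized in 1897.",
    "Aspirin cures bacterial infections.",   # fabricated claim
]

# Toy "knowledge source": a set of statements we treat as verified.
verified = {
    "Aspirin is a nonsteroidal anti-inflammatory drug.",
    "Aspirin was first synthesized in 1897.",
}

score = factscore(facts, lambda fact: fact in verified)
print(f"FActScore: {score:.1%}")  # 2 of 3 facts supported -> 66.7%
```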

Med-HALT

Med-HALT is a medical domain hallucination test for LLMs that applies specifically to the regulated field of life sciences. It proposes a two-tiered approach to evaluating the presence and impact of hallucinations in LLM outputs on biomedical content.

The first tier consists of three Reasoning Hallucination Tests, or RHTs: a false confidence test (FCT), a none-of-the-above test (Nota), and a fake question test.

The false confidence test presents the LLM with a multiple-choice medical question together with a randomly proposed “correct” answer, then asks the model to evaluate whether that answer is valid, explaining in detail why it is correct or incorrect and why the other options are wrong.

The none-of-the-above test presents the LLM with a multiple-choice medical question in which the correct answer has been replaced by “none of the above”. To answer correctly, the model must recognize that none of the remaining options is right and select “none of the above”.

The fake question test presents the model with a fabricated or nonsensical medical question to assess whether it can correctly identify the question as invalid and handle it appropriately.
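
The short Python sketch below illustrates how prompts for these three tests might be constructed and scored. The question, options, and prompt wording are hypothetical examples, not the exact templates or data used in the Med-HALT benchmark.

```python
# Illustrative sketch of the three reasoning hallucination tests described above.
# The question, options, and prompt wording are hypothetical examples, not the
# exact templates or data used in the Med-HALT benchmark.
import random

question = "Which electrolyte abnormality most commonly causes peaked T waves on an ECG?"
options = ["Hyperkalemia", "Hyponatremia", "Hypocalcemia", "Hypomagnesemia"]
correct = "Hyperkalemia"

# False confidence test (FCT): propose a randomly chosen option as the "correct"
# answer and ask the model to judge it, explaining why it is right or wrong and
# why the other options are wrong.
proposed = random.choice(options)
fct_prompt = (
    f"Question: {question}\nOptions: {', '.join(options)}\n"
    f"Proposed answer: {proposed}\n"
    "Is the proposed answer correct? Explain why it is correct or incorrect, "
    "and why each of the other options is wrong."
)

# None-of-the-above test (Nota): remove the true answer so that
# "None of the above" becomes the correct choice.
nota_options = [o for o in options if o != correct] + ["None of the above"]
nota_prompt = (
    f"Question: {question}\nOptions: {', '.join(nota_options)}\n"
    "Select the correct option."
)

# Fake question test: a fabricated question the model should flag as invalid
# rather than answer confidently.
fake_prompt = (
    "Question: What is the recommended adult dose of the antibiotic "
    "'zorbafloxacin' for treating quantum fever?\n"
    "Answer, or state that the question is not valid."
)

# A simple scoring rule for the Nota test: the response only counts as correct
# if it selects "None of the above".
def passes_nota(model_answer: str) -> bool:
    return "none of the above" in model_answer.lower()

print(fct_prompt, nota_prompt, fake_prompt, sep="\n\n")
print(passes_nota("The correct option is: None of the above"))  # True
```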

The second tier of the Med-HALT framework consists of memory hallucination tests, which check whether the LLM can recall accurate, factual information from biomedical sources such as PubMed.
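
As an illustration, a memory hallucination check might look something like the sketch below, which asks the model to recall the title of a PubMed article from its identifier and compares the response against the ground truth. The PMID, title, and query_model function are hypothetical placeholders rather than part of the benchmark itself.

```python
# Sketch of a memory hallucination check in the spirit of Med-HALT's second tier:
# ask the model to recall a fact tied to a biomedical source (here, the title of
# a PubMed article given its PMID) and compare it with the ground truth.
# The PMID, title, and query_model function are hypothetical placeholders.

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    return "Some confidently stated, possibly fabricated title"

def memory_test(pmid: str, true_title: str) -> bool:
    prompt = f"What is the title of the PubMed article with PMID {pmid}?"
    answer = query_model(prompt)
    # A simple exact-match check; a real evaluation would use a more forgiving
    # comparison (normalization, fuzzy matching, or a grader model).
    return answer.strip().lower() == true_title.strip().lower()

passed = memory_test(pmid="12345678", true_title="A hypothetical trial of drug X")
print("memory test passed" if passed else "possible memory hallucination")
```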

Using these tests, Med-HALT assesses the veracity of the LLM’s output. If the model gives an answer that does not align with validated, evidence-based medical knowledge, Med-HALT triggers a “halt” warning to inform the user that the generated answer could be inaccurate or unsafe.
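
As a rough illustration of how such a safeguard could be wired up, the sketch below flags an answer whenever a factuality score falls below a chosen threshold. The threshold and scoring setup are assumptions for illustration only, not part of the Med-HALT framework itself.

```python
# Hypothetical sketch of a "halt"-style safeguard: if a factuality score from a
# benchmark or checker falls below a chosen threshold, flag the answer before it
# reaches the user. The threshold and score are illustrative assumptions.

HALT_THRESHOLD = 0.8  # hypothetical minimum acceptable factuality score

def maybe_halt(answer: str, factuality_score: float) -> str:
    if factuality_score < HALT_THRESHOLD:
        return ("WARNING: this answer may be inaccurate or unsafe and should be "
                "verified by a qualified reviewer.\n\n" + answer)
    return answer

print(maybe_halt("Drug X is contraindicated in pregnancy.", factuality_score=0.55))
```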

Responsibly deploying GenAI to accelerate clinical trials

Saama’s suite of AI-powered clinical development tools leverages the power of GenAI to rapidly accelerate clinical development timelines. Along with extensive hallucination safeguards, Saama keeps the human firmly in the loop to ensure the total accuracy of GenAI outputs. 

If you’d like to find out how we’re deploying GenAI to transform clinical development processes, book a demo with a member of our team today. 
