
Enhancing Trust in Multimodal AI Predictions with Functional Sampling

TL;DR: FESTA (Functionally Equivalent Sampling for Trust Assessment) is a new method for evaluating the trustworthiness of multimodal LLM predictions. It uses “functionally equivalent” and “complementary” input samples to measure a model’s consistency and sensitivity, producing an uncertainty score that helps detect mispredictions, including confident hallucinations, in black-box and unsupervised settings. Experiments show significant improvements in selective prediction performance for both vision and audio LLMs.

Multimodal Large Language Models (MLLMs) are powerful AI systems that can process and understand information from various sources, like text, images, and audio. While they’ve shown impressive capabilities, assessing how much we can trust their predictions, especially when they make mistakes, remains a significant challenge. This is particularly true in critical applications such as finance, medicine, or autonomous driving, where an incorrect prediction can have serious consequences.

A new approach called FESTA, which stands for Functionally Equivalent Sampling for Trust Assessment, has been proposed to tackle this problem. Developed by Debarpan Bhattacharya, Apoorva Kulkarni, and Sriram Ganapathy, FESTA offers a novel way to quantify the uncertainty of MLLM predictions, even when dealing with models that are “black-box” (meaning their internal workings aren’t accessible) and without needing ground truth data (unsupervised).

The Challenge of Trusting MLLMs

Traditional methods for assessing the confidence of AI models often rely on signals such as log-probabilities or output entropy. For MLLMs, however, these signals often fall short. Log-probabilities might not be available for proprietary models, and their calibration can be disrupted during training. More importantly, MLLMs can sometimes produce “low-uncertainty hallucinations”: predictions that are confidently wrong, which makes standard entropy measures unreliable.
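To see why entropy alone is a weak trust signal, here is a minimal sketch of the standard predictive-entropy baseline (illustrative only; the answer probabilities are made up):

```python
import numpy as np

def predictive_entropy(answer_probs):
    """Shannon entropy of the model's distribution over answer options.
    Low entropy is read as high confidence, which is exactly why
    confident hallucinations slip past this check."""
    p = np.asarray(answer_probs, dtype=float)
    p = p / p.sum()  # normalize defensively
    return float(-np.sum(p * np.log(p + 1e-12)))

# A confidently wrong model still scores as "trustworthy" here:
print(predictive_entropy([0.97, 0.02, 0.01]))  # ~0.15 nats, looks certain
print(predictive_entropy([0.40, 0.30, 0.30]))  # ~1.09 nats, flagged as uncertain
```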

How FESTA Works: Consistency and Sensitivity

FESTA introduces two key concepts: Functionally Equivalent Samples (FES) and Functionally Complementary Samples (FCS). These are two different ways of probing the MLLM’s behavior to measure its consistency and sensitivity.

  • Functionally Equivalent Samples (FES): Imagine you have an image and a question about it. An FES is a slightly altered version of that input (e.g., the same image in grayscale, or the question rephrased) where the underlying task and the ideal answer should remain the same. If the MLLM gives a different answer to an FES, it indicates a lack of consistency.
  • Functionally Complementary Samples (FCS): An FCS is an altered input where the task remains the same, but the ideal answer should change. For example, if a question asks “Is the cat to the left of the car?”, an FCS might involve flipping the image horizontally or rephrasing the question to “Is the cat to the right of the car?”. If the MLLM’s prediction doesn’t change when it should, it shows a lack of sensitivity. Both sample types are illustrated in the sketch below.
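Here is a minimal sketch of what generating both sample types could look like for an image question. The transform choices (grayscale, prompt paraphrase, horizontal mirror, left/right word swap) are illustrative assumptions, not the paper’s exact perturbation set:

```python
from PIL import Image, ImageOps

def make_fes(image: Image.Image, question: str):
    """Functionally equivalent samples: perturbations that preserve both
    the task and its ideal answer."""
    return [
        (ImageOps.grayscale(image).convert("RGB"), question),  # same scene, no color
        (image, f"Please answer the following. {question}"),   # light paraphrase
    ]

def make_fcs(image: Image.Image, question: str):
    """Functionally complementary samples: perturbations that keep the task
    but flip the ideal answer (shown for a left/right spatial question)."""
    return [
        (ImageOps.mirror(image), question),          # horizontal flip inverts left/right
        (image, question.replace("left", "right")),  # naive word swap, illustration only
    ]
```

A consistent model should answer every FES the same way it answered the original input, while a sensitive model should flip its answer on every FCS.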

By generating these FES and FCS samples and observing the MLLM’s responses, FESTA calculates an uncertainty score. This score is based on the KL-divergence, which measures how much the model’s predictive distribution deviates from that of an “ideally consistent” and “ideally sensitive” model. This allows FESTA to detect both high-uncertainty errors and those tricky low-uncertainty hallucinations.
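The paper defines its estimator precisely; the sketch below is just one plausible way to turn sampled answer distributions into a KL-based uncertainty score. The “ideal” references (answer identically on FES, answer the complement on FCS) follow the intuition above, and all numbers are made up:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete answer distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def festa_style_uncertainty(orig_dist, fes_dists, fcs_dists):
    """Consistency term: FES responses should match the original prediction.
    Sensitivity term: FCS responses should match its complement.
    A larger total means a less trustworthy prediction."""
    ideal_equiv = list(orig_dist)        # ideally consistent model repeats itself on FES
    ideal_comp = list(orig_dist)[::-1]   # toy complement for a binary yes/no answer
    consistency = np.mean([kl(ideal_equiv, d) for d in fes_dists])
    sensitivity = np.mean([kl(ideal_comp, d) for d in fcs_dists])
    return float(consistency + sensitivity)

# Model answers "yes" (0.9) but wobbles on an equivalent sample and
# fails to flip on the complementary one -> a high uncertainty score.
print(festa_style_uncertainty(
    orig_dist=[0.9, 0.1],
    fes_dists=[[0.4, 0.6], [0.85, 0.15]],
    fcs_dists=[[0.8, 0.2]],
))  # ~1.43
```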

Significant Improvements in Prediction Trust

The researchers conducted extensive experiments using various off-the-shelf MLLMs on both visual and audio reasoning tasks. The results were impressive. FESTA significantly improved selective prediction performance, which is the ability of a model to abstain from answering when it’s highly uncertain, thus avoiding wrong predictions. For vision-LLMs, FESTA showed a 33.3% relative improvement, and for audio-LLMs, a 29.6% relative improvement in detecting mispredictions, based on the AUROC metric.
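Those AUROC figures come from treating the uncertainty score as a detector of wrong answers. A minimal evaluation sketch with scikit-learn, using made-up scores and labels:

```python
from sklearn.metrics import roc_auc_score

# 1 = the MLLM's answer was wrong, 0 = it was correct (toy labels)
is_misprediction = [0, 0, 1, 0, 1, 1, 0, 1]
# FESTA-style uncertainty for the same eight predictions (toy values)
uncertainty = [0.10, 0.30, 1.40, 0.20, 0.90, 1.10, 0.95, 0.60]

# AUROC: probability that a random misprediction outscores a random correct answer
print(roc_auc_score(is_misprediction, uncertainty))  # 0.875 on this toy data

# Selective prediction: abstain whenever uncertainty crosses a chosen threshold
threshold = 0.5
print(["abstain" if u > threshold else "answer" for u in uncertainty])
```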

Even for tasks where MLLMs performed poorly (e.g., audio temporal reasoning with accuracies as low as 30%), FESTA still achieved high AUROC scores (up to 0.89), demonstrating its robustness in challenging scenarios. The method is also “black-box,” meaning it only requires input-output access to the MLLM, making it applicable to a wide range of models, including proprietary ones.


Future Directions and Impact

While FESTA marks a significant step forward, the authors acknowledge areas for future work, such as extending it to open-ended text generation and to multimodal outputs beyond multiple-choice questions. The computational overhead of generating many samples is also a consideration, though the authors note that the method remains robust with fewer samples.

Overall, FESTA provides a robust and unsupervised framework for assessing the trustworthiness of MLLM predictions. By effectively identifying and abstaining from incorrect predictions, especially low-uncertainty hallucinations, it paves the way for safer and more reliable AI deployments in critical applications. For more technical details, you can refer to the full research paper: FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs.

Ananya Rao (https://blogs.edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach out to her at: [email protected]
