
Enhancing Trust in Multimodal AI Predictions with Functional Sampling

TL;DR: FESTA (Functionally Equivalent Sampling for Trust Assessment) is a new method for evaluating the trustworthiness of multimodal LLM predictions. It uses “functionally equivalent” and “complementary” input samples to measure a model’s consistency and sensitivity, producing an uncertainty score that helps detect mispredictions, including confident hallucinations, in black-box and unsupervised settings. Experiments show significant improvements in selective prediction performance for both vision and audio LLMs.

Multimodal Large Language Models (MLLMs) are powerful AI systems that can process and understand information from various sources, like text, images, and audio. While they’ve shown impressive capabilities, assessing how much we can trust their predictions, especially when they make mistakes, remains a significant challenge. This is particularly true in critical applications such as finance, medicine, or autonomous driving, where an incorrect prediction can have serious consequences.

A new approach called FESTA, which stands for Functionally Equivalent Sampling for Trust Assessment, has been proposed to tackle this problem. Developed by Debarpan Bhattacharya, Apoorva Kulkarni, and Sriram Ganapathy, FESTA offers a novel way to quantify the uncertainty of MLLM predictions, even when dealing with models that are “black-box” (meaning their internal workings aren’t accessible) and without needing ground truth data (unsupervised).

The Challenge of Trusting MLLMs

Traditional methods for assessing the confidence of AI models often rely on signals such as log-probabilities or output entropy. For MLLMs, however, these signals often fall short. Log-probabilities might not be available for proprietary models, and their calibration can be disrupted during training. More importantly, MLLMs can sometimes produce “low-uncertainty hallucinations”: predictions that are confidently wrong, which makes standard entropy measures unreliable.
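To see why entropy alone is a weak trust signal, here is a minimal sketch of the standard predictive-entropy baseline (illustrative only; the answer probabilities are made up):

```python
import numpy as np

def predictive_entropy(answer_probs):
    """Shannon entropy of the model's distribution over answer options.
    Low entropy is read as high confidence, which is exactly why
    confident hallucinations slip past this check."""
    p = np.asarray(answer_probs, dtype=float)
    p = p / p.sum()  # normalize defensively
    return float(-np.sum(p * np.log(p + 1e-12)))

# A confidently wrong model still scores as "trustworthy" here:
print(predictive_entropy([0.97, 0.02, 0.01]))  # ~0.15 nats, looks certain
print(predictive_entropy([0.40, 0.30, 0.30]))  # ~1.09 nats, flagged as uncertain
```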

How FESTA Works: Consistency and Sensitivity

FESTA introduces two key concepts: Functionally Equivalent Samples (FES) and Functionally Complementary Samples (FCS). These are two different ways of probing the MLLM’s behavior to measure its consistency and sensitivity.

  • Functionally Equivalent Samples (FES): Imagine you have an image and a question about it. An FES is a slightly altered version of that input (e.g., the same image in grayscale, or the question rephrased) where the underlying task and the ideal answer should remain the same. If the MLLM gives a different answer to an FES, it indicates a lack of consistency.
  • Functionally Complementary Samples (FCS): An FCS is an altered input where the task remains the same, but the ideal answer should change. For example, if a question asks “Is the cat to the left of the car?”, an FCS might involve flipping the image horizontally or rephrasing the question to “Is the cat to the right of the car?”. If the MLLM’s prediction doesn’t change when it should, it shows a lack of sensitivity. Both sample types are illustrated in the sketch below.
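Here is a minimal sketch of what generating both sample types could look like for an image question. The transform choices (grayscale, prompt paraphrase, horizontal mirror, left/right word swap) are illustrative assumptions, not the paper’s exact perturbation set:

```python
from PIL import Image, ImageOps

def make_fes(image: Image.Image, question: str):
    """Functionally equivalent samples: perturbations that preserve both
    the task and its ideal answer."""
    return [
        (ImageOps.grayscale(image).convert("RGB"), question),  # same scene, no color
        (image, f"Please answer the following. {question}"),   # light paraphrase
    ]

def make_fcs(image: Image.Image, question: str):
    """Functionally complementary samples: perturbations that keep the task
    but flip the ideal answer (shown for a left/right spatial question)."""
    return [
        (ImageOps.mirror(image), question),          # horizontal flip inverts left/right
        (image, question.replace("left", "right")),  # naive word swap, illustration only
    ]
```

A consistent model should answer every FES the same way it answered the original input, while a sensitive model should flip its answer on every FCS.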

By generating these FES and FCS samples and observing the MLLM’s responses, FESTA calculates an uncertainty score. This score is based on the KL-divergence, which measures how much the model’s predictive distribution deviates from that of an “ideally consistent” and “ideally sensitive” model. This allows FESTA to detect both high-uncertainty errors and those tricky low-uncertainty hallucinations.
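The paper defines its estimator precisely; the sketch below is just one plausible way to turn sampled answer distributions into a KL-based uncertainty score. The “ideal” references (answer identically on FES, answer the complement on FCS) follow the intuition above, and all numbers are made up:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete answer distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def festa_style_uncertainty(orig_dist, fes_dists, fcs_dists):
    """Consistency term: FES responses should match the original prediction.
    Sensitivity term: FCS responses should match its complement.
    A larger total means a less trustworthy prediction."""
    ideal_equiv = list(orig_dist)        # ideally consistent model repeats itself on FES
    ideal_comp = list(orig_dist)[::-1]   # toy complement for a binary yes/no answer
    consistency = np.mean([kl(ideal_equiv, d) for d in fes_dists])
    sensitivity = np.mean([kl(ideal_comp, d) for d in fcs_dists])
    return float(consistency + sensitivity)

# Model answers "yes" (0.9) but wobbles on an equivalent sample and
# fails to flip on the complementary one -> a high uncertainty score.
print(festa_style_uncertainty(
    orig_dist=[0.9, 0.1],
    fes_dists=[[0.4, 0.6], [0.85, 0.15]],
    fcs_dists=[[0.8, 0.2]],
))  # ~1.43
```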

Significant Improvements in Prediction Trust

The researchers conducted extensive experiments using various off-the-shelf MLLMs on both visual and audio reasoning tasks. The results were impressive. FESTA significantly improved selective prediction performance, which is the ability of a model to abstain from answering when it’s highly uncertain, thus avoiding wrong predictions. For vision-LLMs, FESTA showed a 33.3% relative improvement, and for audio-LLMs, a 29.6% relative improvement in detecting mispredictions, based on the AUROC metric.
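Those AUROC figures come from treating the uncertainty score as a detector of wrong answers. A minimal evaluation sketch with scikit-learn, using made-up scores and labels:

```python
from sklearn.metrics import roc_auc_score

# 1 = the MLLM's answer was wrong, 0 = it was correct (toy labels)
is_misprediction = [0, 0, 1, 0, 1, 1, 0, 1]
# FESTA-style uncertainty for the same eight predictions (toy values)
uncertainty = [0.10, 0.30, 1.40, 0.20, 0.90, 1.10, 0.95, 0.60]

# AUROC: probability that a random misprediction outscores a random correct answer
print(roc_auc_score(is_misprediction, uncertainty))  # 0.875 on this toy data

# Selective prediction: abstain whenever uncertainty crosses a chosen threshold
threshold = 0.5
print(["abstain" if u > threshold else "answer" for u in uncertainty])
```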

Even for tasks where MLLMs performed poorly (e.g., audio temporal reasoning with accuracies as low as 30%), FESTA still achieved high AUROC scores (up to 0.89), demonstrating its robustness in challenging scenarios. The method is also “black-box,” meaning it only requires input-output access to the MLLM, making it applicable to a wide range of models, including proprietary ones.


Future Directions and Impact

While FESTA marks a significant step forward, the authors acknowledge areas for future work, such as extending it to open-ended text generation and to multimodal outputs beyond multiple-choice questions. The computational overhead of generating many samples is also a consideration, though the authors note that the method remains robust with fewer samples.

Overall, FESTA provides a robust and unsupervised framework for assessing the trustworthiness of MLLM predictions. By effectively identifying and abstaining from incorrect predictions, especially low-uncertainty hallucinations, it paves the way for safer and more reliable AI deployments in critical applications. For more technical details, you can refer to the full research paper: FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs.

Ananya Rao (https://blogs.edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach out to her at: [email protected]
