TLDR: The paper introduces the Bayesian Network LLM Fusion (BNLF) framework, which combines predictions from multiple large language models (FinBERT, RoBERTa, BERTweet) using a Bayesian Network for financial sentiment analysis. BNLF addresses challenges like LLM transparency, cost, and inconsistency by providing an interpretable, lightweight, and robust solution. It achieves significant accuracy improvements (up to 6%) across diverse financial datasets and offers insights into how different LLMs and data contexts influence sentiment predictions through probabilistic reasoning.
Large Language Models (LLMs) have revolutionized many areas, including sentiment analysis, which involves identifying and interpreting opinions in text. However, these powerful models come with their own set of challenges. They can often be opaque, making it difficult to understand how they arrive at their conclusions. Fine-tuning them for specific tasks can be costly and computationally intensive, and their performance can be inconsistent across different domains. To address these issues, researchers have proposed a novel framework called the Bayesian Network LLM Fusion (BNLF).
The BNLF framework offers a sophisticated approach to integrating predictions from multiple LLMs for sentiment analysis, particularly in the complex financial domain. Instead of relying on a single LLM, BNLF combines the strengths of three distinct models: FinBERT, RoBERTa, and BERTweet. This fusion is achieved through a probabilistic mechanism known as a Bayesian Network.
Understanding the BNLF Framework
At its core, BNLF operates as a late fusion strategy. This means it takes the individual sentiment predictions from each LLM and then combines them using a Bayesian Network. A Bayesian Network is a type of probabilistic graphical model that excels at representing systems with uncertainty and interdependence. Unlike simpler methods that might just average predictions or use majority voting, a Bayesian Network explicitly models the probabilistic relationships between the LLM predictions and the final sentiment outcome. This provides a more principled and interpretable way to fuse information.
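To make the contrast with majority voting concrete, here is a minimal sketch in plain Python. It is not the paper's network: it assumes a simple naive-Bayes-style structure in which each model's prediction depends only on the true sentiment, and the reliability numbers, the `make_table` helper, and the function names are all hypothetical, chosen only for illustration.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")

def majority_vote(preds):
    """Return the most common label; ties go to the first label seen."""
    return Counter(preds).most_common(1)[0][0]

def bayes_fuse(preds, prior, reliability):
    """Posterior over the true label given one prediction per model.

    reliability[m][true][pred] is P(model m outputs pred | true label).
    Assumes the models are conditionally independent given the true label
    (an illustrative simplification, not the paper's structure).
    """
    scores = {}
    for true in LABELS:
        score = prior[true]
        for model, pred in preds.items():
            score *= reliability[model][true][pred]
        scores[true] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def make_table(p_correct):
    """Toy reliability table: correct with p_correct, errors spread evenly."""
    p_wrong = (1.0 - p_correct) / (len(LABELS) - 1)
    return {
        true: {pred: (p_correct if pred == true else p_wrong) for pred in LABELS}
        for true in LABELS
    }

# Hypothetical reliabilities, invented for this example only.
reliability = {
    "finbert": make_table(0.90),
    "roberta": make_table(0.60),
    "bertweet": make_table(0.60),
}
prior = {label: 1.0 / 3.0 for label in LABELS}

preds = {"finbert": "negative", "roberta": "neutral", "bertweet": "neutral"}
posterior = bayes_fuse(preds, prior, reliability)
vote = majority_vote(list(preds.values()))
fused = max(posterior, key=posterior.get)
print(vote, fused, posterior)
```

With these toy numbers, the vote follows the two less reliable models, while the probabilistic fusion sides with the single more reliable one and also reports how confident it is. That difference in behavior is the point of modeling the relationships explicitly rather than counting votes.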
The framework works in four main steps: First, input texts are drawn from various sources, including formal financial documents and informal social media content. Second, these texts are processed by the three chosen LLMs (FinBERT, RoBERTa, and BERTweet), each generating its own sentiment prediction. FinBERT is specialized for financial language, RoBERTa is a strong general-purpose model, and BERTweet is trained on Twitter data, making it adept at handling informal social media language. These models are chosen for their complementary coverage and efficiency, being medium-sized and practical for deployment without extensive GPU resources. Third, these individual predictions are fed into the Bayesian Network, which performs probabilistic inference. Finally, the network outputs a posterior sentiment distribution, which is then mapped to a discrete sentiment label (negative, neutral, or positive).
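The third and fourth steps can be sketched as follows. This is an illustrative simplification, assuming the fusion node is a single conditional probability table P(sentiment | FinBERT, RoBERTa, BERTweet) estimated from counts with Laplace smoothing. The function names `fit_cpt` and `fuse` and the training rows are hypothetical, and the first two steps (collecting texts and running the three models) are stubbed out with hand-written predictions.

```python
from collections import defaultdict

LABELS = ("negative", "neutral", "positive")

def fit_cpt(rows, alpha=1.0):
    """Estimate P(sentiment | model predictions) from labeled rows.

    rows: iterable of ((finbert, roberta, bertweet), gold_label).
    alpha: Laplace smoothing so unseen combinations stay well-defined.
    Returns a function mapping a prediction tuple to a distribution.
    """
    counts = defaultdict(lambda: dict.fromkeys(LABELS, 0))
    for preds, gold in rows:
        counts[preds][gold] += 1

    def posterior(preds):
        c = counts.get(preds, dict.fromkeys(LABELS, 0))
        total = sum(c.values()) + alpha * len(LABELS)
        return {label: (c[label] + alpha) / total for label in LABELS}

    return posterior

def fuse(posterior_fn, preds):
    """Step 3: probabilistic inference. Step 4: map posterior to a label."""
    dist = posterior_fn(preds)
    return max(dist, key=dist.get), dist

# Steps 1-2 stubbed: toy (predictions, gold label) rows, invented for illustration.
train = [
    (("negative", "negative", "neutral"), "negative"),
    (("negative", "negative", "neutral"), "negative"),
    (("negative", "negative", "neutral"), "neutral"),
    (("positive", "neutral", "positive"), "positive"),
]
posterior_fn = fit_cpt(train)
label, dist = fuse(posterior_fn, ("negative", "negative", "neutral"))
print(label, dist)
```

A real Bayesian Network would factor this table across several nodes and edges rather than keep one large table, but the flow is the same: categorical predictions go in, a posterior distribution comes out, and the most probable class becomes the final label.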
Key Advantages and Performance
The BNLF framework was rigorously evaluated across three diverse, human-annotated financial datasets: Financial PhraseBank (news-based), Twitter Financial News Sentiment (TFNS, tweets), and FIQA (financial question-answering). The results demonstrated significant improvements. BNLF achieved an accuracy of 78.6% on the combined test set, outperforming a strong external baseline (DistilRoBERTa) by approximately 5.3%. It also showed consistent gains in macro- and weighted-F1 scores, indicating balanced performance across different sentiment classes.
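For readers unfamiliar with the metrics, the sketch below computes accuracy, macro-F1, and weighted-F1 from scratch. The function name `f1_report` and the toy labels are hypothetical and unrelated to the paper's data. Macro-F1 averages per-class F1 equally, so a poorly handled minority class pulls it down, while weighted-F1 weights each class by its number of true examples.

```python
def f1_report(y_true, y_pred, labels):
    """Return accuracy, macro-F1, and weighted-F1 for single-label predictions."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        per_class[label] = (f1, tp + fn)  # (F1, support)

    macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(labels)
    weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / n
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}

# Toy labels, invented for illustration.
labels = ("negative", "neutral", "positive")
y_true = ["negative", "negative", "neutral", "neutral", "neutral", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "neutral", "neutral"]
report = f1_report(y_true, y_pred, labels)
print(report)
```

In this toy case the classifier never predicts "positive", so that class scores an F1 of zero. Macro-F1 drops well below accuracy as a result, which is why gains on both macro- and weighted-F1 suggest balanced behavior across classes.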
One of the most compelling aspects of BNLF is its enhanced interpretability and ability to perform causal reasoning. Through inference analysis, the researchers showed how the framework adjusts its sentiment predictions based on the type of corpus, even when the individual LLMs produce identical predictions. For instance, with all three LLMs predicting ‘negative’, BNLF’s certainty and the balance between sentiment classes varied considerably depending on whether the text came from Financial PhraseBank or TFNS. Similarly, when the LLMs disagreed, BNLF’s output shifted significantly with the corpus type, highlighting its ability to resolve conflicting evidence in a context-aware manner.
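The kind of query described here can be sketched with a toy table. The probabilities below are invented for illustration and are not the paper's results, and the corpus names "news" and "tweets" are placeholders. The sketch conditions on the same model evidence under two corpus types and uses Shannon entropy as one possible measure of how certain the fused prediction is.

```python
import math

# Toy table: P(sentiment | corpus, all three models predict "negative").
# These numbers are made up for illustration; they are NOT from the paper.
CPT_ALL_NEGATIVE = {
    "news": {"negative": 0.90, "neutral": 0.08, "positive": 0.02},
    "tweets": {"negative": 0.70, "neutral": 0.25, "positive": 0.05},
}

def entropy_bits(dist):
    """Shannon entropy in bits; lower means a more certain prediction."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def query(corpus):
    """Return (label, posterior, entropy) for the fixed evidence and a corpus."""
    dist = CPT_ALL_NEGATIVE[corpus]
    label = max(dist, key=dist.get)
    return label, dist, entropy_bits(dist)

news_label, news_dist, news_entropy = query("news")
tweets_label, tweets_dist, tweets_entropy = query("tweets")
print(news_label, round(news_entropy, 3))
print(tweets_label, round(tweets_entropy, 3))
```

In this toy setup both queries return the same label, but the posterior for one corpus is flatter and therefore has higher entropy. That is the sense in which the same model evidence can yield different levels of certainty in different contexts.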
Furthermore, an influence strength analysis revealed that FinBERT and RoBERTa had the strongest direct influence on BNLF’s final predictions, with BERTweet providing complementary signals. The corpus type also played a significant role, influencing the LLMs and indirectly affecting the BNLF’s output. This level of transparency helps users understand which models and contextual factors are most plausibly contributing to a given sentiment outcome, a critical feature for trustworthy AI systems.
In conclusion, the Bayesian Network LLM Fusion framework addresses critical challenges in applying LLMs to financial sentiment analysis. It provides a robust, interpretable, and scalable solution that leverages the complementary strengths of multiple LLMs through probabilistic reasoning. This approach not only enhances predictive performance but also offers a clearer understanding of the decision-making process, moving towards more transparent and explainable AI systems. For more details, you can refer to the full research paper.


