Enhancing LLM Parallel Generations Through Interdependent Information Sharing

TLDR: Bridge is a new method that allows Large Language Models (LLMs) to generate multiple responses in parallel while sharing information between them. By adding small “Bridge blocks” to the LLM architecture, it enables interdependent generations, significantly improving accuracy and consistency in tasks like math reasoning, with minimal additional parameters and full flexibility in generation width.

Large Language Models (LLMs) have made incredible strides in performance, especially when it comes to complex tasks. A common approach to improve LLM outputs is to scale inference-time compute, which often involves generating multiple responses for a single input prompt. However, a significant limitation of this method has been that these parallel responses are typically generated independently, meaning they don’t share any information with each other. This leaves a lot of potentially useful insights untapped, limiting the overall quality of the generated response set.

Researchers from Meta, Carnegie Mellon University, and Yale University have introduced a novel approach called “Bridge” that aims to overcome this challenge. Bridge allows LLMs to generate interdependent responses in parallel, fundamentally rethinking how batched LLM hidden states are processed. Instead of treating them as independent slices, Bridge views them as holistic tensors, enabling information to flow between different generation sequences.

How Bridge Works

At its core, Bridge introduces small architectural additions called “Bridge blocks” into existing LLMs. These blocks are placed after each feedforward block and add only a small number of new parameters (2.8% to 5.1% of the original model’s parameter count). The key step happens inside these blocks: they perform attention operations not just within a single sequence, but across the different sequences that originate from the same input prompt, at each step of the generation process. This is similar to axial attention, in which attention is applied along one axis of a multi-dimensional tensor at a time; here, the extra axis is the set of parallel generations.

Crucially, Bridge maintains full generation parallelism. This means that while the responses are now interdependent and share information, the actual generation of tokens for each response still happens simultaneously. The design is also flexible, allowing for any number of parallel generations at test-time without needing to be retrained for different “widths” of parallelism.
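To make the idea in this section concrete, here is a minimal sketch of attention across parallel generations, written in plain Python. It is not the paper's implementation: the function name `cross_sequence_attention`, the use of identity query/key/value projections instead of learned ones, and the residual addition are all assumptions made for illustration. What it does show is the core pattern: at each token position, every sequence attends over the states of all sequences sharing the same prompt, and the same code runs for any width.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_sequence_attention(hidden):
    """Mix information across parallel generations (illustrative only).

    hidden[w][t] is the D-dimensional hidden state of sequence w at
    position t. All sequences are assumed to come from the same prompt
    and to have the same length. Returns a tensor of the same shape.
    """
    width = len(hidden)
    length = len(hidden[0])
    dim = len(hidden[0][0])
    scale = 1.0 / math.sqrt(dim)

    out = [[None] * length for _ in range(width)]
    for t in range(length):
        # One state per parallel sequence at this position.
        column = [hidden[w][t] for w in range(width)]
        for w in range(width):
            query = column[w]
            # Assumption: identity projections, so keys and values are the states.
            scores = [scale * sum(q * k for q, k in zip(query, key)) for key in column]
            weights = softmax(scores)
            mixed = [
                sum(wt * value[d] for wt, value in zip(weights, column))
                for d in range(dim)
            ]
            # Assumption: residual connection around the attention output.
            out[w][t] = [h + m for h, m in zip(query, mixed)]
    return out


if __name__ == "__main__":
    # The same function handles any generation width without changes.
    for width in (1, 2, 5):
        states = [[[0.1 * (w + 1), 0.2 * (t + 1)] for t in range(3)] for w in range(width)]
        mixed_states = cross_sequence_attention(states)
        print(width, len(mixed_states), len(mixed_states[0]), len(mixed_states[0][0]))
```

Because the attention is computed over whatever set of sequences is present at a given position, nothing in the sketch is tied to a fixed width, which is one way to read the flexibility described above. Each position is also processed without looking at later positions, so all sequences can still advance one token at a time in lockstep.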

Significant Performance Gains

The impact of Bridge on LLM performance is substantial. The research demonstrates that Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards (RLVR) by up to 50%. In other words, when LLMs are fine-tuned against rewards that can be checked automatically, such as whether a final math answer is correct, Bridge helps them get more out of that training. It also boosts the consistency of correct responses, leading to higher-quality and more reliable output sets.

For example, experiments on various math reasoning benchmarks showed that a DeepSeek-R1-Distill-Qwen-7B model equipped with Bridge blocks achieved a 50% further improvement with RLVR compared to the next best method. Furthermore, the rate at which all responses to a single competition math problem were correct increased from 15.3% to 18.1% with Bridge.
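The two kinds of numbers quoted here, mean accuracy and the rate at which every response to a problem is correct, can be computed from a simple grid of pass/fail results. The sketch below is illustrative only: the function names and the toy grid are made up, and only the 15.3% and 18.1% figures come from the article.

```python
def mean_accuracy(correct):
    """Fraction of all responses that are correct.

    correct[p][k] is True when response k to problem p is right.
    """
    total = sum(len(row) for row in correct)
    return sum(sum(row) for row in correct) / total


def all_correct_rate(correct):
    """Fraction of problems for which every sampled response is right."""
    return sum(all(row) for row in correct) / len(correct)


def relative_gain(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


if __name__ == "__main__":
    # Hypothetical results: 4 problems, 4 parallel responses each.
    toy = [
        [True, True, True, True],
        [True, False, True, True],
        [False, False, False, False],
        [True, True, True, True],
    ]
    print(mean_accuracy(toy))     # 11 of 16 responses correct
    print(all_correct_rate(toy))  # 2 of 4 problems fully correct

    # Figures quoted in the article: 15.3% -> 18.1% all-correct rate.
    print(round(100 * relative_gain(15.3, 18.1), 1))
```

Applied to the quoted figures, the move from 15.3% to 18.1% is a gain of 2.8 percentage points, or roughly 18% in relative terms.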

Bridge also proves to be highly versatile. Once trained, it scales seamlessly to any generation width, consistently outperforming independent generation methods. It’s robust to discrepancies between the width used during training and testing. The method also generalizes well to longer generation lengths, showing improved accuracy and consistency even when extrapolating beyond its training length.

This innovative approach unlocks a more general mode of parallel scaling, effectively leveraging information between sequences. It is also compatible with any post-generation aggregation technique, meaning it can be combined with other methods that synthesize multiple responses into a final output.
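As one example of a post-generation aggregation step, majority voting over final answers picks the answer that appears most often in the response set. The article does not say which aggregators were used, so treat this as a generic sketch; the function name `majority_vote` and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, ignoring responses with no answer.

    `answers` holds one extracted final answer per parallel response,
    with None where no answer could be extracted. Ties go to the answer
    that appeared first. Returns None if nothing could be extracted.
    """
    usable = [a for a in answers if a is not None]
    if not usable:
        return None
    winner, _count = Counter(usable).most_common(1)[0]
    return winner


if __name__ == "__main__":
    # Hypothetical final answers from five parallel generations.
    print(majority_vote(["42", "41", "42", None, "7"]))
```

An aggregator like this only looks at the finished responses, so it can sit on top of either independent or interdependent generation without changes.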

The full details of this research can be found in the paper: Generalized Parallel Scaling with Interdependent Generations.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media.
