Structured Prediction: How Combining LLMs with Inference Boosts Accuracy

TLDR: This research paper introduces a framework that combines large language models (LLMs) with combinatorial inference to improve structured prediction tasks. It systematically explores various prompting strategies for estimating LLM confidence and different fine-tuning methods, demonstrating that adding symbolic inference consistently enhances prediction accuracy and structural consistency. The study finds that true/false prompting is effective for confidence estimation and that structured fine-tuning significantly boosts performance on challenging tasks like morality framing and coreference resolution.

Large Language Models (LLMs) have transformed how we approach many language tasks, offering impressive capabilities without task-specific fine-tuning for every new problem. However, these models often struggle with factual accuracy (hallucinations) and with complex reasoning, largely because they are designed to predict the next word in a sequence.

A new research paper, “Mapping the Course for Prompt-based Structured Prediction”, by Matt Pauk and Maria Leonor Pacheco from the University of Colorado Boulder, proposes an innovative approach to tackle these limitations, especially in the realm of structured prediction. Structured prediction involves tasks where the output isn’t just a single word or phrase, but a complex object with multiple interconnected components, like parsing a sentence into its grammatical structure or identifying relationships between entities in a text. The core idea is to combine the predictive strength of LLMs with the logical consistency provided by combinatorial inference methods.

The Challenge of Structured Prediction with LLMs

Traditionally, LLMs have been applied to structured prediction by either treating each component of the structure as a separate prediction task or by asking the model to generate the entire structure as a single sequence of text. The problem with these methods is that LLMs, by themselves, don’t have a built-in mechanism to guarantee that the generated structure is logically valid or consistent. For example, in coreference resolution (identifying when different mentions refer to the same entity), if an LLM says A refers to B, and B refers to C, it might not automatically ensure that A also refers to C, leading to inconsistencies.
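The coreference example above can be made concrete with a small check. This is a minimal sketch, not code from the paper: the function name `transitivity_violations` and the toy mentions are hypothetical, and pairwise predictions are assumed to arrive as a list of linked mention pairs.

```python
from itertools import permutations

def transitivity_violations(mentions, links):
    """Return triples (a, b, c) where a~b and b~c are predicted but a~c is not."""
    linked = {frozenset(pair) for pair in links}
    violations = []
    for a, b, c in permutations(mentions, 3):
        # a < c avoids reporting the same violation twice in mirrored order.
        if (a < c
                and frozenset((a, b)) in linked
                and frozenset((b, c)) in linked
                and frozenset((a, c)) not in linked):
            violations.append((a, b, c))
    return violations

# Hypothetical independent pairwise predictions from an LLM.
mentions = ["A", "B", "C"]
pairwise_predictions = [("A", "B"), ("B", "C")]
violations = transitivity_violations(mentions, pairwise_predictions)
```

Here the two predicted links imply that A and C corefer, but no such link was predicted, so the triple is flagged as inconsistent.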

Previous work has shown the potential of combining LLMs with inference algorithms to enforce structural dependencies. However, a crucial missing piece was understanding how to reliably extract “confidence” scores from LLMs that could be used by these inference algorithms. Unlike traditional classifiers that are trained to output probabilities, LLMs provide probabilities for individual tokens, making it difficult to gauge their certainty about a complex prediction.

A New Framework for Consistent Predictions

The researchers introduce a general framework that uses LLMs to score potential sub-structures within a larger prediction problem. These scores are then fed into a combinatorial inference process, specifically Integer Linear Programming (ILP). ILP is an optimization technique that finds the highest-scoring overall solution while strictly satisfying a set of predefined structural rules or constraints. This ensures that the final output is not only scored by the LLM but also structurally valid.
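To illustrate the idea, here is a minimal sketch of constrained inference over coreference links. It is an assumption-laden stand-in: instead of calling an ILP solver, it enumerates every 0/1 assignment and keeps the best one that satisfies transitivity, which solves the same objective but is only feasible for tiny inputs. The confidence values, the log-odds weighting, and the function names are hypothetical and not taken from the paper.

```python
from itertools import combinations, product
from math import log

def is_transitive(mentions, x):
    """A triple is inconsistent when exactly two of its three links are on."""
    for a, b, c in combinations(mentions, 3):
        if x[(a, b)] + x[(b, c)] + x[(a, c)] == 2:
            return False
    return True

def best_consistent_links(mentions, prob):
    """Exhaustively search link assignments and keep the best transitive one."""
    pairs = list(combinations(mentions, 2))
    # Log-odds turn a confidence above 0.5 into a reward and below 0.5 into a penalty.
    weight = {p: log(prob[p] / (1.0 - prob[p])) for p in pairs}
    best_score, best = float("-inf"), None
    for bits in product((0, 1), repeat=len(pairs)):
        x = dict(zip(pairs, bits))
        if not is_transitive(mentions, x):
            continue  # skip structures that break the constraint
        score = sum(weight[p] * x[p] for p in pairs)
        if score > best_score:
            best_score, best = score, x
    return best

# Hypothetical LLM confidences that each pair of mentions corefers.
mentions = ["A", "B", "C"]
confidence = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.3}
solution = best_consistent_links(mentions, confidence)
```

Thresholding each confidence at 0.5 independently would link A-B and B-C but not A-C, which is inconsistent. Under the constraint, the strong evidence for the first two links outweighs the weak evidence against the third, so the search links all three mentions.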

Strategies for Estimating LLM Confidence

A significant part of the study focused on how to best estimate confidence values from LLMs for use with the inference engine. The paper categorizes these methods into two types:

White-Box Methods: These require access to the internal token probabilities of the LLM.
- True/False Token Prediction: The task is framed as a true/false question, and confidence is based on the probability of the LLM generating “true.”
- Multiple Choice: The problem is presented as a multiple-choice question, and confidence is derived from the probability of generating the token corresponding to the chosen option.
- Generative Classification: Instead of predicting a label, the LLM is prompted with a label and asked to generate the input text, with confidence based on the token probabilities the model assigns to that text.

Black-Box Methods: These work even when internal token probabilities are not accessible, relying only on the LLM’s plain-text generations.
- Generation Sampling: The model is prompted multiple times, and consistency across its different generations is used as a proxy for confidence.
- Verbalized Confidence: The LLM is directly asked to state its confidence level (e.g., on a scale of 0-100) for a given answer.
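Three of the strategies above can be sketched as small post-processing functions. This is a hedged illustration rather than the paper's implementation: the function names are hypothetical, the inputs (token log-probabilities, sampled answers, a free-text reply) are assumed to have been obtained from the model already, and renormalizing over only the "true" and "false" tokens is an assumption made here for simplicity.

```python
import math
import re
from collections import Counter

def true_false_confidence(logprob_true, logprob_false):
    """White-box: probability of 'true', renormalized over the two answer tokens."""
    p_true = math.exp(logprob_true)
    p_false = math.exp(logprob_false)
    return p_true / (p_true + p_false)

def sampling_confidence(samples):
    """Black-box: majority answer and the fraction of samples that agree with it."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)

def verbalized_confidence(reply):
    """Black-box: read the first number in the reply as a 0-100 confidence."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None  # the model did not state a number
    return min(max(float(match.group()) / 100.0, 0.0), 1.0)

# Hypothetical model outputs, used only to exercise the functions.
tf_conf = true_false_confidence(math.log(0.6), math.log(0.2))
vote_answer, vote_conf = sampling_confidence(["True", "true", "False", "true"])
stated_conf = verbalized_confidence("Confidence: 85")
```

Each function returns a value in [0, 1], so any of them could supply the per-decision scores that the inference step consumes.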

Learning to Improve Structured Prediction

Beyond just prompting, the paper also explores several learning strategies to further align LLMs with structured prediction objectives:

Few-Shot Score Calibration: A small logistic regression layer is trained on top of the LLM’s confidence scores. This can be done locally (for individual sub-problems) or globally (using a structured hinge loss that considers the entire structure).
Supervised Fine-tuning: The LLM itself is fine-tuned using standard next-token prediction objectives on the specific tasks.
Structured Fine-tuning: This advanced method backpropagates the structured hinge loss directly into the LLM, allowing the model to learn from the global structural constraints.
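The two training signals mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the local calibrator is a one-feature logistic regression fitted by plain gradient descent, and the structured hinge loss uses the common margin-rescaled form with a Hamming cost, which may differ from the exact formulation the authors use. All names, scores, and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_calibrator(raw_scores, labels, lr=0.5, steps=2000):
    """Local calibration: fit p = sigmoid(w * raw + b) on a few labeled examples."""
    w, b = 1.0, 0.0
    n = len(raw_scores)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for s, y in zip(raw_scores, labels):
            err = sigmoid(w * s + b) - y  # gradient of the log loss w.r.t. the logit
            grad_w += err * s
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def structure_score(assignment, scores):
    """Score of a structure: sum of the scores of the parts it switches on."""
    return sum(scores[part] for part, on in assignment.items() if on)

def hamming(gold, other):
    """Number of parts on which two structures disagree."""
    return sum(gold[p] != other[p] for p in gold)

def structured_hinge_loss(gold, candidates, scores):
    """Margin-rescaled hinge: best cost-augmented structure minus the gold score."""
    augmented = max(structure_score(y, scores) + hamming(gold, y) for y in candidates)
    return max(0.0, augmented - structure_score(gold, scores))

# Hypothetical few-shot calibration data: raw LLM confidences and gold labels.
w, b = fit_calibrator([0.9, 0.8, 0.7, 0.6, 0.4, 0.3], [1, 1, 0, 1, 0, 0])

# Hypothetical link scores and the five transitive structures over A, B, C.
scores = {"AB": 2.0, "AC": -1.0, "BC": 0.5}
candidates = [
    {"AB": 0, "AC": 0, "BC": 0},
    {"AB": 1, "AC": 0, "BC": 0},
    {"AB": 0, "AC": 1, "BC": 0},
    {"AB": 0, "AC": 0, "BC": 1},
    {"AB": 1, "AC": 1, "BC": 1},
]
gold = {"AB": 1, "AC": 0, "BC": 0}
loss = structured_hinge_loss(gold, candidates, scores)
```

In the local setting only `w` and `b` are updated. In global calibration or structured fine-tuning, the gradient of a loss like `structured_hinge_loss` would instead flow back into the calibrator or the LLM that produced the scores, so the model is penalized whenever a wrong but valid structure scores too close to the gold one.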

Key Findings and Impact

The framework was evaluated on two challenging discourse-level tasks: morality framing in political tweets and coreference resolution. Three findings stand out:

Inference Helps: Across all confidence estimation strategies, adding combinatorial inference consistently improved performance compared to using LLM prompting alone. This highlights the value of enforcing structural consistency.
Best Confidence Estimation: Formulating prompts as true/false questions and using the probability of generating the “true” token proved to be the most effective method for extracting confidence scores.
Structured Learning Boosts Performance: Fine-tuning LLMs using structured prediction objectives, especially the structured fine-tuning method, led to significant performance gains. This approach even surpassed previous state-of-the-art methods that relied on classical deep structured prediction.

In conclusion, this research shows that combining the generative capabilities of LLMs with combinatorial inference and targeted structured learning can yield more accurate and more consistent predictions on complex language tasks. The work points toward more reliable applications of LLMs in structured prediction by addressing some of their inherent limitations.
