TLDR: This research paper introduces a framework that combines large language models (LLMs) with combinatorial inference to improve structured prediction tasks. It systematically explores various prompting strategies for estimating LLM confidence and different fine-tuning methods, demonstrating that adding symbolic inference consistently enhances prediction accuracy and structural consistency. The study finds that true/false prompting is effective for confidence estimation and that structured fine-tuning significantly boosts performance on challenging tasks like morality framing and coreference resolution.
Large Language Models (LLMs) have transformed how we approach many language tasks, offering impressive capabilities without task-specific fine-tuning for every new problem. However, these models often produce non-factual content (hallucinations) and struggle with complex reasoning, largely because they are trained to predict the next token in a sequence.
A new research paper, “Mapping the Course for Prompt-based Structured Prediction”, by Matt Pauk and Maria Leonor Pacheco from the University of Colorado Boulder, proposes an innovative approach to tackle these limitations, especially in the realm of structured prediction. Structured prediction involves tasks where the output isn’t just a single word or phrase, but a complex object with multiple interconnected components, like parsing a sentence into its grammatical structure or identifying relationships between entities in a text. The core idea is to combine the predictive strength of LLMs with the logical consistency provided by combinatorial inference methods.
The Challenge of Structured Prediction with LLMs
Traditionally, LLMs have been applied to structured prediction by either treating each component of the structure as a separate prediction task or by asking the model to generate the entire structure as a single sequence of text. The problem with these methods is that LLMs, by themselves, don’t have a built-in mechanism to guarantee that the generated structure is logically valid or consistent. For example, in coreference resolution (identifying when different mentions refer to the same entity), if an LLM says A refers to B, and B refers to C, it might not automatically ensure that A also refers to C, leading to inconsistencies.
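To make the inconsistency concrete, here is a minimal sketch that scans independent pairwise coreference decisions for broken transitivity. The function name, the mention labels, and the set-of-pairs representation are illustrative assumptions, not something taken from the paper.

```python
from itertools import permutations

def transitivity_violations(mentions, coref):
    """Return triples (a, b, c) where a~b and b~c are predicted but a~c is not.

    `coref` is a set of frozenset pairs, one per predicted coreference link.
    """
    def linked(x, y):
        return frozenset((x, y)) in coref

    return [
        (a, b, c)
        for a, b, c in permutations(mentions, 3)
        # a < c reports each violation once instead of once per direction
        if a < c and linked(a, b) and linked(b, c) and not linked(a, c)
    ]

mentions = ["A", "B", "C"]
# Hypothetical pairwise decisions made independently: A-B and B-C linked, A-C not.
coref = {frozenset(("A", "B")), frozenset(("B", "C"))}
violations = transitivity_violations(mentions, coref)
print(violations)
```

Any triple returned here is a structure that no valid clustering of mentions could produce, which is exactly the kind of output a pairwise prompting approach cannot rule out on its own.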
Previous work has shown the potential of combining LLMs with inference algorithms to enforce structural dependencies. However, a crucial missing piece was understanding how to reliably extract “confidence” scores from LLMs that could be used by these inference algorithms. Unlike traditional classifiers that are trained to output probabilities, LLMs provide probabilities for individual tokens, making it difficult to gauge their certainty about a complex prediction.
A New Framework for Consistent Predictions
The researchers introduce a general framework that uses LLMs to score potential sub-structures within a larger prediction problem. These scores are then fed into a combinatorial inference process, specifically Integer Linear Programming (ILP). ILP is a powerful mathematical tool that can find the best overall solution while strictly adhering to a set of predefined structural rules or constraints. This ensures that the final output is not only predicted by the LLM but also logically sound.
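The following sketch illustrates the idea on a three-mention coreference toy problem. It is not the paper's implementation: exhaustive enumeration stands in for a real ILP solver, and the link probabilities, the linear objective, and all names are assumptions chosen for illustration. The constraint is the standard transitivity inequality y_ab + y_bc - y_ac <= 1.

```python
from itertools import combinations, permutations, product

def consistent_inference(mentions, link_prob):
    """Pick the 0/1 link assignment with the highest total score among
    assignments that satisfy transitivity (brute force, tiny inputs only)."""
    pairs = list(combinations(mentions, 2))
    best_y, best_value = None, float("-inf")
    for bits in product((0, 1), repeat=len(pairs)):
        y = dict(zip(pairs, bits))

        def link(a, b):
            return y[(a, b)] if (a, b) in y else y[(b, a)]

        # Transitivity constraint for every ordered triple of distinct mentions.
        feasible = all(
            link(a, b) + link(b, c) - link(a, c) <= 1
            for a, b, c in permutations(mentions, 3)
        )
        if not feasible:
            continue
        # Linear objective: reward p for a link, 1 - p for a non-link.
        value = sum(link_prob[p] if y[p] else 1.0 - link_prob[p] for p in pairs)
        if value > best_value:
            best_y, best_value = y, value
    return best_y, best_value

mentions = ["A", "B", "C"]
# Hypothetical LLM confidence that each pair of mentions corefers.
link_prob = {("A", "B"): 0.9, ("A", "C"): 0.3, ("B", "C"): 0.8}

local = {p: int(prob > 0.5) for p, prob in link_prob.items()}  # independent thresholding
best, value = consistent_inference(mentions, link_prob)
print("local:", local)
print("with inference:", best)
```

Thresholding each pair independently links A-B and B-C but not A-C, which violates transitivity. The constrained search instead links all three mentions, because that is the highest-scoring assignment that is still a valid clustering. A production system would hand the same objective and constraints to an ILP solver rather than enumerate assignments.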
Strategies for Estimating LLM Confidence
A significant part of the study focused on how to best estimate confidence values from LLMs for use with the inference engine. The paper categorizes these methods into two types:
- White-Box Methods: These require access to the internal token probabilities of the LLM.
  - True/False Token Prediction: The task is framed as a true/false question, and confidence is based on the probability of the LLM generating “true.”
  - Multiple Choice: The problem is presented as a multiple-choice question, and confidence is derived from the probability of generating the token corresponding to the chosen option.
  - Generative Classification: Instead of predicting a label, the LLM is prompted with a label and asked to generate the input text, with confidence based on how well it generates the text.
- Black-Box Methods: These work even when internal token probabilities are not accessible, relying only on the LLM’s plain-text generations.
  - Generation Sampling: The model is prompted multiple times, and consistency across its different generations is used as a proxy for confidence.
  - Verbalized Confidence: The LLM is directly asked to state its confidence level (e.g., on a scale of 0-100) for a given answer.
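Three of the strategies above can be sketched in a few lines, assuming the model outputs have already been collected. The helper names, the renormalization over the “true” and “false” tokens, the majority-vote agreement measure, and the number parsing are all illustrative assumptions; the paper's exact formulations may differ.

```python
import math
import re
from collections import Counter

def true_false_confidence(logprob_true, logprob_false):
    """White-box: probability of "true", renormalized over the two
    answer tokens (the renormalization is an assumed design choice)."""
    top = max(logprob_true, logprob_false)  # subtract the max for numerical stability
    p_true = math.exp(logprob_true - top)
    p_false = math.exp(logprob_false - top)
    return p_true / (p_true + p_false)

def sampling_confidence(answers):
    """Black-box: majority answer and the fraction of samples that agree with it."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)

def verbalized_confidence(reply):
    """Black-box: read the first number in the reply as a 0-100 confidence."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None  # the model did not state a number
    return min(float(match.group()), 100.0) / 100.0

# Hypothetical model outputs, for illustration only.
print(true_false_confidence(math.log(0.6), math.log(0.2)))
print(sampling_confidence(["yes", "yes", "no", "yes"]))
print(verbalized_confidence("Confidence: 85"))
```

Each function returns a value in [0, 1], so any of them can supply the sub-structure scores that the inference step consumes.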
Learning to Improve Structured Prediction
Beyond just prompting, the paper also explores several learning strategies to further align LLMs with structured prediction objectives:
- Few-Shot Score Calibration: A small logistic regression layer is trained on top of the LLM’s confidence scores. This can be done locally (for individual sub-problems) or globally (using a structured hinge loss that considers the entire structure).
- Supervised Fine-tuning: The LLM itself is fine-tuned using standard next-token prediction objectives on the specific tasks.
- Structured Fine-tuning: This advanced method backpropagates the structured hinge loss directly into the LLM, allowing the model to learn from the global structural constraints.
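As a rough illustration of the structured hinge loss used in the global calibration and structured fine-tuning strategies above, the sketch below computes it over an explicitly enumerated set of valid structures. The linear scoring function, the Hamming-distance cost, and the example weights are assumptions; in the paper the loss is backpropagated into the model, whereas here it is evaluated on plain numbers.

```python
def hamming(y, gold):
    """Number of sub-structure decisions on which y disagrees with gold."""
    return sum(y[k] != gold[k] for k in gold)

def score(y, weights):
    """Linear structure score: sum of the weights of the active decisions."""
    return sum(weights[k] * y[k] for k in weights)

def structured_hinge_loss(weights, gold, feasible):
    """max over valid structures of (score + cost), minus the gold score."""
    augmented = max(score(y, weights) + hamming(y, gold) for y in feasible)
    return max(0.0, augmented - score(gold, weights))

keys = ("AB", "AC", "BC")

def structure(ab, ac, bc):
    return dict(zip(keys, (ab, ac, bc)))

# The five transitive clusterings of three mentions.
feasible = [
    structure(0, 0, 0), structure(1, 0, 0), structure(0, 1, 0),
    structure(0, 0, 1), structure(1, 1, 1),
]
gold = structure(1, 0, 0)  # only A and B corefer

# Hypothetical link scores produced by a model.
weights = {"AB": 2.0, "AC": -1.0, "BC": 0.5}
loss = structured_hinge_loss(weights, gold, feasible)
print(loss)
```

The loss is zero only when the gold structure outscores every other valid structure by at least its cost, so the training signal depends on the whole structure rather than on each pairwise decision in isolation.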
Key Findings and Impact
The framework was evaluated on two challenging discourse-level tasks: morality framing in political tweets and coreference resolution. Three findings stand out:
- Inference Helps: Across all confidence estimation strategies, adding combinatorial inference consistently improved performance compared to using LLM prompting alone. This highlights the value of enforcing structural consistency.
- Best Confidence Estimation: Formulating prompts as true/false questions and using the probability of generating the “true” token proved to be the most effective method for extracting confidence scores.
- Structured Learning Boosts Performance: Fine-tuning LLMs using structured prediction objectives, especially the structured fine-tuning method, led to significant performance gains. This approach even surpassed previous state-of-the-art methods that relied on classical deep structured prediction.
In conclusion, this research demonstrates that combining the powerful generative capabilities of LLMs with the logical rigor of combinatorial inference and targeted structured learning can lead to more accurate and consistent predictions for complex language tasks. This work, by Matt Pauk and Maria Leonor Pacheco, paves the way for more reliable and robust applications of LLMs in structured prediction, addressing some of their inherent limitations.


