TLDR: This paper introduces a unified framework for understanding and evaluating how Large Language Models (LLMs) solve problems, contrasting traditional logic-based computation with natural language-based problem-solving. It defines distinct problem spaces (Formal, Natural Language, and LLM-addressable) and proposes a vector-valued ‘trust index’ (Q) to measure solution quality across multiple dimensions, moving beyond binary correctness to a continuous spectrum of ‘goodness’. The paper formalizes the concept of a ‘good enough’ solution for LLMs, considering both expected quality and variance, and outlines an iterative process for setting evaluation thresholds. Additionally, it introduces two new metrics: normalized bi-semantic entropy for assessing semantic robustness and emotional valence for measuring the emotional impact of LLM outputs, illustrating these concepts with a toy model application.
The world of problem-solving has long been dominated by classical computation, a realm of formal, logical systems that excel at tasks with clear, unambiguous rules. Think of sorting a list or executing a payment – these are problems where the answer is either perfectly right or perfectly wrong. This traditional approach has powered technological progress for decades, but it leaves a vast array of human problems untouched. These are the challenges characterized by ambiguity, dynamic environments, and subjective contexts, such as navigating a financial crisis or boosting team morale.
The emergence of Large Language Models (LLMs) marks a significant shift, allowing computational systems to engage with these previously inaccessible, natural language-based problems. A new research paper, “From Logic to Language: A Trust Index for Problem Solving with LLMs”, introduces a unified framework to understand and contrast these two distinct problem-solving paradigms.
Understanding Problem Spaces
The paper delineates different problem spaces. The first is the ‘Formal Problem Space’ (PFormal), which includes problems with well-defined inputs, outputs, and explicit rules, where correctness is objectively verifiable. Examples include compiling code or sorting data. The solution quality here is binary: either 1 (correct) or 0 (incorrect).
In contrast, the ‘Natural Language Addressable Problem Space’ (PNL) encompasses problems whose definition, exploration, or solution primarily involves natural language. These often have ill-defined goals, implicit rules, and depend heavily on context or subjective judgment. Solutions resemble ‘plans’ or ‘strategies’ more than deterministic outputs; examples include devising a marketing campaign or responding to a PR crisis. Historically, this was almost exclusively the domain of human cognition.
LLMs introduce a third space, ‘PLLM’, which significantly overlaps with PNL and even some parts of PFormal. LLMs can process and generate natural language, allowing them to tackle problems previously beyond computational reach. They can also generate code or explain formal concepts, bridging the gap between the two. The key insight is that LLMs expand the computationally addressable problem space into areas of PNL that were previously inaccessible to machines.
Introducing the Trust Index (Q)
For formal problems, evaluating a solution is straightforward: it’s either correct or not. However, for natural language problems, evaluation is far more nuanced. To address this, the paper proposes a ‘Trust Index’ (Q), a universal solution quality function. Instead of a simple binary outcome, Q maps a solution to a vector in an N-dimensional space, where N represents the number of relevant quality dimensions (e.g., factuality, clarity, safety, emotional tone). A value of 1 in a dimension signifies hypothetical perfection, while 0 indicates complete failure.
For LLM-based natural language problems, solutions exist on a spectrum of ‘goodness’ across these multiple dimensions. A perfect score across all dimensions might be conceptually impossible. Therefore, the goal is to find a solution that is ‘good enough’. The paper formalizes this concept with two conditions: the expected quality for each dimension must meet a chosen threshold, and the variance of that quality (a measure of its consistency) must stay within predefined bounds.
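The ‘good enough’ test can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the function name `is_good_enough`, the three example dimensions, and all threshold values are assumptions chosen for demonstration. It treats each sampled solution as a Q vector of scores in [0, 1] and checks the per-dimension mean and variance against the chosen limits.

```python
from statistics import mean, pvariance


def is_good_enough(samples, min_mean, max_var):
    """Check a set of sampled solutions against 'good enough' limits.

    samples  : list of Q vectors, one per sampled solution; each vector
               holds one score in [0, 1] per quality dimension.
    min_mean : per-dimension lower bound on the average score (assumed values).
    max_var  : per-dimension upper bound on the score variance (assumed values).
    """
    for d in range(len(min_mean)):
        scores = [q[d] for q in samples]
        # Fail if the average quality is too low or the spread is too wide.
        if mean(scores) < min_mean[d] or pvariance(scores) > max_var[d]:
            return False
    return True


# Hypothetical scores for three sampled answers on
# (factuality, clarity, safety); the numbers are made up.
samples = [
    [0.9, 0.8, 1.0],
    [0.8, 0.7, 1.0],
    [1.0, 0.9, 1.0],
]
print(is_good_enough(samples, min_mean=[0.8, 0.7, 0.95], max_var=[0.02, 0.02, 0.01]))
```

Note that a formal problem fits the same shape as a special case: a single dimension whose score is always 0 or 1.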
The evaluation of these dimensions must be rooted in statistical ideas, often involving ensembles of human or LLM-based evaluators to account for subjective criteria and the inherent statistical nature of LLMs. The paper outlines an iterative refinement process, where human feedback helps set the ‘good enough’ thresholds for LLM-generated solutions.
New Quality Metrics for LLMs
The paper also introduces two novel quality metrics to fill gaps in current LLM evaluation: Normalized Bi-Semantic Entropy and Emotional Valence.
Normalized Bi-Semantic Entropy measures the diversity of meaning in potential responses and the robustness of an LLM’s answers to different phrasings of the same question. A low score indicates high robustness and semantic consistency (desirable for factual queries), while a high score suggests brittleness or sensitivity to prompt phrasing (which might be desirable for creative tasks).
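To give a feel for how an entropy-style robustness score behaves, here is a minimal sketch. It does not reproduce the paper's exact definition of normalized bi-semantic entropy. It assumes that responses to several phrasings of one question have already been grouped into meaning clusters by some separate step (not shown), and it computes the Shannon entropy of the cluster distribution, divided by its maximum possible value. The function name `normalized_entropy` and the cluster labels are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(cluster_labels):
    """Shannon entropy of meaning clusters, scaled to the range [0, 1].

    cluster_labels : one label per response; responses judged to mean the
                     same thing share a label (the grouping is assumed given).
    Returns 0.0 when every response has the same meaning and 1.0 when
    every response has a different meaning.
    """
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    if n <= 1 or len(counts) == 1:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # log(n) is the largest entropy possible with n responses.
    return entropy / math.log(n)


# Hypothetical labels for answers pooled over paraphrases of one question.
consistent = ["paris", "paris", "paris", "paris"]
brittle = ["paris", "lyon", "unsure", "marseille"]
print(normalized_entropy(consistent))
print(normalized_entropy(brittle))
```

In this simplified form, the consistent set scores at the bottom of the range and the brittle set at the top, which matches the interpretation given above.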
Emotional Valence measures the emotional attachment or response elicited by a given solution. It maps an answer to a scalar value, where values close to 1 represent desirable emotional states (e.g., inspired, safe) and values close to 0 represent undesirable states (e.g., offended, scared). The goal is to maximize the expected emotional valence across a target population and minimize its variance, indicating a consensual and positively received response.
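The population-level goal (high expected valence, low variance) can be illustrated with a short sketch. This is not the paper's implementation: the function name `valence_summary` and the rating values are assumptions, and the ratings stand in for scores that would come from human or LLM-based evaluators.

```python
from statistics import mean, pvariance


def valence_summary(ratings):
    """Return (expected valence, variance) for one answer.

    ratings : one valence score in [0, 1] per member of the target
              population, where 1 is a desirable emotional state and
              0 an undesirable one.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 0.0 or r > 1.0 for r in ratings):
        raise ValueError("valence ratings must lie in [0, 1]")
    return mean(ratings), pvariance(ratings)


# Hypothetical ratings: one answer that most readers receive well,
# and one that splits the audience.
consensual = [0.8, 0.9, 0.7, 0.8]
polarizing = [0.0, 1.0, 0.0, 1.0]
print(valence_summary(consensual))
print(valence_summary(polarizing))
```

The polarizing answer has a middling average but a large variance, which is why the variance is tracked alongside the mean: an average alone would hide the fact that half the audience reacted badly.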
To illustrate these concepts, the authors developed a simple toy model application using Streamlit and Ollama, allowing interactive exploration of emotional valence and normalized bi-semantic entropy in simulated environments.
Conclusion
This research provides a formal framework for understanding and evaluating problem-solving in the age of LLMs. By moving beyond purely logic-based computation and formalizing concepts like problem spaces, the trust index, ambiguity, and subjectivity, it lays the groundwork for a more mature science of LLM-based systems. This paradigm shift opens new frontiers for digitalizing language-intensive domains, enabling AI to tackle problems previously exclusive to human cognition.


