TLDR: SHERPA is a model-driven framework that uses hierarchical state machines to provide structured reasoning and control over Large Language Model (LLM) execution. It integrates domain-specific best practices into the execution flow, improving LLM performance on complex tasks such as code generation, class name generation, and question answering. The gains are largest for smaller LLMs and for tasks with established workflows, and efficient state machine design also enables cost optimization. The framework enhances reliability and supports flexible, modular development of LLM-based applications.
Large Language Models (LLMs) have become incredibly powerful, excelling in many areas from writing code to answering complex questions. However, despite their impressive abilities, LLMs often struggle with tasks that require structured reasoning or specific domain knowledge, especially when that knowledge isn’t widely available in their training data. This can lead to inconsistent or even incorrect outputs, a phenomenon sometimes referred to as ‘hallucination’. Existing methods like Chain-of-Thought prompting help, but they often lack a general way to reliably control how an LLM behaves throughout a complex task.
Enter SHERPA, a new model-driven framework designed to bring much-needed structure and control to LLM execution. SHERPA tackles this challenge by explicitly incorporating domain-specific best practices into hierarchical state machines. Think of a state machine as a detailed flowchart that guides the LLM through a task, breaking it down into smaller, manageable steps with clear rules for progression.
How SHERPA Works
At its core, SHERPA defines each LLM-powered task as an ‘agent’ associated with a state machine. When a user or another system interacts with the agent, it sends an ‘event’ to the state machine. This event, along with other relevant information, is stored in the agent’s ‘belief’—a structured memory that keeps track of the task’s history and intermediate results. The state machine then uses this information, along with predefined rules or even another LLM (acting as a ‘policy’), to decide the next best step or ‘transition’.
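The loop described above can be sketched in a few lines of Python. All names here (`Agent`, `Belief`, `handle_event`, the transition-table shape) are illustrative assumptions, not the actual SHERPA API; the point is how an event flows into the belief, and how a policy picks the next transition.

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    """Structured memory: trajectory of steps, action log, key-value store."""
    trajectory: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    store: dict = field(default_factory=dict)

def rule_policy(state, belief, transitions):
    """Rule-based policy: take the first transition whose guard holds."""
    for target, guard in transitions.get(state, []):
        if guard(belief):
            return target
    return None  # no applicable transition; stay in the current state

class Agent:
    def __init__(self, initial_state, transitions, policy=rule_policy):
        self.state = initial_state
        self.transitions = transitions  # state -> [(target, guard), ...]
        self.policy = policy
        self.belief = Belief()

    def handle_event(self, event):
        # Record the incoming event, then let the policy decide the move.
        self.belief.trajectory.append((self.state, event))
        nxt = self.policy(self.state, self.belief, self.transitions)
        if nxt is not None:
            self.belief.actions.append(f"{self.state} -> {nxt}")
            self.state = nxt
        return self.state

# Toy usage: a two-step workflow gated on a result stored in the belief.
transitions = {
    "Idle": [("Working", lambda b: True)],
    "Working": [("Done", lambda b: b.store.get("ok", False))],
}
agent = Agent("Idle", transitions)
agent.handle_event("start")        # Idle -> Working
agent.belief.store["ok"] = True    # an action records its result
agent.handle_event("check")        # Working -> Done
```

A rule-based policy like this one can be swapped for an LLM-backed policy without touching the state machine itself, which is what makes the design modular.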
This approach allows for fine-grained control over the LLM’s behavior. For instance, if a task requires generating code, the state machine can first guide the LLM to generate test cases, then the code itself, and then check if the code passes the tests. If it fails, the state machine can direct the LLM to refine the code, rather than starting from scratch or making an uncontrolled guess. This systematic decomposition of tasks, inspired by human best practices, significantly enhances the LLM’s performance and reliability.
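The code-generation workflow just described (tests first, then code, then refine on failure) can be sketched as a simple control loop. `call_llm` and `run_tests` are placeholder callables, not real APIs; the control flow, not the model call, is what the state machine contributes.

```python
def generate_with_tests(task, call_llm, run_tests, max_refinements=3):
    """Guide the LLM: generate tests, then code, then refine on failure."""
    tests = call_llm(f"Write test cases for: {task}")
    code = call_llm(f"Write code for: {task}\nIt must pass:\n{tests}")
    for _ in range(max_refinements):
        ok, report = run_tests(code, tests)
        if ok:
            return code
        # Refine the existing code instead of starting from scratch.
        code = call_llm("Fix this code so it passes the tests.\n"
                        f"Code:\n{code}\nFailures:\n{report}")
    return code  # best effort after the refinement budget is spent

# Toy stand-ins to show the flow without a real model: the fake LLM
# first writes a test, then buggy code, then the fix.
responses = iter([
    "assert add(2, 3) == 5",
    "def add(a, b): return a * b",   # buggy first attempt
    "def add(a, b): return a + b",   # refined attempt
])
fake_llm = lambda prompt: next(responses)

def fake_run_tests(code, tests):
    env = {}
    exec(code, env)
    try:
        exec(tests, env)
        return True, ""
    except AssertionError:
        return False, "test failed"

result = generate_with_tests("add two numbers", fake_llm, fake_run_tests)
```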
Key Components Explained Simply:
- State Machine: This is the blueprint, defining explicit states (like ‘Generating Test Cases’ or ‘Extracting Objects’) and transitions (rules for moving between states). It can be hierarchical, meaning complex tasks can be broken into sub-tasks within larger states.
- Policy: This is the decision-maker. It can be a set of simple rules (e.g., ‘if code fails tests, retry’) or an LLM itself, which analyzes the current situation and available options to choose the next best transition.
- Belief: This acts as the agent’s memory, storing the sequence of steps taken (trajectory), a log of all actions performed, and a key-value store for task-specific information. This ensures the LLM has all necessary context at each step.
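When the policy is itself an LLM, as the list above mentions, it can be given the current state, a summary of the belief, and the candidate transitions, and asked to pick one. The sketch below is an assumption about the prompt shape, not SHERPA's actual prompting; `call_llm` is again a placeholder client.

```python
def llm_policy(call_llm, state, belief_summary, options):
    """Ask an LLM to choose the next transition from explicit options."""
    prompt = (
        f"Current state: {state}\n"
        f"Context so far: {belief_summary}\n"
        f"Choose the next transition from: {options}\n"
        "Answer with exactly one option."
    )
    choice = call_llm(prompt).strip()
    # Guard against off-menu answers: fall back to the first option,
    # so the state machine never receives an invalid transition.
    return choice if choice in options else options[0]

# Toy usage with a canned model response:
fake_llm = lambda prompt: "  Refine  "
next_state = llm_policy(fake_llm, "Testing", "code failed once",
                        ["Refine", "Finish"])
```

Constraining the model to a closed set of transitions is what keeps even an LLM-driven policy inside the bounds the state machine defines.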
Real-World Applications and Benefits
The researchers demonstrated SHERPA’s effectiveness across a variety of tasks:
- Code Generation: By guiding LLMs through a ‘test-driven’ development process, SHERPA significantly improved the accuracy of generated code, ensuring it passed more tests.
- Class Name Generation: For tasks in model-driven engineering, where LLMs generate components of software models, SHERPA’s iterative refinement process led to more accurate and complete class names.
- Question Answering: For complex questions based on visual scene graphs, SHERPA helped LLMs classify question types and apply tailored strategies (e.g., counting objects deterministically after extraction), leading to better answers.
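The question-answering strategy in the last bullet can be illustrated with a toy router: classify the question first, then handle counting deterministically once the objects are extracted, instead of asking the model to count. The function names and scene-graph shape here are illustrative, not the paper's implementation.

```python
def answer(question, scene_objects, classify, extract):
    """Route a question to a type-specific strategy."""
    qtype = classify(question)  # e.g. "count", "exists", "attribute"
    if qtype == "count":
        # The LLM only extracts matching objects; counting them is
        # then a deterministic step the model cannot get wrong.
        matches = extract(question, scene_objects)
        return str(len(matches))
    # Other question types would route to their own tailored strategies.
    return "unsupported"

# Toy usage with stand-in classifier and extractor:
objs = ["dog", "cat", "dog", "tree"]
classify = lambda q: "count" if q.startswith("How many") else "other"
extract = lambda q, items: [o for o in items if o == "dog"]
result = answer("How many dogs are there?", objs, classify, extract)  # "2"
```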
The evaluation showed that integrating state machines with SHERPA generally improved LLM performance in 12 out of 15 tested scenarios. This benefit was particularly pronounced for smaller LLMs, which gained more from the structured guidance, and for tasks with well-established human best practices. Furthermore, the framework allows for flexible state machine design, meaning engineers can optimize workflows to reduce the number of LLM calls, thereby managing computational costs without sacrificing performance.
SHERPA represents a significant step forward in making LLMs more reliable and controllable for complex, domain-specific tasks. By decoupling the state machine design from the underlying actions, it enables rapid experimentation and optimization, paving the way for more robust and efficient AI applications.


