TLDR: SHERPA is a model-driven framework that uses hierarchical state machines to provide structured reasoning and control over Large Language Model (LLM) execution. It integrates domain-specific best practices into the execution flow, improving LLM performance on complex tasks such as code generation, class name generation, and question answering. The gains are largest for smaller LLMs and for tasks with established workflows, and efficient state machine design also enables cost optimization. The framework enhances reliability and supports flexible, modular development of LLM-based applications.
Large Language Models (LLMs) have become incredibly powerful, excelling in many areas from writing code to answering complex questions. However, despite their impressive abilities, LLMs often struggle with tasks that require structured reasoning or specific domain knowledge, especially when that knowledge isn’t widely available in their training data. This can lead to inconsistent or even incorrect outputs, a phenomenon sometimes referred to as ‘hallucination’. Existing methods like Chain-of-Thought prompting help, but they often lack a general way to reliably control how an LLM behaves throughout a complex task.
Enter SHERPA, a new model-driven framework designed to bring much-needed structure and control to LLM execution. SHERPA tackles this challenge by explicitly incorporating domain-specific best practices into hierarchical state machines. Think of a state machine as a detailed flowchart that guides the LLM through a task, breaking it down into smaller, manageable steps with clear rules for progression.
How SHERPA Works
At its core, SHERPA defines each LLM-powered task as an ‘agent’ associated with a state machine. When a user or another system interacts with the agent, it sends an ‘event’ to the state machine. This event, along with other relevant information, is stored in the agent’s ‘belief’—a structured memory that keeps track of the task’s history and intermediate results. The state machine then uses this information, along with predefined rules or even another LLM (acting as a ‘policy’), to decide the next best step or ‘transition’.
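The loop described above can be sketched in a few lines of Python. All names here (`Agent`, `Belief`, `handle_event`, the transition-table shape) are illustrative assumptions, not the actual SHERPA API; the point is how an event flows into the belief, and how a policy picks the next transition.

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    """Structured memory: trajectory of steps, action log, key-value store."""
    trajectory: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    store: dict = field(default_factory=dict)

def rule_policy(state, belief, transitions):
    """Rule-based policy: take the first transition whose guard holds."""
    for target, guard in transitions.get(state, []):
        if guard(belief):
            return target
    return None  # no applicable transition; stay in the current state

class Agent:
    def __init__(self, initial_state, transitions, policy=rule_policy):
        self.state = initial_state
        self.transitions = transitions  # state -> [(target, guard), ...]
        self.policy = policy
        self.belief = Belief()

    def handle_event(self, event):
        # Record the incoming event, then let the policy decide the move.
        self.belief.trajectory.append((self.state, event))
        nxt = self.policy(self.state, self.belief, self.transitions)
        if nxt is not None:
            self.belief.actions.append(f"{self.state} -> {nxt}")
            self.state = nxt
        return self.state

# Toy usage: a two-step workflow gated on a result stored in the belief.
transitions = {
    "Idle": [("Working", lambda b: True)],
    "Working": [("Done", lambda b: b.store.get("ok", False))],
}
agent = Agent("Idle", transitions)
agent.handle_event("start")        # Idle -> Working
agent.belief.store["ok"] = True    # an action records its result
agent.handle_event("check")        # Working -> Done
```

A rule-based policy like this one can be swapped for an LLM-backed policy without touching the state machine itself, which is what makes the design modular.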
This approach allows for fine-grained control over the LLM’s behavior. For instance, if a task requires generating code, the state machine can first guide the LLM to generate test cases, then the code itself, and then check if the code passes the tests. If it fails, the state machine can direct the LLM to refine the code, rather than starting from scratch or making an uncontrolled guess. This systematic decomposition of tasks, inspired by human best practices, significantly enhances the LLM’s performance and reliability.
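The code-generation workflow just described (tests first, then code, then refine on failure) can be sketched as a simple control loop. `call_llm` and `run_tests` are placeholder callables, not real APIs; the control flow, not the model call, is what the state machine contributes.

```python
def generate_with_tests(task, call_llm, run_tests, max_refinements=3):
    """Guide the LLM: generate tests, then code, then refine on failure."""
    tests = call_llm(f"Write test cases for: {task}")
    code = call_llm(f"Write code for: {task}\nIt must pass:\n{tests}")
    for _ in range(max_refinements):
        ok, report = run_tests(code, tests)
        if ok:
            return code
        # Refine the existing code instead of starting from scratch.
        code = call_llm("Fix this code so it passes the tests.\n"
                        f"Code:\n{code}\nFailures:\n{report}")
    return code  # best effort after the refinement budget is spent

# Toy stand-ins to show the flow without a real model: the fake LLM
# first writes a test, then buggy code, then the fix.
responses = iter([
    "assert add(2, 3) == 5",
    "def add(a, b): return a * b",   # buggy first attempt
    "def add(a, b): return a + b",   # refined attempt
])
fake_llm = lambda prompt: next(responses)

def fake_run_tests(code, tests):
    env = {}
    exec(code, env)
    try:
        exec(tests, env)
        return True, ""
    except AssertionError:
        return False, "test failed"

result = generate_with_tests("add two numbers", fake_llm, fake_run_tests)
```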
Key Components Explained Simply:
- State Machine: This is the blueprint, defining explicit states (like ‘Generating Test Cases’ or ‘Extracting Objects’) and transitions (rules for moving between states). It can be hierarchical, meaning complex tasks can be broken into sub-tasks within larger states.
- Policy: This is the decision-maker. It can be a set of simple rules (e.g., ‘if code fails tests, retry’) or an LLM itself, which analyzes the current situation and available options to choose the next best transition.
- Belief: This acts as the agent’s memory, storing the sequence of steps taken (trajectory), a log of all actions performed, and a key-value store for task-specific information. This ensures the LLM has all necessary context at each step.
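When the policy is itself an LLM, as the list above mentions, it can be given the current state, a summary of the belief, and the candidate transitions, and asked to pick one. The sketch below is an assumption about the prompt shape, not SHERPA's actual prompting; `call_llm` is again a placeholder client.

```python
def llm_policy(call_llm, state, belief_summary, options):
    """Ask an LLM to choose the next transition from explicit options."""
    prompt = (
        f"Current state: {state}\n"
        f"Context so far: {belief_summary}\n"
        f"Choose the next transition from: {options}\n"
        "Answer with exactly one option."
    )
    choice = call_llm(prompt).strip()
    # Guard against off-menu answers: fall back to the first option,
    # so the state machine never receives an invalid transition.
    return choice if choice in options else options[0]

# Toy usage with a canned model response:
fake_llm = lambda prompt: "  Refine  "
next_state = llm_policy(fake_llm, "Testing", "code failed once",
                        ["Refine", "Finish"])
```

Constraining the model to a closed set of transitions is what keeps even an LLM-driven policy inside the bounds the state machine defines.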
Real-World Applications and Benefits
The researchers demonstrated SHERPA’s effectiveness across a variety of tasks:
- Code Generation: By guiding LLMs through a ‘test-driven’ development process, SHERPA significantly improved the accuracy of generated code, ensuring it passed more tests.
- Class Name Generation: For tasks in model-driven engineering, where LLMs generate components of software models, SHERPA’s iterative refinement process led to more accurate and complete class names.
- Question Answering: For complex questions based on visual scene graphs, SHERPA helped LLMs classify question types and apply tailored strategies (e.g., counting objects deterministically after extraction), leading to better answers.
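The question-answering strategy in the last bullet can be illustrated with a toy router: classify the question first, then handle counting deterministically once the objects are extracted, instead of asking the model to count. The function names and scene-graph shape here are illustrative, not the paper's implementation.

```python
def answer(question, scene_objects, classify, extract):
    """Route a question to a type-specific strategy."""
    qtype = classify(question)  # e.g. "count", "exists", "attribute"
    if qtype == "count":
        # The LLM only extracts matching objects; counting them is
        # then a deterministic step the model cannot get wrong.
        matches = extract(question, scene_objects)
        return str(len(matches))
    # Other question types would route to their own tailored strategies.
    return "unsupported"

# Toy usage with stand-in classifier and extractor:
objs = ["dog", "cat", "dog", "tree"]
classify = lambda q: "count" if q.startswith("How many") else "other"
extract = lambda q, items: [o for o in items if o == "dog"]
result = answer("How many dogs are there?", objs, classify, extract)  # "2"
```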
The evaluation showed that integrating state machines with SHERPA generally improved LLM performance in 12 out of 15 tested scenarios. This benefit was particularly pronounced for smaller LLMs, which gained more from the structured guidance, and for tasks with well-established human best practices. Furthermore, the framework allows for flexible state machine design, meaning engineers can optimize workflows to reduce the number of LLM calls, thereby managing computational costs without sacrificing performance.
SHERPA represents a significant step forward in making LLMs more reliable and controllable for complex, domain-specific tasks. By decoupling the state machine design from the underlying actions, it enables rapid experimentation and optimization, paving the way for more robust and efficient AI applications.


