TLDR: OPTAGENT is a novel framework that optimizes e-commerce search queries by using a multi-agent simulation and genetic algorithms. LLM-based agents act as diverse shopping customers, evaluating product relevance and purchase intent to create a dynamic ‘fitness score’ for queries. This score then guides an evolutionary algorithm to iteratively refine queries. The framework significantly improves query performance, especially for challenging, infrequent ‘tail queries’, demonstrating a powerful method for optimizing LLMs in subjective domains without relying on traditional, static reward signals.
In the fast-paced world of e-commerce, millions of users search for products daily on platforms like Amazon and Etsy. These queries are often short, ambiguous, or riddled with typos, making it hard for search engines to infer what the user actually wants. Reformulating such queries, a task known as Query Rewriting (QR), is crucial for connecting shoppers with the products they truly desire. However, evaluating whether a rewritten query genuinely captures user intent is subjective, with no single ‘correct’ answer, which makes traditional optimization methods difficult to apply.
Understanding the Challenge
Large Language Models (LLMs) have shown remarkable capabilities in various tasks, especially those with clear, verifiable solutions like coding or mathematics. But for subjective tasks like e-commerce query rewriting, where the ‘gold standard’ is elusive, their adoption faces hurdles. Existing methods often rely on human feedback, which is costly and slow, or a single LLM acting as a judge. However, a single LLM judge can be prone to biases, lack robustness, and be unreliable when evaluating complex criteria.
Introducing OPTAGENT: A Novel Approach
A new framework called OPTAGENT addresses this fundamental challenge by combining multi-agent simulations with genetic algorithms to verify and optimize queries for e-commerce query rewriting. Instead of a static reward model or a single LLM judge, OPTAGENT employs multiple LLM-based agents, each simulating a unique shopping customer, to provide a dynamic reward signal. This collective judgment forms an effective ‘fitness function’ for an evolutionary algorithm that continuously refines the user’s initial query.
How OPTAGENT Works
The OPTAGENT framework operates in two primary stages: multi-agent evaluation and genetic algorithm optimization.
Multi-Agent Evaluation: OPTAGENT uses an ensemble of LLM-based agents, each acting as a simulated user. To ensure diverse reasoning paths and avoid the biases that predefined ‘personas’ can introduce, each agent is initialized with a different ‘temperature’ setting: a lower temperature yields more deterministic outputs, while a higher temperature encourages more exploratory, varied responses.

For a given rewritten query, each agent searches the shopping platform, analyzes the first page of results (product title, description, image, price, reviews, shipping), and assigns each product a semantic relevance score (Fully Relevant, Partially Relevant, or Irrelevant). After evaluating all products, the agent decides which ones it would ‘purchase’ and tallies a total raw purchase value. These individual judgments are then aggregated into a single continuous fitness score that combines the average semantic score of the top-10 products, the average semantic score of all retrieved products, and a normalized purchase value.
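The aggregation step above can be sketched as follows. This is a minimal illustration, not the paper’s exact formula: the numeric mapping of the relevance labels, the equal weighting of the three components, and the function names are all assumptions.

```python
from statistics import mean

# Assumed numeric mapping for the three relevance labels.
SEMANTIC_SCORE = {"fully": 1.0, "partial": 0.5, "irrelevant": 0.0}

def agent_fitness(labels, purchase_value, max_purchase_value):
    """Aggregate one simulated shopper's judgments into a fitness score.

    labels: relevance label per retrieved product, in result-page order.
    purchase_value: total raw value of the products the agent would 'buy'.
    """
    scores = [SEMANTIC_SCORE[label] for label in labels]
    top10 = mean(scores[:10])          # average semantic score of top-10 products
    overall = mean(scores)             # average semantic score of all products
    norm_purchase = (purchase_value / max_purchase_value
                     if max_purchase_value else 0.0)
    # Equal weighting is an assumption; the paper may combine these differently.
    return (top10 + overall + norm_purchase) / 3

def query_fitness(agent_results):
    """Average the score across the agent ensemble
    (each agent run at a different sampling temperature)."""
    return mean(agent_fitness(*result) for result in agent_results)
```

Averaging over several agents with varied temperatures smooths out the noise any single LLM judge would introduce, which is what makes the score usable as a fitness function.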
Genetic Algorithms for Optimization: The simulation-based fitness function guides a genetic algorithm, which is inspired by natural selection. This algorithm is robust to the ‘noisy’ and subjective nature of the fitness score. The process involves:
- Initial Population Generation: An LLM generates several diverse, semantically similar versions of the original user query to start the optimization process.
- Selection: The top-performing queries (based on their fitness scores) from the current generation are directly passed to the next generation.
- Crossover: With a certain probability, two parent queries are selected, and an LLM combines their meaningful semantic elements to create a new ‘child’ query.
- Mutation: With another probability, a selected query undergoes a small but meaningful alteration (e.g., using a synonym, reordering words) by an LLM to create a new variant.
This iterative process continues for a fixed number of generations, with the goal of discovering high-fitness queries. The final output is the query with the highest fitness score found throughout the entire evolutionary process.
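The four steps above can be sketched as a compact evolutionary loop. This is a sketch under stated assumptions: the LLM operators are stubbed out with placeholder functions, and the population size, elitism count, and crossover/mutation probabilities are illustrative, not the paper’s settings.

```python
import random

# Hypothetical stand-ins for the LLM calls behind each evolutionary operator.
def llm_rewrite_variants(query, n):   # initial population generation
    return [f"{query} (variant {i})" for i in range(n)]

def llm_crossover(a, b):              # combine semantic elements of two parents
    return f"{a} + {b}"

def llm_mutate(q):                    # small but meaningful alteration
    return f"{q} (reworded)"

def evolve(query, fitness, generations=3, pop_size=6,
           elite=2, p_cross=0.5, p_mut=0.3, seed=0):
    """Genetic-algorithm sketch: a simulation-based fitness function guides
    selection, crossover, and mutation over query rewrites."""
    rng = random.Random(seed)
    population = [query] + llm_rewrite_variants(query, pop_size - 1)
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:elite]                      # selection: keep the elite
        while len(next_gen) < pop_size:
            roll = rng.random()
            if roll < p_cross:                         # crossover
                a, b = rng.sample(ranked[: pop_size // 2], 2)
                next_gen.append(llm_crossover(a, b))
            elif roll < p_cross + p_mut:               # mutation
                next_gen.append(llm_mutate(rng.choice(ranked)))
            else:                                      # carry over unchanged
                next_gen.append(rng.choice(ranked))
        population = next_gen
        best = max([best] + population, key=fitness)   # track best seen so far
    return best
```

Note that the best query is tracked across all generations, not just the last one, matching the framework’s final-output rule.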
Key Findings and Performance
OPTAGENT was evaluated on a dataset of 1000 real-world e-commerce queries across five categories. The results showed significant improvements:
- On average, OPTAGENT improved query fitness by 21.98% over the original user query.
- It outperformed a Best-of-N LLM rewriting baseline by 3.36%.
- The framework was particularly effective for ‘tail queries’ (infrequent search terms), showing the largest relative improvement (28.67%). This is crucial because traditional methods struggle with tail queries due to a lack of historical data.
- Performance consistently improved across generations, indicating that the evolutionary operators effectively discover better queries over time.
- An ablation study confirmed that the evolutionary operations, especially the ‘crossover’ mechanism, are critical for achieving peak performance.
The evaluation agents also exhibited a position bias, similar to real users, preferring products listed higher in search results. While some limitations were noted (e.g., difficulty parsing hidden information in interactive website elements or over-reliance on customer reviews for new products), the agents showed a moderate and meaningful alignment with human judgment.
Conclusion
OPTAGENT offers a generalizable and scalable solution for optimizing LLMs in subjective domains where explicit reward signals are scarce. By replacing static reward functions with a dynamic fitness evaluation derived from a multi-agent simulation, it creates a rich and nuanced landscape that better captures the complexity of human preference. This approach, detailed further in the research paper available at arXiv:2510.03771, opens new avenues for developing more capable and aligned AI systems in a wide range of human-centric applications, particularly in e-commerce query rewriting.