TLDR: A new survey explores the transformative impact of Large Language Models (LLMs) on optimization modeling, a field that has traditionally required deep human expertise. The paper details advancements in data synthesis, model fine-tuning, inference frameworks, and evaluation methods. It highlights critical quality problems in existing benchmark datasets and proposes cleaned versions for more reliable comparisons. The survey also identifies the best-performing LLM frameworks and outlines future research avenues, including stronger reasoning, explainability, domain knowledge integration, and human-LLM collaboration in optimization tasks.
Optimization modeling, a powerful tool for optimal decision-making across various industries, has traditionally required significant expertise from operations research professionals. This expertise barrier has limited its broader adoption, despite its potential to greatly enhance efficiency in areas like supply chain management, healthcare, and air traffic control.
However, the emergence of Large Language Models (LLMs) is creating new opportunities to automate this complex process. A recent survey titled “A Survey of Optimization Modeling Meets LLMs: Progress and Future Directions” explores how LLMs are transforming the field, making optimization more accessible and efficient. The paper, available at https://arxiv.org/pdf/2508.10047, provides a comprehensive review of advancements across the entire technical stack.
Automating Mathematical Modeling
The core idea is to translate natural language descriptions of optimization problems into formal mathematical models, including variables, constraints, and objective functions. This process, known as Natural Language for Optimization (NL4Opt), is challenging because it often involves understanding domain-specific terminology and inferring implicit constraints from text.
LLMs are proving capable of understanding these complex descriptions: identifying objectives, extracting variables, building the mathematical models, and even generating the code needed to solve them. The survey groups the progress into several key areas:
- Domain-specific LLMs: Models like ORLM and LLMOPT are being fine-tuned with specialized data to improve their optimization modeling capabilities.
- Advanced Inference Frameworks: Techniques such as Chain-of-Experts and Tree of Thoughts are enhancing LLMs’ reasoning abilities for these problems.
- Benchmark Datasets and Evaluation: New datasets like IndustryOR and MAMO are being developed to test and compare different LLM approaches.
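To make the translation task concrete, here is a minimal sketch of what a natural-language problem and its formal model might look like. The bakery problem, its numbers, and the function names are all invented for illustration and do not come from the survey. Plain enumeration stands in for a real solver so the example needs no third-party packages.

```python
from itertools import product

# Hypothetical problem statement, invented for illustration (not from the survey).
PROBLEM = (
    "A bakery earns $3 per loaf of bread and $5 per cake. "
    "A loaf needs 1 oven hour and a cake needs 2; 14 oven hours are available. "
    "At most 8 loaves can be sold. How many of each should be baked to maximize profit?"
)

# Formal model a modeling system would be expected to produce:
#   variables:   loaves >= 0, cakes >= 0 (integers)
#   objective:   maximize 3*loaves + 5*cakes
#   constraints: loaves + 2*cakes <= 14   (oven hours)
#                loaves <= 8              (demand limit)

def profit(loaves, cakes):
    """Objective function: total profit in dollars."""
    return 3 * loaves + 5 * cakes

def feasible(loaves, cakes):
    """True if the plan satisfies both constraints."""
    return loaves + 2 * cakes <= 14 and loaves <= 8

def solve():
    """Enumerate every integer plan and return the best feasible one.

    Brute force is only viable because this toy problem is tiny; a real
    pipeline would hand the model to an LP/MIP solver instead.
    """
    plans = [(x, y) for x, y in product(range(15), range(8)) if feasible(x, y)]
    best = max(plans, key=lambda plan: profit(*plan))
    return best, profit(*best)

if __name__ == "__main__":
    print(PROBLEM)
    print(solve())
```

The hard part for an LLM is the comment block in the middle: reading the text, deciding what the variables are, and recovering every constraint, including ones a real problem description might leave implicit.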
Addressing Data Quality and Evaluation Challenges
A significant finding highlighted in the survey is the surprisingly high error rate in existing benchmark datasets used for evaluating LLM performance in optimization modeling. Some datasets were found to have error rates exceeding 50%, which undermines the reliability of performance comparisons. The authors addressed this by manually cleaning these datasets and constructing a new leaderboard for fair evaluation.
The survey also points out that current benchmarks mostly cover simple to moderate problems, with a scarcity of truly complex cases. This imbalance suggests a need for more challenging datasets to push the boundaries of LLM capabilities.
Furthermore, evaluating optimization models generated by LLMs is itself complex. Some methods check only whether solving the generated model yields the correct optimal objective value, while others compare the generated model directly against a reference formulation. The survey notes that reported results are inconsistent across studies because of differing base models, data preprocessing, and metrics. To provide a clearer picture, the authors conducted a unified evaluation of open-source methods using a cutting-edge LLM (GPT-4o) on their cleaned benchmarks.
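The objective-value style of evaluation can be sketched in a few lines. This is an illustrative assumption about how such a metric might be computed, not the survey's actual scoring code: the function names, the tolerance values, and the convention that a failed run is recorded as `None` are all hypothetical.

```python
import math

def objective_matches(predicted, reference, rel_tol=1e-4):
    """Return True if a predicted optimal value agrees with the reference.

    `predicted` is None when the generated model failed to run or solve.
    The tolerances are illustrative choices, not values from the survey.
    """
    if predicted is None:
        return False
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=1e-6)

def solving_accuracy(predictions, references, rel_tol=1e-4):
    """Fraction of problems whose predicted objective matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        objective_matches(p, r, rel_tol) for p, r in zip(predictions, references)
    )
    return hits / len(references)

if __name__ == "__main__":
    # Made-up numbers: one exact match, one failed run, one near match, one miss.
    preds = [39.0, None, 12.5, 100.0]
    refs = [39.0, 7.0, 12.5004, 90.0]
    print(solving_accuracy(preds, refs))
```

A metric like this is cheap to compute, but it also shows why benchmark errors matter so much: if a reference value is wrong, a correct model is scored as a failure. It can also credit a wrong model that happens to reach the right number, which is one motivation for comparing formulations directly.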
Key Performance Insights
The unified evaluation revealed that Chain-of-Experts and ORLM are highly competitive frameworks. While Chain-of-Experts performs well on simpler tasks, ORLM shows stronger performance on more complex datasets, suggesting that models specifically trained for optimization may excel in challenging scenarios. Interestingly, the popular Chain-of-Thought (CoT) prompting method did not always outperform standard prompting, indicating it should be applied selectively.
Future Directions
The survey concludes by outlining several promising future research directions:
- Reasoning Models: Developing LLMs that can perform more sophisticated, multi-step reasoning for optimization problems, potentially using reinforcement learning.
- Explainable Modeling Processes: Making the LLM’s modeling process more transparent and understandable for human experts, allowing for easier debugging and modification.
- Domain Knowledge Injection: Integrating specialized domain knowledge, possibly from knowledge graphs, into LLMs to improve their understanding and modeling accuracy.
- Human-in-the-Loop Modeling: Creating collaborative systems where human experts can provide input, clarifications, and insights at critical points during the LLM’s modeling process.
To support the research community, the authors have also developed an online portal that provides access to cleaned datasets, code repositories, and a leaderboard of existing solutions, along with updates on the latest research papers in this rapidly evolving field.


