TLDR: PM4GRPO is a novel reinforcement learning framework that significantly improves the multi-step reasoning capabilities of large AI models. Unlike traditional methods that only reward correct answers, PM4GRPO integrates “process mining” techniques to evaluate how closely an AI model’s internal thinking process aligns with that of a pre-trained expert teacher model. This “reasoning-aware” approach, which combines conformance rewards with standard outcome-centric rewards, has demonstrated superior performance across various complex mathematical reasoning benchmarks for both 1.5B and 7B parameter models.
Large Language Models, often referred to as Large Reasoning Models (LRMs) when they demonstrate multi-step reasoning, have shown remarkable capabilities in tackling complex tasks. A key element in developing these reasoning behaviors is reinforcement learning (RL)-based post-training, which aligns the model’s generation strategies with high-level reasoning goals.
Among these RL methods, Group Relative Policy Optimization (GRPO) has gained attention for its simpler design and more stable optimization compared to older approaches. However, a significant limitation of current GRPO-inspired methods is their focus primarily on the final answer. They often reward models based on correctness or superficial aspects like text length or formatting, overlooking the actual step-by-step reasoning process the model uses to arrive at an answer. This can lead to issues like overly verbose responses, speculative jumps in logic, or even accidental correctness without a true understanding of the problem.
To address this, researchers have introduced a novel framework called PM4GRPO, which stands for Reasoning-Aware Group Relative Policy Optimization using Process Mining. The core idea behind PM4GRPO is to treat the reasoning, or thinking, of an LRM as a process in its own right, a concept dubbed “Thinking is a Process” (THIP).
How PM4GRPO Works
PM4GRPO enhances the standard GRPO framework by incorporating a “conformance reward.” This reward evaluates how well the reasoning process of a policy model (the student) aligns with that of a pre-trained, high-performing teacher model. To achieve this, PM4GRPO utilizes techniques from Process Mining (PM), a field dedicated to analyzing process execution logs.
Specifically, for each problem, the policy model generates its own reasoning trace. This trace is transformed into a process model using the Inductive Miner, a standard process-discovery algorithm. The student-generated process model is then compared against the teacher model’s reasoning process using alignment-based conformance checking. This comparison yields two metrics: fitness (how well the student’s model explains the behavior in the teacher’s process) and precision (the extent to which the student’s model avoids allowing behavior that never appears in the teacher’s process). The two metrics are combined into an F1 score, their harmonic mean, which serves as the conformance reward.
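To make this concrete, here is a minimal sketch of the conformance-reward computation using the open-source pm4py process-mining library. The step labels, the synthetic timestamps, the helper function, and the fallback dictionary keys are illustrative assumptions; the paper’s actual procedure for extracting step-level events from reasoning traces is not reproduced here.

```python
# A minimal sketch, assuming reasoning traces have already been abstracted
# into discrete step labels. Uses the pm4py library (pip install pm4py).
import pandas as pd
import pm4py

def traces_to_log(traces):
    """Turn lists of reasoning-step labels into a pm4py-formatted event log."""
    rows = []
    for case_id, steps in enumerate(traces):
        for order, step in enumerate(steps):
            rows.append({
                "case:concept:name": str(case_id),   # one case per reasoning trace
                "concept:name": step,                # the reasoning step (activity)
                "time:timestamp": pd.Timestamp("2024-01-01")
                                  + pd.Timedelta(seconds=order),
            })
    return pd.DataFrame(rows)

# Hypothetical reasoning traces, abstracted to step labels (assumed labels).
student_traces = [["define", "derive", "verify", "answer"],
                  ["define", "guess", "answer"]]
teacher_traces = [["define", "derive", "verify", "answer"],
                  ["define", "derive", "derive", "verify", "answer"]]

student_log = traces_to_log(student_traces)
teacher_log = traces_to_log(teacher_traces)

# 1) Discover a process model from the student's traces with the Inductive Miner.
net, im, fm = pm4py.discover_petri_net_inductive(student_log)

# 2) Alignment-based conformance checking of the teacher's behavior against
#    the student's model yields fitness and precision.
fit = pm4py.fitness_alignments(teacher_log, net, im, fm)
fitness = fit.get("log_fitness", fit.get("averageFitness", 0.0))
precision = pm4py.precision_alignments(teacher_log, net, im, fm)

# 3) F1 score (harmonic mean) of fitness and precision = conformance reward.
conformance_reward = 2 * fitness * precision / (fitness + precision + 1e-9)
print(f"fitness={fitness:.3f}  precision={precision:.3f}  "
      f"conformance reward={conformance_reward:.3f}")
```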
This conformance reward is then integrated with the reward components already used in GRPO-inspired methods: the format reward (ensuring the output adheres to structural requirements) and the answer reward (assessing the correctness of the final answer). The combined reward function guides the policy model during training, encouraging it not only to reach the right answer but also to reason in a structured way that aligns with the expert model.
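As a rough illustration, the sketch below shows how the three reward terms might be combined per sequence. The tag-based output format, the exact-match answer rule, and the unweighted sum are assumptions for illustration, not the paper’s exact reward definitions.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows an assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"^<think>.*</think>\s*<answer>.*</answer>\s*$"
    return 1.0 if re.match(pattern, output, flags=re.DOTALL) else 0.0

def answer_reward(output: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference answer exactly."""
    m = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def combined_reward(output: str, gold: str, conformance: float) -> float:
    # Unweighted sum is an assumption; the paper may weight the terms differently.
    return format_reward(output) + answer_reward(output, gold) + conformance

out = "<think>Factor the quadratic, then check both roots.</think><answer>x = 3</answer>"
print(combined_reward(out, "x = 3", conformance=0.72))  # -> 2.72
```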
The framework performs sequence-level optimization: rewards are computed once the entire reasoning process for a problem is complete, rather than at each token. This is facilitated by Group Sequence Policy Optimization (GSPO), which aligns the optimization unit with the sequence-level reward structure.
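For intuition, here is a small sketch of group-relative advantages and a sequence-level clipped objective in the GSPO style. The length-normalized sequence likelihood ratio follows the GSPO formulation; the numbers are made up, and nothing here is PM4GRPO-specific code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of G sampled answers to one prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def gspo_objective(logp_new, logp_old, seq_lens, advantages, clip=0.2):
    """Sequence-level clipped surrogate: length-normalized likelihood ratios
    per whole sequence, clipped as in PPO (following the GSPO formulation)."""
    ratios = np.exp((np.asarray(logp_new) - np.asarray(logp_old))
                    / np.asarray(seq_lens, dtype=np.float64))
    clipped = np.clip(ratios, 1.0 - clip, 1.0 + clip)
    return float(np.mean(np.minimum(ratios * advantages, clipped * advantages)))

# Example: one prompt, four sampled solutions with combined
# (format + answer + conformance) sequence-level rewards.
rewards = [2.72, 0.35, 2.58, 1.10]
adv = group_relative_advantages(rewards)
obj = gspo_objective(logp_new=[-120.5, -98.0, -130.2, -110.7],
                     logp_old=[-118.0, -99.5, -129.0, -112.3],
                     seq_lens=[256, 190, 301, 240],
                     advantages=adv)
print(f"surrogate objective: {obj:.4f}")
```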
Experimental Success
To evaluate PM4GRPO, experiments were conducted using 1.5B and 7B parameter backbone models. The models were trained on the DeepMath-103K dataset and evaluated on five challenging mathematical benchmarks: MATH500, OlympiadBench, Minerva Math, AIME24, and AIME25.
The results demonstrated that PM4GRPO consistently outperformed existing GRPO-based post-training methods. For 7B models, PM4GRPO achieved the highest scores on MATH500 (91.1%) and OlympiadBench (61.1%), surpassing strong baselines. It also showed competitive performance on Minerva Math and clear advantages on the challenging AIME24 and AIME25 benchmarks, indicating superior reasoning and generalization ability.
Similarly, for 1.5B models, PM4GRPO achieved the best overall performance across MATH500, OlympiadBench, and Minerva Math, slightly outperforming previous strong baselines. While another model achieved higher scores on AIME24 and AIME25 at the 1.5B scale, PM4GRPO delivered more balanced performance across all benchmarks, suggesting better generalization and stability even in smaller model configurations.
Conclusion
The introduction of PM4GRPO marks a significant step forward in enhancing the reasoning capabilities of large AI models. By leveraging process mining techniques to evaluate and reward the reasoning procedure itself, rather than just the final outcome, PM4GRPO effectively improves the post-training of LRMs. These findings highlight the potential of process mining as a powerful tool for quantitatively assessing reasoning processes in the context of reinforcement learning for AI. More details are available in the full research paper.