TLDR: PM4GRPO is a novel reinforcement learning framework that significantly improves the multi-step reasoning capabilities of large AI models. Unlike traditional methods that only reward correct answers, PM4GRPO integrates “process mining” techniques to evaluate how closely an AI model’s internal thinking process aligns with that of a pre-trained expert teacher model. This “reasoning-aware” approach, which combines conformance rewards with standard outcome-centric rewards, has demonstrated superior performance across various complex mathematical reasoning benchmarks for both 1.5B and 7B parameter models.
Large Language Models, often referred to as Large Reasoning Models (LRMs) when they demonstrate multi-step reasoning, have shown remarkable capabilities in tackling complex tasks. A key element in developing these reasoning behaviors is reinforcement learning (RL)-based post-training, which aligns the model’s generation strategies with high-level reasoning goals.
Among these RL methods, Group Relative Policy Optimization (GRPO) has gained attention for its simpler design and more stable optimization compared to older approaches. However, a significant limitation of current GRPO-inspired methods is their focus primarily on the final answer. They often reward models based on correctness or superficial aspects like text length or formatting, overlooking the actual step-by-step reasoning process the model uses to arrive at an answer. This can lead to issues like overly verbose responses, speculative jumps in logic, or even accidental correctness without a true understanding of the problem.
To address this, researchers have introduced a novel framework called PM4GRPO, which stands for Reasoning-Aware Group Relative Policy Optimization using Process Mining. The core idea behind PM4GRPO is to treat the reasoning, or thinking, of an LRM as a process in its own right, a concept dubbed “Thinking is a Process” (THIP).
How PM4GRPO Works
PM4GRPO enhances the standard GRPO framework by incorporating a “conformance reward.” This reward evaluates how well the reasoning process of a policy model (the student) aligns with that of a pre-trained, high-performing teacher model. To achieve this, PM4GRPO utilizes techniques from Process Mining (PM), a field dedicated to analyzing process execution logs.
Specifically, for each problem, the policy model generates its own reasoning trace. This trace is transformed into a process model using the Inductive Miner, a standard process-discovery algorithm. The student-generated process model is then compared against the teacher model’s reasoning process using alignment-based conformance checking. This comparison yields two metrics: fitness (how well the student’s model explains the behavior in the teacher’s process) and precision (the extent to which the student’s model avoids allowing behavior that never appears in the teacher’s process). The two metrics are combined into an F1 score, their harmonic mean, which serves as the conformance reward.
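To make this concrete, here is a minimal sketch of the conformance-reward computation using the open-source pm4py process-mining library. The step labels, the synthetic timestamps, the helper function, and the fallback dictionary keys are illustrative assumptions; the paper’s actual procedure for extracting step-level events from reasoning traces is not reproduced here.

```python
# A minimal sketch, assuming reasoning traces have already been abstracted
# into discrete step labels. Uses the pm4py library (pip install pm4py).
import pandas as pd
import pm4py

def traces_to_log(traces):
    """Turn lists of reasoning-step labels into a pm4py-formatted event log."""
    rows = []
    for case_id, steps in enumerate(traces):
        for order, step in enumerate(steps):
            rows.append({
                "case:concept:name": str(case_id),   # one case per reasoning trace
                "concept:name": step,                # the reasoning step (activity)
                "time:timestamp": pd.Timestamp("2024-01-01")
                                  + pd.Timedelta(seconds=order),
            })
    return pd.DataFrame(rows)

# Hypothetical reasoning traces, abstracted to step labels (assumed labels).
student_traces = [["define", "derive", "verify", "answer"],
                  ["define", "guess", "answer"]]
teacher_traces = [["define", "derive", "verify", "answer"],
                  ["define", "derive", "derive", "verify", "answer"]]

student_log = traces_to_log(student_traces)
teacher_log = traces_to_log(teacher_traces)

# 1) Discover a process model from the student's traces with the Inductive Miner.
net, im, fm = pm4py.discover_petri_net_inductive(student_log)

# 2) Alignment-based conformance checking of the teacher's behavior against
#    the student's model yields fitness and precision.
fit = pm4py.fitness_alignments(teacher_log, net, im, fm)
fitness = fit.get("log_fitness", fit.get("averageFitness", 0.0))
precision = pm4py.precision_alignments(teacher_log, net, im, fm)

# 3) F1 score (harmonic mean) of fitness and precision = conformance reward.
conformance_reward = 2 * fitness * precision / (fitness + precision + 1e-9)
print(f"fitness={fitness:.3f}  precision={precision:.3f}  "
      f"conformance reward={conformance_reward:.3f}")
```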
This conformance reward is then integrated with the reward components already used in GRPO-inspired methods: the format reward (ensuring the output adheres to structural requirements) and the answer reward (assessing the correctness of the final answer). The combined reward function guides the policy model during training, encouraging it not only to reach the right answer but also to reason in a structured way that aligns with the expert model.
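As a rough illustration, the sketch below shows how the three reward terms might be combined per sequence. The tag-based output format, the exact-match answer rule, and the unweighted sum are assumptions for illustration, not the paper’s exact reward definitions.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows an assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"^<think>.*</think>\s*<answer>.*</answer>\s*$"
    return 1.0 if re.match(pattern, output, flags=re.DOTALL) else 0.0

def answer_reward(output: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the reference answer exactly."""
    m = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def combined_reward(output: str, gold: str, conformance: float) -> float:
    # Unweighted sum is an assumption; the paper may weight the terms differently.
    return format_reward(output) + answer_reward(output, gold) + conformance

out = "<think>Factor the quadratic, then check both roots.</think><answer>x = 3</answer>"
print(combined_reward(out, "x = 3", conformance=0.72))  # -> 2.72
```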
The framework performs sequence-level optimization: rewards are computed once the entire reasoning process for a problem is complete, rather than at each token. This is facilitated by Group Sequence Policy Optimization (GSPO), which aligns the optimization unit with the sequence-level reward structure.
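For intuition, here is a small sketch of group-relative advantages and a sequence-level clipped objective in the GSPO style. The length-normalized sequence likelihood ratio follows the GSPO formulation; the numbers are made up, and nothing here is PM4GRPO-specific code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of G sampled answers to one prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def gspo_objective(logp_new, logp_old, seq_lens, advantages, clip=0.2):
    """Sequence-level clipped surrogate: length-normalized likelihood ratios
    per whole sequence, clipped as in PPO (following the GSPO formulation)."""
    ratios = np.exp((np.asarray(logp_new) - np.asarray(logp_old))
                    / np.asarray(seq_lens, dtype=np.float64))
    clipped = np.clip(ratios, 1.0 - clip, 1.0 + clip)
    return float(np.mean(np.minimum(ratios * advantages, clipped * advantages)))

# Example: one prompt, four sampled solutions with combined
# (format + answer + conformance) sequence-level rewards.
rewards = [2.72, 0.35, 2.58, 1.10]
adv = group_relative_advantages(rewards)
obj = gspo_objective(logp_new=[-120.5, -98.0, -130.2, -110.7],
                     logp_old=[-118.0, -99.5, -129.0, -112.3],
                     seq_lens=[256, 190, 301, 240],
                     advantages=adv)
print(f"surrogate objective: {obj:.4f}")
```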
Experimental Success
To evaluate PM4GRPO, experiments were conducted using 1.5B and 7B parameter backbone models. The models were trained on the DeepMath-103K dataset and evaluated on five challenging mathematical benchmarks: MATH500, OlympiadBench, Minerva Math, AIME24, and AIME25.
The results demonstrated that PM4GRPO consistently outperformed existing GRPO-based post-training methods. For 7B models, PM4GRPO achieved the highest scores on MATH500 (91.1%) and OlympiadBench (61.1%), surpassing strong baselines. It also showed competitive performance on Minerva Math and clear advantages on the challenging AIME24 and AIME25 benchmarks, indicating superior reasoning and generalization ability.
Similarly, for 1.5B models, PM4GRPO achieved the best overall performance across MATH500, OlympiadBench, and Minerva Math, slightly outperforming previous strong baselines. While another model achieved higher scores on AIME24 and AIME25 at the 1.5B scale, PM4GRPO delivered more balanced performance across all benchmarks, suggesting better generalization and stability even in smaller model configurations.
Conclusion
The introduction of PM4GRPO marks a significant step forward in enhancing the reasoning capabilities of large AI models. By leveraging process mining techniques to evaluate and reward the reasoning procedure itself, rather than just the final outcome, PM4GRPO effectively improves the post-training of LRMs. These findings highlight the potential of process mining as a powerful tool for quantitatively assessing reasoning processes in the context of reinforcement learning for AI. More details are available in the full research paper.