TLDR: A new research paradigm called Reverse-Engineered Reasoning (REER) allows AI models to learn deep reasoning for open-ended, creative tasks by working “backwards” from high-quality solutions to discover the underlying thought processes. This method, which is scalable and gradient-free, led to the creation of DeepWriting-20K, a dataset of 20,000 reasoning trajectories. A model trained on this data, DeepWriter-8B, demonstrates performance competitive with and sometimes superior to leading proprietary models like GPT-4o and Claude 3.5 in creative and professional writing tasks, addressing a critical challenge in AI’s ability to handle non-verifiable domains.
Large language models (LLMs) have made incredible strides in areas like mathematics and programming, where answers can be easily verified. However, when it comes to open-ended, creative tasks like writing, instilling deep reasoning remains a significant hurdle. Traditional methods, such as reinforcement learning (RL) and instruction distillation, struggle here. RL lacks clear reward signals for subjective creative quality, while distillation is expensive and limited by the capabilities of the teacher model.
A new research paper, “Reverse-Engineered Reasoning for Open-Ended Generation,” introduces a novel approach called REverse-Engineered Reasoning (REER). This paradigm fundamentally shifts how deep reasoning is taught to LLMs. Instead of trying to build a reasoning process “forwards” through trial-and-error or imitation, REER works “backwards.” It starts with known high-quality solutions and computationally discovers the step-by-step reasoning process that could have led to them. This gradient-free and scalable method bypasses the limitations of previous approaches.
The core idea behind REER is to treat the discovery of a reasoning path as a search problem. Given a query and a known good solution, REER searches for the thinking process that makes that solution most probable to the model. The search is scored by the “perplexity” of the reference solution when the model is conditioned on the query and a candidate reasoning trajectory; this serves as a proxy for the trajectory’s quality. A lower perplexity means the reasoning path better explains the high-quality output. The search proceeds by iterative refinement: an initial, imperfect thinking trajectory is improved segment by segment, with candidate revisions judged by this perplexity score.
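To make the search concrete, here is a minimal Python sketch of perplexity-guided, segment-wise refinement. It is an illustration under stated assumptions, not the paper's implementation: the function names, the greedy accept-if-lower rule, and the fixed candidate table are all hypothetical, and `toy_score` is a keyword-matching stand-in for the language model that would actually score the reference solution.

```python
import math
from typing import Callable, List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean log-probability of the reference solution's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def refine_trajectory(
    segments: Sequence[str],
    propose: Callable[[List[str], int], List[str]],
    score: Callable[[List[str]], float],
    rounds: int = 2,
) -> Tuple[List[str], float]:
    """Greedy local search: edit one segment at a time, keep edits that lower the score."""
    current = list(segments)
    best = score(current)
    for _ in range(rounds):
        for i in range(len(current)):
            for candidate in propose(current, i):
                trial = current[:i] + [candidate] + current[i + 1:]
                trial_score = score(trial)
                if trial_score < best:  # lower perplexity = better explanation
                    current, best = trial, trial_score
    return current, best


# --- Toy stand-ins (assumptions; a real setup would query an LLM) ---
REFERENCE_KEYWORDS = ["lighthouse", "storm", "keeper"]


def toy_score(trajectory: List[str]) -> float:
    """Pretend each reference keyword is likelier when the trajectory mentions it."""
    text = " ".join(trajectory).lower()
    logprobs = [math.log(0.9 if word in text else 0.1) for word in REFERENCE_KEYWORDS]
    return perplexity(logprobs)


CANDIDATES = {
    0: ["Open with the lighthouse keeper watching the sky."],
    1: ["Build to the storm, then resolve it quietly."],
}


def toy_propose(trajectory: List[str], index: int) -> List[str]:
    """Return hypothetical rewrites for one segment (an LLM would generate these)."""
    return CANDIDATES.get(index, [])


initial = ["Write something about the sea.", "End the story."]
refined, best = refine_trajectory(initial, toy_propose, toy_score)
```

In this toy run the search replaces both vague segments, because each replacement makes the reference keywords more probable under the stand-in scorer. With a real model, `score` would compute the perplexity of the reference solution given the query and the trajectory, and `propose` would ask the model for alternative versions of one segment.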
Using this innovative method, the researchers curated and open-sourced a large-scale dataset called DeepWriting-20K. This dataset comprises 20,000 deep reasoning trajectories for a diverse range of open-ended tasks, covering 25 categories including creative writing, academic writing, and functional writing. The creation of DeepWriting-20K involved sourcing query-solution pairs from various platforms, synthesizing reasoning trajectories, and applying rigorous filtering to ensure high quality. A key aspect of the synthesis process was context engineering, which included enforcing segment-wise edits and injecting human-like thinking patterns (e.g., “Hmm… maybe I can…”, “Wait, that’s a bit…”) to encourage more flexible and reflective reasoning.
DeepWriter-8B, a model trained on DeepWriting-20K, was then compared against both open-source and proprietary systems. It significantly outperforms strong open-source baselines and achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5. For instance, on creative tasks in HelloBench, DeepWriter-8B performed on par with GPT-4o and Claude 3.5. On the LongBench-Write benchmark, it exceeded both, suggesting that explicit training on structured thinking trajectories provides a powerful advantage for maintaining long-range coherence in ultra-long text generation.
Ablation studies further confirmed the importance of REER’s components. Removing the synthesized data led to the most significant performance drop, highlighting the critical role of high-quality, tailored reasoning trajectories. The iterative refinement process also proved crucial, as using unrefined trajectories resulted in a clear decline in performance. The inclusion of reflection tokens during synthesis was particularly beneficial for creative writing tasks, fostering flexibility and creativity. The research also found that training on creative and narrative tasks imparted a more generalizable ability to handle nuance and open-endedness, benefiting even more technical domains.
Qualitative analysis revealed that DeepWriter-8B exhibits a strong and well-rounded reasoning profile, demonstrating significant improvements in problem deconstruction, logical consistency, depth of analysis, presentation clarity, and factual grounding compared to open-source baselines. Its reasoning profile closely rivals that of GPT-4o. The injection of human-like thinking patterns during synthesis also led to a more diverse and balanced use of reasoning phrases by the model, indicating a more nuanced and reflective problem-solving approach.
In conclusion, REER offers a scalable, cost-effective, and automatable “third path,” alongside reinforcement learning and distillation, for cultivating sophisticated reasoning in LLMs for open-ended, non-verifiable domains. By open-sourcing its dataset, the work makes capabilities previously confined to large-scale, proprietary systems more widely accessible, paving the way for more powerful and scalable models in creative generation.


