TLDR: SHERLOCK is a novel framework that uses large language models (LLMs) to improve e-commerce risk management. It addresses challenges like high workload, inconsistent judgments, and evolving fraud patterns by integrating a dynamic Domain Knowledge Base, a Data Flywheel for continuous learning, and a Reflect & Refine (R&R) module. Experiments on JD.com data show SHERLOCK significantly boosts the precision of risk assessments, reduces LLM hallucinations, and drastically improves operational efficiency and expert trust, enabling faster and more accurate fraud detection.
The rapid expansion of e-commerce has brought with it a significant challenge: the escalating battle against sophisticated fraud and shadow economy activities. Risk management teams are constantly overwhelmed by the sheer volume of suspicious cases, each demanding meticulous investigation and deep expert knowledge. This intensive manual process leads to substantial workloads for analysts, inconsistencies in judgment, and slow adaptation to new fraud patterns.
To address these critical issues, researchers have introduced SHERLOCK, an innovative framework designed to enhance e-commerce risk management by leveraging the advanced reasoning capabilities of large language models (LLMs). SHERLOCK aims to provide dynamic knowledge adaptation, making risk investigations more efficient, accurate, and consistent.
Understanding SHERLOCK’s Core Components
The SHERLOCK framework is built upon three primary components that work together to create a robust and adaptive system:
1. Domain Knowledge Base (KB): This is the brain of the system, a comprehensive repository of risk management knowledge. It’s constructed by extracting valuable insights from various sources, including business documents, meeting recordings, and even code repositories. This multi-modal data is transformed into structured knowledge covering specialized terminology (e.g., clarifying specific platform services), complex business logic (e.g., understanding why certain transaction patterns are normal in specific contexts), and evolving risk patterns (e.g., identifying new fraud indicators and their thresholds).
2. Data Flywheel: This component establishes a continuous learning loop that integrates daily operations, expert feedback, and model evaluations. It is designed to generate high-quality training data for LLMs at minimal cost. Cases where LLM-generated conclusions fall short of expectations are prioritized for expert annotation, so that annotation effort goes to the most informative samples. A “selection-over-creation” strategy simplifies annotation: experts select and refine LLM-generated insights rather than writing them from scratch. In addition, a “suspect-then-rule-out” framework guides the LLM to simulate expert reasoning, improving its analysis over time.
3. Reflect & Refine (R&R) Module: This module acts as a critical post-analysis inspection layer. Its main functions are to mitigate hallucinations (incorrect or fabricated information) by the LLM and to enable rapid adaptation to emerging risk patterns. It achieves this by retrieving relevant knowledge from the Domain KB to fact-check and refine the LLM’s initial risk assessments. The R&R module also supports real-time “hotfixes,” allowing for immediate updates to the knowledge base with new business logic or policy adjustments without requiring a full model retraining, ensuring the system remains agile in a dynamic environment.
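To make the interplay of the three components more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class and function names (KnowledgeBase, Claim, reflect_and_refine), the keyword-based retrieval, the "normal:" rule prefix, and the example cases are all assumptions made for illustration. It only shows the general idea of checking LLM claims against a knowledge base, applying a hotfix without retraining, and queueing contradicted claims for expert review.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KnowledgeBase:
    """Toy stand-in for the Domain KB: maps a topic keyword to a rule sentence."""
    rules: Dict[str, str] = field(default_factory=dict)

    def hotfix(self, topic: str, rule: str) -> None:
        # A "hotfix" here is just an in-place KB update; no model is retrained.
        self.rules[topic] = rule

    def retrieve(self, claim_text: str) -> List[str]:
        # Naive keyword match (assumption); a real system would likely use
        # a more capable retrieval method.
        text = claim_text.lower()
        return [rule for topic, rule in self.rules.items() if topic in text]


@dataclass
class Claim:
    text: str
    risky: bool  # the LLM's initial verdict for this risk factor


def reflect_and_refine(
    claims: List[Claim],
    kb: KnowledgeBase,
    annotation_queue: List[Tuple[Claim, List[str]]],
) -> List[Claim]:
    """Keep claims the KB does not contradict; queue contradicted ones for experts."""
    refined = []
    for claim in claims:
        evidence = kb.retrieve(claim.text)
        # Hypothetical convention: a rule starting with "normal:" rules a pattern out.
        ruled_out = any(rule.startswith("normal:") for rule in evidence)
        if claim.risky and ruled_out:
            # Flywheel idea: disagreements are the cases worth an expert's time.
            annotation_queue.append((claim, evidence))
        else:
            refined.append(claim)
    return refined


# Usage with invented example cases.
kb = KnowledgeBase()
queue: List[Tuple[Claim, List[str]]] = []
claims = [
    Claim("Bulk purchase of 40 identical phones", risky=True),
    Claim("Shipping address changed after payment", risky=True),
]

before = reflect_and_refine(claims, kb, queue)  # empty KB: nothing is ruled out
kb.hotfix("bulk purchase", "normal: bulk purchase is expected for enterprise accounts")
after = reflect_and_refine(claims, kb, queue)   # the hotfix takes effect immediately
```

In this sketch the second pass drops the bulk-purchase claim as soon as the rule is added, which mirrors the article's point that knowledge updates can change the system's behavior without a full model retraining.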
Real-World Impact and Performance
Experiments conducted on a real-world transaction dataset from JD.com demonstrated SHERLOCK’s significant impact. The framework substantially improved the precision of LLM analysis results, both in factual alignment and in accurately pinpointing risks. Compared to traditional methods and even powerful general-purpose LLMs, SHERLOCK showed a dramatic increase in its Signal-to-Noise Ratio (SNR), meaning its output contained far more relevant risk factors relative to irrelevant ones.
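As a rough illustration of the SNR idea, the sketch below assumes SNR is simply the count of relevant risk factors divided by the count of irrelevant ones in an analysis. The paper's exact definition may differ, and the counts used here are invented for the example rather than taken from the experiments.

```python
def signal_to_noise(relevant: int, irrelevant: int) -> float:
    """Ratio of relevant to irrelevant risk factors (assumed definition)."""
    if relevant < 0 or irrelevant < 0:
        raise ValueError("counts must be non-negative")
    if irrelevant == 0:
        # No noise at all: treat the ratio as unbounded if any signal exists.
        return float("inf") if relevant > 0 else 0.0
    return relevant / irrelevant


# Invented counts: an analysis listing 3 relevant and 6 irrelevant factors,
# versus one listing 4 relevant and 1 irrelevant factor.
baseline = signal_to_noise(relevant=3, irrelevant=6)
improved = signal_to_noise(relevant=4, irrelevant=1)
```

Under this assumed definition, an analysis that lists fewer irrelevant factors scores higher even if it surfaces a similar number of relevant ones, which is why SNR is a useful complement to precision alone.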
Human experts overwhelmingly preferred SHERLOCK’s output, rating it highly for trustworthiness and helpfulness in decision-making. In live A/B tests on JD.com’s platform, the deployment of the SHERLOCK-based LLM system led to remarkable improvements in operational efficiency. Risk managers were able to make decisions 387% faster, and the expert acceptance rate of the LLM’s recommendations soared to 82%, signifying a massive increase in trust and reliability.
A Step Towards Adaptive Risk Management
SHERLOCK represents a significant advancement in applying LLMs to complex, domain-specific challenges like e-commerce risk management. By combining structured domain knowledge, continuous learning, and a reflective refinement process, it creates an evolvable architecture capable of adapting to the ever-changing landscape of online fraud. This framework not only enhances the accuracy and interpretability of risk assessments but also empowers human experts with more efficient and reliable tools. For more details, you can refer to the original research paper.


