Automating Suspect Sketching with Generative AI

TLDR: A research project explored three AI models for generating police sketches from text descriptions and initial sketches. While a novel approach using LoRA fine-tuned CLIP was developed, the baseline Stable Diffusion model surprisingly achieved the best performance in terms of structural and perceptual similarity, and clarity of facial features, highlighting its robustness. The study also confirmed that fine-tuning both self- and cross-attention layers in CLIP improved text-image alignment.

Accurate police sketches are crucial in law enforcement, especially when no photographic evidence is available. Traditionally, sketches are drawn by hand, a process that can be both time-consuming and inconsistent. Recent advances in artificial intelligence (AI) offer a way to automate and improve this process, with the aim of making it more efficient and reliable.

A recent research project, titled Gen-AI Police Sketches with Stable Diffusion, explores multimodal AI approaches to automating and improving suspect sketching. The researchers, Aaron Contreras, Nico Fidalgo, Katherine Harvey, and Johnny Ni from Harvard College, developed and evaluated three distinct AI pipelines.

The Three AI Approaches

The project investigated three Stable Diffusion models, each tasked with generating police sketches from multimodal inputs, which include both text descriptions and initial sketches:

1. Baseline Stable Diffusion Model: This foundational model (specifically, runwayml/stable-diffusion-v1-5) directly generates sketches from an input sketch. It serves as a robust starting point for comparison.

2. Stable Diffusion with Pre-trained CLIP: This approach integrates a pre-trained CLIP (Contrastive Language–Image Pre-training) model (openai/clip-vit-base-patch32) with the Stable Diffusion model. The goal here is to enhance the alignment between text descriptions and the generated images, improving semantic accuracy.

3. Novel Approach with Fine-tuned CLIP: This is the project’s most innovative contribution. It involves fine-tuning the CLIP model using a technique called LoRA (Low-Rank Adaptation). The fine-tuning specifically targets both the self-attention and cross-attention layers of the CLIP model. This allows the model to better capture nuanced relationships between text descriptions and sketches, and it is then integrated into the Stable Diffusion pipeline.
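To make the LoRA idea concrete, here is a minimal sketch of the low-rank update on a single weight matrix. It is an illustration under stated assumptions, not the project's training code: the matrix sizes, the adapter values, and the function names (matmul, lora_effective_weight, lora_param_count) are all hypothetical, and pure Python lists stand in for real model tensors.

```python
# Minimal LoRA sketch in pure Python (no ML libraries), for illustration only.
# Assumption: a single frozen projection matrix W; all names and values are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank."""
    r = len(A)  # A has shape (r, d_in); B has shape (d_out, r)
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

def lora_param_count(d_out, d_in, r):
    """Number of trainable values in one adapter pair (A and B)."""
    return r * (d_in + d_out)

# Frozen 3x3 weight (identity, for readability). It is never modified.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Rank-1 adapter. In LoRA, B is initialised to zero so the adapted layer
# starts out identical to the frozen one.
A = [[0.5, -0.5, 1.0]]
B_init = [[0.0], [0.0], [0.0]]
B_trained = [[1.0], [0.0], [2.0]]  # made-up values standing in for training

W_start = lora_effective_weight(W, A, B_init, alpha=1.0)
W_adapted = lora_effective_weight(W, A, B_trained, alpha=1.0)

print(W_start == W)   # the zero-initialised adapter leaves W unchanged
print(W_adapted)
print(lora_param_count(512, 512, 8), "trainable values vs", 512 * 512, "in the full matrix")
```

Only A and B are trained; the original weights stay frozen. For a hypothetical 512 x 512 projection with rank 8, that is 8,192 trainable values instead of 262,144, which is why this kind of fine-tuning is practical on a small dataset.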

Dataset and Iterative Refinement

To train and evaluate these models, the researchers curated a dataset of 295 (description, sketch) pairs from the CUHK Face Sketch FERET Database (CUFSF). To ensure consistency, structured text descriptions were generated using ChatGPT-4, following a template format like “The suspect is described as [demographic] with [physical attributes]…”.
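As a rough illustration of what a structured description could look like, the sketch below fills in only the part of the template quoted above. The function name build_description and the example attributes are hypothetical, and the remainder of the template (elided in the quote) is not reproduced here.

```python
# Sketch of filling the quoted part of the description template.
# Assumption: the function name and example values are illustrative only;
# the project generated its descriptions with ChatGPT-4, not with this code.

def build_description(demographic, physical_attributes):
    """Build a one-sentence description from a demographic and a list of attributes."""
    attrs = [a.strip() for a in physical_attributes if a.strip()]
    if not attrs:
        raise ValueError("at least one physical attribute is required")
    if len(attrs) == 1:
        joined = attrs[0]
    else:
        joined = ", ".join(attrs[:-1]) + " and " + attrs[-1]
    return f"The suspect is described as {demographic.strip()} with {joined}."

example = build_description(
    "a middle-aged man",
    ["short dark hair", "a narrow jaw", "thick eyebrows"],
)
print(example)
```

Keeping every description in the same shape means differences between prompts reflect differences between faces rather than differences in phrasing, which is the consistency the researchers were after.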

A key feature across all models is the capability for iterative refinement. This allows users to dynamically improve sketches over successive iterations by updating embeddings from text and image inputs and adjusting prompts. This process aims to enhance usability and accuracy, streamlining the sketch generation process.
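The refinement loop can be pictured roughly as follows. This is a sketch under assumptions: RefinementSession and fake_generate are hypothetical names, and the stand-in generator only records what it was given, where a real pipeline would run the diffusion model on the current sketch and prompt.

```python
# Sketch of an iterative refinement loop with a stand-in generator.
# Assumption: all names are hypothetical; fake_generate replaces the real
# image model so the control flow can be run without any ML dependencies.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RefinementSession:
    generate: Callable[[str, dict], dict]  # (prompt, current sketch) -> new sketch
    prompt: str
    sketch: dict
    history: List[str] = field(default_factory=list)

    def refine(self, adjustment: str = "") -> dict:
        """Optionally extend the prompt, then regenerate from the current sketch."""
        if adjustment:
            self.prompt = f"{self.prompt} {adjustment}".strip()
        self.sketch = self.generate(self.prompt, self.sketch)
        self.history.append(self.prompt)
        return self.sketch

def fake_generate(prompt: str, previous: dict) -> dict:
    """Stand-in for the image model: records the prompt and counts passes."""
    return {"prompt": prompt, "passes": previous["passes"] + 1}

session = RefinementSession(
    generate=fake_generate,
    prompt="The suspect is described as a man with short hair.",
    sketch={"prompt": None, "passes": 0},
)
session.refine()                                         # first pass, prompt unchanged
session.refine("He has a scar above the left eyebrow.")  # second pass, adjusted prompt
print(session.sketch["passes"], session.history)
```

Each pass starts from the previous output rather than from scratch, so a witness or operator can correct one detail at a time.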

Key Findings and Performance

An important part of the research involved an ablation study to determine which layers of the CLIP model to fine-tune. The study confirmed that fine-tuning both self-attention and cross-attention layers yielded the best visual quality and alignment between text descriptions and image features.

The added complexity of the CLIP-based models did not translate into better scores. The simplest model, Model 1 (the baseline Stable Diffusion model), achieved the highest structural similarity (SSIM) of 0.72 and a peak signal-to-noise ratio (PSNR) of 25 dB; higher is better for both. These metrics indicate superior alignment with ground truth images and reduced distortion, respectively. Model 1 also maintained the highest CLIP score, demonstrating strong text-image alignment, and consistently achieved the lowest LPIPS (Learned Perceptual Image Patch Similarity) values, where lower values indicate closer perceptual resemblance to ground truth images.
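For readers unfamiliar with PSNR, the sketch below computes it from the mean squared error between two small grayscale images. It assumes 8-bit pixel values (0 to 255), which the article does not state, and the helper names and the tiny example images are made up for illustration.

```python
# PSNR from mean squared error, in pure Python.
# Assumption: 8-bit grayscale images (max value 255) stored as nested lists;
# the function names and example pixel values are illustrative only.
import math

def mse(reference, generated):
    """Mean squared error between two equally sized grayscale images."""
    total, count = 0.0, 0
    for row_r, row_g in zip(reference, generated):
        for p_r, p_g in zip(row_r, row_g):
            total += (p_r - p_g) ** 2
            count += 1
    return total / count

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, generated)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

def rmse_for_psnr(psnr_db, max_value=255.0):
    """Root-mean-square pixel error that corresponds to a given PSNR."""
    return max_value / (10.0 ** (psnr_db / 20.0))

reference = [[0, 128], [255, 64]]
close = [[2, 126], [253, 66]]     # every pixel off by 2
far = [[20, 108], [235, 84]]      # every pixel off by 20

psnr_close = psnr(reference, close)
psnr_far = psnr(reference, far)
print(round(psnr_close, 2), round(psnr_far, 2))
print(round(rmse_for_psnr(25.0), 2))
```

Under the 8-bit assumption, a PSNR of 25 dB corresponds to a root-mean-square error of roughly 14 gray levels per pixel. Multiplying the per-pixel error by ten lowers PSNR by 20 dB, as the two example images show.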

While Model 3, the novel approach with fine-tuned CLIP, showed some improvements over Model 2 in CLIP scores and LPIPS, it still trailed behind Model 1 across most metrics. Qualitatively, sketches generated by Model 1 demonstrated the clearest facial features, highlighting its robustness as a baseline despite its simplicity.

Conclusion and Future Directions

This project makes a significant contribution by presenting a novel AI-driven approach for police sketch generation, leveraging multimodal inputs and iterative refinement. It offers a promising alternative to traditional manual sketch artistry, addressing limitations in efficiency and consistency.

The study underscored the importance of balancing model complexity with performance and the crucial role of data consistency. While the baseline Model 1 currently outperforms the others, the iterative refinement step shows promise for Model 3, suggesting that with further optimization, it could surpass Model 1 in specific applications. Future work includes expanding the dataset, investigating limitations imposed by the 77-token input restriction in CLIP models (which can hinder capturing nuanced facial differences), and exploring further iterative refinement strategies, such as using masking to focus on specific facial features.
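The effect of the 77-token limit can be sketched as follows. CLIP's text encoder has 77 positions, two of which are taken by start and end markers, so anything past the remaining budget is cut off. The function name and the placeholder tokens are hypothetical; a real pipeline would use CLIP's byte-pair tokenizer, which usually produces more tokens than there are words.

```python
# Sketch of how a fixed context length truncates a long description.
# Assumption: tokens are placeholder strings; real CLIP tokenization uses
# byte-pair encoding, so word counts underestimate the true token count.

CLIP_CONTEXT_LENGTH = 77  # total positions, including start and end markers

def split_at_context_limit(tokens, context_length=CLIP_CONTEXT_LENGTH):
    """Return (kept, dropped) token lists after reserving two marker positions."""
    budget = context_length - 2
    return tokens[:budget], tokens[budget:]

# A hypothetical 90-token description.
tokens = [f"tok{i}" for i in range(90)]
kept, dropped = split_at_context_limit(tokens)
print(len(kept), "tokens reach the encoder;", len(dropped), "are discarded")
```

Details placed late in a long description are the ones lost, which is one reason nuanced facial differences may never reach the model.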
