
Automating Suspect Sketching with Generative AI

TLDR: A research project compared three Stable Diffusion pipelines for generating police sketches from text descriptions and initial sketches. Although the team developed a novel pipeline built on LoRA fine-tuned CLIP, the baseline Stable Diffusion model scored best on structural similarity, perceptual similarity, and clarity of facial features. An ablation study also found that fine-tuning both the self-attention and cross-attention layers in CLIP gave the best text-image alignment.

Accurate police sketches matter most when no photographic evidence is available. Traditionally, they are drawn by hand, a process that can be both time-consuming and inconsistent. Recent advances in generative AI offer a way to automate parts of this process and make it faster and more consistent.

A recent research project, titled "Gen-AI Police Sketches with Stable Diffusion," explores multimodal AI-driven approaches to automating and improving suspect sketching. The researchers, Aaron Contreras, Nico Fidalgo, Katherine Harvey, and Johnny Ni from Harvard College, developed and evaluated three distinct AI pipelines.

The Three AI Approaches

The project investigated three Stable Diffusion models, each tasked with generating police sketches from multimodal inputs, which include both text descriptions and initial sketches:

1. Baseline Stable Diffusion Model: This foundational model (specifically, runwayml/stable-diffusion-v1-5) directly generates sketches from an input sketch. It serves as a robust starting point for comparison.

2. Stable Diffusion with Pre-trained CLIP: This approach integrates a pre-trained CLIP (Contrastive Language–Image Pre-training) model (openai/clip-vit-base-patch32) with the Stable Diffusion model. The goal here is to enhance the alignment between text descriptions and the generated images, improving semantic accuracy.

3. Novel Approach with Fine-tuned CLIP: This is the project’s most innovative contribution. It involves fine-tuning the CLIP model using a technique called LoRA (Low-Rank Adaptation). The fine-tuning specifically targets both the self-attention and cross-attention layers of the CLIP model. This allows the model to better capture nuanced relationships between text descriptions and sketches, and it is then integrated into the Stable Diffusion pipeline.
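To make the LoRA idea in the third approach more concrete, here is a minimal pure-Python sketch of the low-rank update LoRA applies to a frozen weight matrix: W' = W + (alpha / r) * B * A. The matrix sizes, the `alpha` value, and the function names are illustrative assumptions, not the project's actual settings. In practice this update would be attached to attention projection matrices through a fine-tuning library rather than written by hand.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices stored as nested lists."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "shape mismatch"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w: frozen pre-trained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # shape (d_out, d_in), rank at most r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example (hypothetical values): a 2x2 frozen weight adapted with rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (r, d_in) = (1, 2)
B = [[0.5], [0.0]]  # shape (d_out, r) = (2, 1)
W_adapted = lora_effective_weight(W, A, B, alpha=2.0)
```

Only `A` and `B` are trained, so the number of trainable parameters scales with the rank `r` rather than with the full size of `W`. LoRA typically initializes `B` to zeros, which means the adapted weight starts out identical to the pre-trained one.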

Dataset and Iterative Refinement

To train and evaluate these models, the researchers curated a dataset of 295 (description, sketch) pairs from the CUHK Face Sketch FERET Database (CUFSF). To ensure consistency, structured text descriptions were generated using ChatGPT-4, following a template format like “The suspect is described as [demographic] with [physical attributes]…”.
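As a rough illustration of the templated descriptions, the sketch below fills the two slots that are visible in the quoted template. The function names and example values are hypothetical, and the rest of the template (the part after the ellipsis) is not reproduced because the article does not show it.

```python
from typing import List


def join_attributes(attributes: List[str]) -> str:
    """Join attribute phrases into a readable, comma-separated list."""
    if not attributes:
        raise ValueError("at least one physical attribute is required")
    if len(attributes) == 1:
        return attributes[0]
    if len(attributes) == 2:
        return f"{attributes[0]} and {attributes[1]}"
    return ", ".join(attributes[:-1]) + f", and {attributes[-1]}"


def build_description(demographic: str, physical_attributes: List[str]) -> str:
    """Fill the visible part of the template with structured fields."""
    return (
        f"The suspect is described as {demographic} "
        f"with {join_attributes(physical_attributes)}."
    )


# Hypothetical example values, not taken from the dataset.
example = build_description(
    "a middle-aged man",
    ["short dark hair", "a square jaw", "thick eyebrows"],
)
```

Keeping every description in the same shape is what gives the text side of the dataset its consistency: each prompt differs only in the slot values, not in its phrasing.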

A key feature across all models is the capability for iterative refinement. This allows users to dynamically improve sketches over successive iterations by updating embeddings from text and image inputs and adjusting prompts. This process aims to enhance usability and accuracy, streamlining the sketch generation process.
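One way to picture the refinement loop is shown below, under the assumption that each pass feeds the previous output back in as the next input sketch alongside an adjusted prompt. The `generate` callable is a hypothetical stand-in for any of the three pipelines, and the stub used here only records what it was asked to do; the embedding updates mentioned above are not modeled.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")


def refine_sketch(
    generate: Callable[[str, Image], Image],
    initial_sketch: Image,
    prompts: List[str],
) -> List[Image]:
    """Run one generation pass per prompt, feeding each output into the next pass."""
    history: List[Image] = []
    current = initial_sketch
    for prompt in prompts:
        current = generate(prompt, current)
        history.append(current)
    return history


def fake_generate(prompt: str, image: str) -> str:
    """Stub generator (assumption): tags the image with the prompt it was given."""
    return f"{image} -> [{prompt}]"


steps = refine_sketch(
    fake_generate,
    "sketch_v0",
    ["narrower jaw", "thicker eyebrows"],
)
```

Keeping the full history, rather than only the final image, lets a user step back to an earlier iteration if a later adjustment makes the likeness worse.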

Key Findings and Performance

An important part of the research involved an ablation study to determine which layers of the CLIP model to fine-tune. The study confirmed that fine-tuning both self-attention and cross-attention layers yielded the best visual quality and alignment between text descriptions and image features.

Despite the added complexity of the other two pipelines, the simplest model performed best. Model 1 (the baseline Stable Diffusion model) achieved the highest structural similarity index (SSIM), 0.72, and a peak signal-to-noise ratio (PSNR) of 25 dB. A higher SSIM indicates closer structural alignment with the ground-truth images, and a higher PSNR indicates less distortion. Model 1 also maintained the highest CLIP score, demonstrating strong text-image alignment, and consistently achieved the lowest LPIPS (Learned Perceptual Image Patch Similarity) values, indicating closer perceptual resemblance to the ground-truth images.
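For readers unfamiliar with PSNR, the following sketch computes it from the standard definition, PSNR = 10 * log10(MAX^2 / MSE). It treats images as flat sequences of pixel intensities and assumes 8-bit values (MAX = 255). Both choices are assumptions made for illustration; the article does not describe the project's evaluation code.

```python
import math
from typing import Sequence


def mean_squared_error(reference: Sequence[float], generated: Sequence[float]) -> float:
    """Average squared pixel difference between two equally sized images."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    total = sum((r - g) ** 2 for r, g in zip(reference, generated))
    return total / len(reference)


def psnr(
    reference: Sequence[float],
    generated: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB; higher means less distortion."""
    mse = mean_squared_error(reference, generated)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "images" (hypothetical values): every pixel is off by exactly 1.
reference_pixels = [0, 64, 128, 255]
generated_pixels = [1, 63, 129, 254]
score = psnr(reference_pixels, generated_pixels)  # about 48.13 dB
```

Under the same 8-bit assumption, a PSNR of 25 dB corresponds to a root-mean-square error of roughly 14 gray levels, since 255 / 10^(25/20) is about 14.3.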

While Model 3, the novel approach with fine-tuned CLIP, showed some improvements over Model 2 in CLIP scores and LPIPS, it still trailed behind Model 1 across most metrics. Qualitatively, sketches generated by Model 1 demonstrated the clearest facial features, highlighting its robustness as a baseline despite its simplicity.

Conclusion and Future Directions

The project presents a novel AI-driven approach to police sketch generation that combines multimodal inputs with iterative refinement. It offers a promising alternative to manual sketch artistry, which is limited in both efficiency and consistency.

The study underscored the importance of balancing model complexity with performance and the crucial role of data consistency. While the baseline Model 1 currently outperforms the others, the iterative refinement step shows promise for Model 3, suggesting that with further optimization, it could surpass Model 1 in specific applications. Future work includes expanding the dataset, investigating limitations imposed by the 77-token input restriction in CLIP models (which can hinder capturing nuanced facial differences), and exploring further iterative refinement strategies, such as using masking to focus on specific facial features.
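To show why the 77-token limit matters, the sketch below mimics how a CLIP text input is fitted to its context window: two positions go to the start and end markers, leaving 75 for the description itself. It works on pre-tokenized integer ids instead of calling a real tokenizer, and the function name is hypothetical; the marker ids are the ones used in the OpenAI CLIP vocabulary.

```python
from typing import List, Tuple

CLIP_MAX_LENGTH = 77
BOS_ID = 49406  # <|startoftext|> in the OpenAI CLIP vocabulary
EOS_ID = 49407  # <|endoftext|> in the OpenAI CLIP vocabulary


def fit_to_context(
    content_ids: List[int],
    max_length: int = CLIP_MAX_LENGTH,
) -> Tuple[List[int], int]:
    """Truncate content tokens to fit the window; return (ids, number dropped)."""
    budget = max_length - 2  # reserve room for the start and end markers
    kept = content_ids[:budget]
    dropped = max(0, len(content_ids) - budget)
    return [BOS_ID] + kept + [EOS_ID], dropped


# A hypothetical 100-token description: the last 25 tokens are silently lost.
input_ids, dropped = fit_to_context(list(range(100)))
```

Because truncation removes tokens from the end, details listed last in a long description are the first to be lost, which is one reason attribute ordering and prompt length deserve attention.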

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
