TLDR: FocusDPO is a new AI framework that significantly enhances personalized image generation, particularly for images containing multiple subjects. It uses a dynamic attention mechanism to adaptively focus on critical regions of an image during training, based on semantic complexity and detail preservation. This approach effectively prevents subjects from blending together (attribute leakage) and maintains their individual fidelity, leading to higher quality and more consistent generated images across various scenarios.
Creating personalized images with artificial intelligence has seen remarkable progress, especially with the rise of diffusion models. These models can now generate high-quality images featuring specific subjects. However, a significant challenge remains when trying to generate images with multiple distinct subjects while maintaining their individual characteristics without them blending together or losing detail. This is where a new framework called FocusDPO steps in.
FocusDPO, which builds on Direct Preference Optimization (DPO) with a dynamic focus mechanism, is designed to tackle the complexities of multi-subject personalized image generation. The core problem it addresses is the difficulty of achieving fine-grained, independent control over multiple subjects. Existing methods often struggle with ‘cross-subject attribute leakage,’ where features from one subject inadvertently influence another, leading to inconsistent or corrupted images. Additionally, preserving the precise details of each subject becomes harder as more subjects are introduced, especially if they share similar visual traits.
The key innovation of FocusDPO lies in its adaptive focus mechanism. Unlike previous approaches that apply uniform optimization across an entire image, FocusDPO intelligently identifies and prioritizes ‘focus regions’ during the training process. These regions are characterized by high semantic complexity and areas where preserving fine details is crucial. By dynamically adjusting these focal areas across different noise levels during image generation, the model can concentrate its learning resources on the most challenging parts of the image.
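To make the idea of noise-level-dependent focus concrete, here is a minimal sketch of how a focus map might be scheduled across diffusion timesteps. This is a hypothetical illustration, not the paper's actual mechanism: the function name, the linear blend, and the two input saliency maps (`structure_map`, `detail_map`) are all assumptions.

```python
import numpy as np

def focus_weights(structure_map, detail_map, t, T):
    """Hypothetical focus schedule: at high noise levels (large t), weight
    coarse structural regions; at low noise levels, weight fine-detail
    regions. Inputs are assumed non-negative saliency maps of equal shape."""
    alpha = t / T  # fraction of remaining noise, in [0, 1]
    blended = alpha * structure_map + (1.0 - alpha) * detail_map
    return blended / blended.sum()  # normalize so weights sum to 1
```

Under this toy schedule, the optimizer would attend to global layout early in denoising and shift toward texture- and detail-rich regions as the image sharpens.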
The framework employs a weighted strategy that rewards information-rich patches while penalizing regions with low prediction confidence. This dynamic adjustment of focus is based on the semantic complexity of the reference images and helps establish robust correspondence mappings between the generated and original subjects. This means the model learns to keep each subject’s identity consistent, even in diverse and complex generation scenarios.
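The weighted strategy can be sketched as a per-patch-weighted DPO objective. The following is an illustrative form only, assuming the paper's exact loss is not reproduced here: per-patch log-probability differences against a frozen reference model are weighted by a focus map before entering the standard DPO sigmoid term.

```python
import numpy as np

def weighted_dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose,
                      focus_weights, beta=0.1):
    """Sketch of a focus-weighted DPO loss (hypothetical formulation).
    Each argument holds per-patch log-probabilities, shape (batch, patches);
    focus_weights emphasizes information-rich patches."""
    # Weight per-patch advantages over the reference model by the focus map.
    adv_win = np.sum(focus_weights * (logp_win - ref_logp_win), axis=-1)
    adv_lose = np.sum(focus_weights * (logp_lose - ref_logp_lose), axis=-1)
    margin = beta * (adv_win - adv_lose)
    # -log sigmoid(x) = log(1 + exp(-x)), computed stably via logaddexp.
    return np.mean(np.logaddexp(0.0, -margin))
```

When the preferred (positive) sample scores higher in the focus regions, the margin is positive and the loss drops below log 2 (the chance-level value); penalizing low-confidence regions corresponds to down-weighting their patches in `focus_weights`.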
FocusDPO introduces two main components to achieve this adaptive attention: a **Structure-Preserving Attention Field** and a **Detail-Preserving Complexity Estimator**. The Structure-Preserving Attention Field helps to prevent subject confusion by focusing on semantic relationships between the generated image and the reference images. The Detail-Preserving Complexity Estimator, on the other hand, identifies regions of high visual complexity (like intricate textures or facial details) and prioritizes them during optimization. This ensures that fine-grained details are accurately preserved.
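A rough intuition for the Detail-Preserving Complexity Estimator is that texture-rich patches carry more gradient energy than flat ones. The sketch below scores patches by mean gradient magnitude; this is a stand-in heuristic, not the paper's estimator, and the patch size and normalization are assumptions.

```python
import numpy as np

def complexity_map(image, patch=8):
    """Hypothetical detail-complexity estimator: score each patch of a
    grayscale image by its mean gradient magnitude, normalized to [0, 1].
    High scores flag texture- and detail-rich regions to prioritize."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    h, w = mag.shape
    h2, w2 = h // patch, w // patch
    # Group pixels into (patch x patch) tiles and average within each tile.
    tiles = mag[:h2 * patch, :w2 * patch].reshape(h2, patch, w2, patch)
    scores = tiles.mean(axis=(1, 3))
    rng = scores.max() - scores.min()
    return (scores - scores.min()) / rng if rng > 0 else np.zeros_like(scores)
```

In a real pipeline, such a map would likely be computed in feature space rather than pixel space, but the principle is the same: optimization effort concentrates where the score is high.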
To train this system effectively, the researchers developed a unique dataset called the Disrupted-Instance Pair (DIP) Dataset. This dataset consists of semantically aligned positive and negative image pairs. Positive samples maintain strong subject identity, while negative samples are created by introducing controlled semantic disruptions to subject regions, ensuring the model learns to distinguish between consistent and inconsistent generations.
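The DIP construction can be illustrated schematically: the positive sample is a faithful image, and the negative sample disrupts only the subject region. The disruption below (additive noise inside a subject mask) is a deliberately simple stand-in for the paper's controlled semantic disruptions.

```python
import numpy as np

def make_dip_pair(image, subject_mask, seed=None):
    """Illustrative Disrupted-Instance Pair construction (toy version).
    positive: the original image, identity intact.
    negative: the same image with a disruption confined to the subject
    region, so the model learns that only subject-region corruption
    distinguishes inconsistent generations."""
    rng = np.random.default_rng(seed)
    positive = image.copy()
    negative = image.copy()
    noise = rng.normal(0.0, 0.5, size=image.shape)
    negative[subject_mask] += noise[subject_mask]  # background untouched
    return positive, negative
```

Training on such pairs gives the preference objective a clean signal: the two samples agree everywhere except where subject identity lives.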
Extensive experiments have shown that FocusDPO significantly enhances the performance of existing personalized generation models. It achieves state-of-the-art results on both single-subject and multi-subject personalized image synthesis benchmarks. The method effectively mitigates attribute leakage and preserves superior subject fidelity across various generation scenarios, marking a significant advancement in controllable multi-subject image synthesis.
For those interested in the technical details, the full research paper can be found here.