TLDR: CustomEnhancer is a novel zero-shot framework that significantly improves personalized human photo generation. It addresses common issues like degraded scene diversity, insufficient control, and suboptimal identity fidelity in existing text-to-image diffusion models. The framework introduces a zero-shot enhancement pipeline that leverages face swapping and pre-trained diffusion models for richer representations. It features BiMD (Bidirectionally Manipulated Diffusion) for unifying generation and reconstruction, and ResInversion, a new method that reduces image inversion time by 129x. CustomEnhancer also enables training-free controls for personalized models, allowing precise manipulation of both human subjects and environmental elements without retraining. Experiments show state-of-the-art results in scene diversity, identity fidelity, and efficiency, with applications in identity fusion and cartoon character identity generation.
In the rapidly evolving world of artificial intelligence, personalized photo generation has seen remarkable advancements, allowing users to create realistic images of specific individuals from text prompts. However, existing methods often struggle with generating diverse scenes, offering sufficient control over the output, and maintaining a high level of identity fidelity. A new research paper introduces a novel framework called CustomEnhancer, designed to address these very challenges and significantly boost the capabilities of current identity customization models.
CustomEnhancer is a zero-shot enhancement pipeline that acts as a plug-in for existing diffusion-based personalized models like PhotoMaker and InstantID. It aims to improve scene diversity, provide training-free controls, and enhance the perceptual identity fidelity of generated human photos. The framework achieves this through several key innovations, making the process both faster and more versatile.
Enhancing Scene Diversity and Identity Fidelity
One of the core problems CustomEnhancer tackles is the degraded scene generation capability of personalized models. These models, often fine-tuned on face-centric datasets, tend to focus heavily on faces, neglecting backgrounds and bodies. CustomEnhancer leverages the power of pre-trained large-scale text-to-image diffusion models, specifically SDXL, to provide rich and diverse scene representations. It generates detailed scene images with an identity-agnostic human character, guided by text prompts. Additionally, to ensure precise identity preservation, the framework incorporates face swapping techniques. This allows for the injection of concrete perceptual facial features, such as geometric shapes and fine-scale attributes, which neural network-based extractors might miss. By fusing these scene and perceptual identity representations, CustomEnhancer enables the generation of images with complex backgrounds, detailed body features, and plausible human-context interactions, without the common “copy-paste” artifacts seen in other methods.
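The fusion of a rich scene representation with swapped-in perceptual facial features can be pictured, in highly simplified form, as a mask-guided blend of image arrays. The function name and the pixel-space blending below are illustrative assumptions for intuition only; the paper operates on diffusion representations, not raw pixels:

```python
import numpy as np

def fuse_identity_into_scene(scene, swapped_face, face_mask):
    """Blend perceptual facial features into a scene image.

    scene, swapped_face: float arrays of shape (H, W, 3)
    face_mask: float array of shape (H, W, 1), 1.0 inside the face region
    (a toy pixel-space stand-in for the representation fusion in the paper).
    """
    return face_mask * swapped_face + (1.0 - face_mask) * scene

# Tiny 2x2 example: the masked pixel takes the face value,
# the remaining pixels keep the scene values.
scene = np.zeros((2, 2, 3))
face = np.ones((2, 2, 3))
mask = np.zeros((2, 2, 1))
mask[0, 0] = 1.0
fused = fuse_identity_into_scene(scene, face, mask)
print(fused[0, 0], fused[1, 1])  # face pixel vs. untouched scene pixel
```

The mask-guided form makes explicit why this avoids "copy-paste" artifacts only if the two representations are blended smoothly rather than hard-pasted, which is what fusing in the diffusion process (rather than in pixel space) enables.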
Faster Image Inversion with ResInversion
A crucial component in many image editing and generation workflows is the inversion process, which converts a real image back into the latent noise space of a diffusion model. Traditional methods like Null-text Inversion (NTI) are computationally intensive, especially for larger models like SDXL. CustomEnhancer introduces ResInversion, a novel and significantly faster inversion method. ResInversion performs noise rectification using a pre-diffusion mechanism, directly identifying and compensating for noise deficiencies at each step. This innovation reduces the inversion time by an impressive 129 times compared to NTI, making the entire pipeline much more efficient and reducing latency from hours to minutes.
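To see what inversion does and where drift creeps in, here is a minimal DDIM-style inversion/sampling round trip with a toy *constant* noise predictor. With a constant predictor the round trip is exact; a real UNet's prediction depends on the current latent, and that mismatch is precisely the noise deficiency that NTI optimizes away slowly and that, per the paper, ResInversion's pre-diffusion rectification compensates cheaply. The schedule and all names here are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
abar = np.linspace(0.9999, 0.01, T)   # toy cumulative-alpha schedule (t=0 is clean)
eps = rng.standard_normal(4)          # toy constant noise prediction

def pred_x0(x, t):
    # Standard epsilon parameterization: recover the clean latent from x_t.
    return (x - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])

def invert(z0):
    # Deterministic DDIM inversion: clean latent -> noise, step by step.
    x = z0
    for t in range(T - 1):
        x = np.sqrt(abar[t + 1]) * pred_x0(x, t) + np.sqrt(1 - abar[t + 1]) * eps
    return x

def sample(zT):
    # Deterministic DDIM sampling: noise -> clean latent.
    x = zT
    for t in range(T - 1, 0, -1):
        x = np.sqrt(abar[t - 1]) * pred_x0(x, t) + np.sqrt(1 - abar[t - 1]) * eps
    return x

z0 = rng.standard_normal(4)
z0_rec = sample(invert(z0))
err = np.max(np.abs(z0_rec - z0))
print(err)  # near-zero: exact round trip for a constant predictor
```

The efficiency gap follows from this structure: NTI runs an inner optimization loop at every timestep to patch the mismatch, whereas a direct per-step rectification needs only extra forward passes, hence the reported 129x speedup.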
Unified Generation with Bidirectionally Manipulated Diffusion (BiMD)
To seamlessly integrate the diverse scene and perceptual facial features with the personalized model’s customized identity representations, CustomEnhancer employs a unique approach called Bidirectionally Manipulated Diffusion (BiMD). This method unifies the generation and reconstruction processes by identifying and combining two compatible counter-directional latent spaces: a forward (generation) space and a backward (reconstruction) space. By intervening at a pivotal space of the personalized model through these complementary spaces, BiMD allows for the transfer of information from both the model’s customization capabilities and the backward reconstruction, resulting in a unified and high-quality image generation process that avoids artifacts from blending multiple models.
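At a very high level, intervening at a pivotal latent space with two counter-directional sources of information can be pictured as a weighted fusion of a forward (generation) latent and a backward (reconstruction) latent. The linear blend and all names below are assumptions made for illustration; the paper's actual manipulation of the two spaces is more involved:

```python
import numpy as np

def bimd_style_fusion(z_forward, z_backward, weight=0.5):
    """Toy stand-in for intervening at a pivotal latent space.

    z_forward  - latent carrying the personalized model's customized identity
    z_backward - latent carrying the reconstructed scene + perceptual identity
    weight     - how strongly the backward information is injected
    """
    return (1.0 - weight) * z_forward + weight * z_backward

z_gen = np.full((4, 8, 8), 2.0)   # toy "generation" latent
z_rec = np.full((4, 8, 8), 4.0)   # toy "reconstruction" latent
z = bimd_style_fusion(z_gen, z_rec, weight=0.25)
print(z[0, 0, 0])  # 2.5 = 0.75 * 2.0 + 0.25 * 4.0
```

The key point the sketch captures is that both sources act on a single shared latent inside one diffusion process, rather than blending the outputs of multiple separately run models, which is how BiMD avoids blending artifacts.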
Training-Free Controls for Greater Flexibility
Another significant contribution of CustomEnhancer is its ability to provide comprehensive training-free control over the generation process. By integrating pre-trained SDXL’s control modules (like ControlNet for pose or Canny edge detection) into its pipeline, CustomEnhancer eliminates the need for computationally expensive retraining of control modules for each personalized model. This means users can precisely control not only the human subject (e.g., pose) but also non-primary generation targets like environmental elements, a capability often lacking in prior work. This offers controlled photorealistic personalization without the inefficiency of per-model controller retraining.
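Mechanically, ControlNet-style control works by adding conditioning residuals, produced by a control branch fed with a pose or Canny edge map, onto the base UNet's feature maps, scaled by a conditioning strength. Because the residuals are additive, they can be reused without retraining anything per personalized model. A toy sketch of that residual injection (names and shapes assumed):

```python
import numpy as np

def apply_control_residuals(unet_features, control_residuals, scale=1.0):
    """Add control-branch residuals onto matching UNet feature maps.

    Toy version of the ControlNet mechanism: scale=0.0 recovers the
    base model's features unchanged (no control applied).
    """
    return [f + scale * r for f, r in zip(unet_features, control_residuals)]

# Two toy feature maps at different resolutions, plus matching residuals.
feats = [np.ones((8, 8)), np.ones((4, 4))]
resid = [np.full((8, 8), 0.5), np.full((4, 4), 0.5)]

controlled = apply_control_residuals(feats, resid, scale=1.0)
uncontrolled = apply_control_residuals(feats, resid, scale=0.0)
print(controlled[0][0, 0], uncontrolled[0][0, 0])  # 1.5 vs 1.0
```

The additive, externally attached design is what makes the control training-free from CustomEnhancer's perspective: pre-trained SDXL control modules plug into the unified pipeline as-is.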
Real-World Applications and Performance
Experiments demonstrate that CustomEnhancer achieves state-of-the-art results in scene diversity, identity fidelity, and training-free controls. When plugged into existing models like PhotoMaker and InstantID, it significantly enhances their performance across various metrics, including face similarity and scene diversity. The framework also opens doors to novel applications such as identity fusion, allowing for the interpolation between two identities, and generating identities on specific cartoon characters, providing explicit visualization of identity transformation trajectories.
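Identity fusion, interpolating between two identities, is commonly implemented by interpolating their identity embeddings; spherical interpolation (slerp) is a standard choice for unit-norm embeddings because it stays on the unit sphere. The sketch below assumes unit-normalized 512-dimensional identity vectors and is not necessarily the paper's exact scheme:

```python
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b, 0 <= t <= 1."""
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return a  # vectors (nearly) identical: nothing to interpolate
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(1)
id_a = rng.standard_normal(512); id_a /= np.linalg.norm(id_a)
id_b = rng.standard_normal(512); id_b /= np.linalg.norm(id_b)

mid = slerp(id_a, id_b, 0.5)
print(np.linalg.norm(mid))  # interpolated identity stays on the unit sphere
```

Sweeping `t` from 0 to 1 yields a sequence of intermediate embeddings, which is one way to obtain the explicit identity-transformation trajectories the paper visualizes.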
The CustomEnhancer framework represents a significant step forward in personalized photo generation, offering a robust, efficient, and highly controllable method for creating realistic human images with diverse scenes and precise identity preservation. For more technical details, you can refer to the original research paper.