TLDR: A new method called DGPST (Domain Generalizable Portrait Style Transfer) is introduced, which uses a dual-conditional diffusion model to achieve high-quality, semantic-aware style transfer between any two portraits, even across different domains like photos, cartoons, and sketches. It establishes dense semantic correspondence and employs an AdaIN-Wavelet transform to balance content preservation and stylization, outperforming previous methods in visual quality and quantitative metrics.
Portrait style transfer, a fascinating area in image editing, allows us to apply the visual characteristics of one portrait to another. Imagine transforming a regular photo into a cartoon, modernizing an old family picture, or adding color to a sketch. While this sounds straightforward, it’s quite challenging because it requires precise adjustments to different facial regions like skin, lips, eyes, hair, and background, all while keeping the person’s identity and facial structure intact.
Introducing a New Era of Portrait Style Transfer
A recent research paper titled “Domain Generalizable Portrait Style Transfer” introduces a novel method that significantly advances this field. Developed by Xinbo Wang, Wenju Xu, Qing Zhang, and Wei-Shi Zheng, this approach tackles the limitations of previous methods, which often struggled to adapt to portraits from diverse domains or maintain semantic alignment across different facial structures.
The core innovation lies in its ability to generalize across various domains, meaning it can seamlessly transfer styles between a wide range of portraits, including photos, cartoons, sketches, and animations. This is achieved even when trained on a relatively small dataset of 30,000 portrait photos.
How Does It Work?
The method, referred to as DGPST, employs a sophisticated framework built upon a dual-conditional diffusion model. Here’s a simplified breakdown of its key components:
- Semantic-Aware Style Alignment: Unlike many existing methods, DGPST focuses on establishing a dense semantic correspondence between the input portrait (the one you want to style) and the reference portrait (the one providing the style). This is done using a pre-trained model and a specialized semantic adapter. This step essentially "warps" the reference portrait so its features match the input portrait, eyes aligning with eyes and hair with hair. This ensures that the style is transferred accurately to the correct facial regions (a rough sketch of this matching step follows the list).
- AdaIN-Wavelet Transform for Latent Initialization: Color tone is crucial for defining artistic style. If the process starts from the input image's original color, it tends to retain that color, limiting the style transfer effect. To overcome this, the researchers devised an AdaIN-Wavelet transform. This technique blends the low-frequency information (overall color and smooth features) from the warped reference with the high-frequency information (sharp details) from the input. The blend balances adopting the new style's colors against preserving the original portrait's fine details, preventing blurry results (see the wavelet-blending sketch after the list).
- Dual-Conditional Diffusion Model: The final generation process uses a powerful diffusion model that takes two types of guidance: structure and style. A component called ControlNet extracts high-frequency details from the input image, providing structural guidance to maintain the portrait's form. Simultaneously, a style adapter uses the warped reference to provide style guidance, ensuring the artistic elements are accurately applied. This dual guidance system helps create realistic and visually coherent stylized portraits (an analogous pipeline built from public components is sketched after the list).
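To make the alignment step concrete, here is a minimal Python sketch of dense correspondence via nearest-neighbour feature matching. It assumes you already have feature maps for both portraits from some pre-trained encoder; the paper's actual extractor and semantic adapter are more involved, and `warp_reference` is an illustrative helper, not code from the paper.

```python
import torch
import torch.nn.functional as F

def warp_reference(input_feats, ref_feats, ref_image):
    """Warp the reference portrait onto the input's layout by matching
    each input location to its most similar reference location.

    input_feats, ref_feats: (C, H, W) feature maps from a pre-trained encoder.
    ref_image: (3, H', W') reference pixels.
    """
    C, H, W = input_feats.shape
    fi = F.normalize(input_feats.reshape(C, -1), dim=0)  # (C, HW)
    fr = F.normalize(ref_feats.reshape(C, -1), dim=0)    # (C, HW)
    sim = fi.t() @ fr                                    # (HW, HW) cosine similarity
    match = sim.argmax(dim=1)                            # best reference index per input pixel

    # Bring the reference to feature resolution, then gather matched pixels.
    ref_small = F.interpolate(ref_image[None], size=(H, W), mode="bilinear")[0]
    warped = ref_small.reshape(3, -1)[:, match].reshape(3, H, W)
    return warped
```

The wavelet blend itself is easy to prototype with PyWavelets. This sketch applies AdaIN only on the low-frequency (LL) band, which matches the idea described above; the paper's exact AdaIN-Wavelet formulation may differ in detail, and the function names here are assumptions.

```python
import numpy as np
import pywt

def adain(x, y, eps=1e-5):
    """Shift x's per-channel mean/std to match y's (Adaptive Instance Norm)."""
    mu_x, std_x = x.mean(axis=(-2, -1), keepdims=True), x.std(axis=(-2, -1), keepdims=True)
    mu_y, std_y = y.mean(axis=(-2, -1), keepdims=True), y.std(axis=(-2, -1), keepdims=True)
    return (x - mu_x) / (std_x + eps) * std_y + mu_y

def adain_wavelet_blend(content, warped_ref, wavelet="haar"):
    """Low-frequency band adopts the reference's colour statistics;
    high-frequency bands keep the content's sharp details.

    content, warped_ref: float arrays of shape (3, H, W) in [0, 1], even H and W.
    """
    ll_c, (lh_c, hl_c, hh_c) = pywt.dwt2(content, wavelet, axes=(-2, -1))
    ll_r, _ = pywt.dwt2(warped_ref, wavelet, axes=(-2, -1))

    ll_blend = adain(ll_c, ll_r)  # colour/tone from the warped reference
    return pywt.idwt2((ll_blend, (lh_c, hl_c, hh_c)), wavelet, axes=(-2, -1))
```

The paper trains its own ControlNet and style adapter on top of a diffusion backbone, which are not publicly interchangeable parts. Purely as an analogue of the dual-conditioning idea, here is how structure plus style guidance can be wired up with off-the-shelf Hugging Face diffusers components (a canny ControlNet for structure and IP-Adapter for style, neither of which is the authors' model):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Structural branch: a public edge-conditioned ControlNet.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Style branch: an image-prompt adapter as a stand-in for the paper's style adapter.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")

structure = Image.open("input_edges.png")    # high-frequency/edge map of the input
style = Image.open("warped_reference.png")   # semantically aligned reference

result = pipe(prompt="a portrait", image=structure,
              ip_adapter_image=style, num_inference_steps=30).images[0]
result.save("stylized.png")
```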
Impressive Results and Versatility
Extensive experiments demonstrate that this new method significantly outperforms previous state-of-the-art techniques. It achieves superior visual quality, better content preservation, and more accurate identity preservation. The method’s domain generalizability is particularly noteworthy, as it performs exceptionally well on mixed datasets containing portraits from various sources.
Beyond general style transfer, DGPST offers remarkable versatility:
- Controllable Region-Specific Style Transfer: Users can choose to apply style transfer to specific facial regions, such as just the hair, face, or lips, allowing for fine-grained control over the output (a simple masking sketch appears after this list).
- Style Interpolation: The method supports continuous style interpolation, enabling users to adjust the strength of the stylization, from subtle changes to dramatic transformations (see the interpolation one-liner after this list).
- Cross-Domain Applications: It can effectively colorize grayscale and sketch portraits using colored reference images and even modernize old photographs by restoring their color and style.
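Region-specific control can be approximated with a soft mask blend, where a face-parsing network supplies a mask for the region you want restyled. This is a generic sketch of the idea, not the paper's implementation:

```python
import torch

def masked_style_blend(stylized, original, region_mask):
    """Keep the stylized result only inside the chosen region (e.g. a hair
    mask from a face-parsing network); everything else stays untouched.

    stylized, original: (3, H, W) images; region_mask: (1, H, W) in [0, 1].
    """
    return region_mask * stylized + (1 - region_mask) * original
```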
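Style interpolation, in turn, falls out naturally from the latent initialization: linearly interpolating between a content-derived starting latent and the AdaIN-Wavelet-blended one gives a continuous strength knob. A one-line sketch, with variable names assumed:

```python
def interpolate_style(content_init, stylized_init, alpha):
    """alpha = 0 keeps the content's own initialization (no style change);
    alpha = 1 uses the fully stylized initialization."""
    return (1 - alpha) * content_init + alpha * stylized_init
```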
Furthermore, the model is efficient, processing 512×512 resolution images in approximately 6.97 seconds on an NVIDIA RTX 4090 GPU, making it faster than many other diffusion-based methods.
This research marks a significant step forward in portrait style transfer, offering a robust, high-quality, and highly generalizable solution for transforming portraits across diverse artistic domains. For more technical details, refer to the full research paper.