Precise Control Over Human-Object Interactions in Image Generation

TLDR: Interact-Custom introduces a new task, Customized Human Object Interaction Image Generation (CHOI), which focuses on generating images where specific humans and objects interact in user-defined ways while preserving their identities. The paper addresses challenges like data scarcity for disentangling identity and pose features, and ensuring correct spatial configurations for interactions. It proposes a two-stage model: Interaction-Aware Mask Generation (IAMG) to create interaction masks, and Mask Guided Image Generation (MGIG) to synthesize images using these masks and extracted identity features. A large-scale dataset of human-object pairs with varying poses was also collected. Experiments show Interact-Custom significantly improves identity preservation and interaction semantic control compared to previous methods.

Generating realistic images in which specific humans and objects interact in precise ways remains a hard problem in image generation. Existing methods can customize individual subjects within an image, but they often fall short when it comes to controlling how multiple entities interact. A new research paper, “Interact-Custom: Customized Human Object Interaction Image Generation”, introduces an approach aimed at exactly this problem.

The Challenge of Customized Human-Object Interaction

The authors, Zhu Xu, Zhaowen Wang, Yuxin Peng, and Yang Liu, highlight a significant gap in current compositional image generation techniques. These methods excel at preserving the appearance of target subjects but struggle with fine-grained interaction control. Imagine trying to generate an image of a specific person feeding a particular dog. Current models might generate the person and the dog, but the feeding action itself, including the correct spatial arrangement of hand and mouth, often goes wrong. The result is a semantic mismatch between the desired interaction and the generated image.

To address this, the researchers propose a new task: Customized Human Object Interaction Image Generation (CHOI). This task demands two things simultaneously: maintaining the unique identities of both the human and the object, and precisely controlling the semantic interaction between them.

Two primary challenges stand in the way of achieving CHOI:

1. Data Scarcity: To effectively control interactions, a model needs to learn how to separate a subject’s inherent identity features (what makes them unique) from their pose-oriented interaction features (how they move and interact). Existing datasets for human-object interaction typically show static scenes, so they offer little evidence of how the same subject looks across different poses, which makes this separation difficult to learn.

2. Spatial Configuration: Even if a model understands individual identities and poses, the way a human and object are positioned relative to each other is critical for conveying a specific interaction. An incorrect distance or alignment between body parts can completely alter the perceived action. In the feeding example, the human’s hand and the dog’s mouth need to be in close proximity.
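To make the spatial-configuration point concrete, here is a minimal sketch of a proximity check between two keypoints. It is only an illustration: the function names, the pixel coordinates, and the `max_ratio` threshold are assumptions made for this example and do not come from the paper.

```python
import math


def keypoint_distance(a, b):
    """Euclidean distance between two (x, y) points, in pixels."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def looks_like_feeding(hand_xy, mouth_xy, image_diagonal, max_ratio=0.1):
    """Return True if the hand lies close enough to the mouth.

    The distance is normalized by the image diagonal so the check does not
    depend on resolution. max_ratio is an arbitrary illustrative threshold,
    not a value taken from the paper.
    """
    return keypoint_distance(hand_xy, mouth_xy) / image_diagonal <= max_ratio


# Hypothetical keypoints in a 512x512 image.
diagonal = math.hypot(512, 512)
close = looks_like_feeding((250, 300), (270, 310), diagonal)
far = looks_like_feeding((100, 300), (400, 310), diagonal)
```

In this toy setup the first layout passes the check and the second does not, even though both contain the same person and dog. That is the kind of difference a model has to get right for the action to read as "feeding".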

Interact-Custom: A Two-Stage Solution

To overcome these hurdles, the researchers developed a two-pronged solution:

1. A Tailored Large-Scale Dataset: Recognizing the limitations of existing data, the team collected and processed a new, extensive dataset. This dataset is unique because it contains samples of the same human-object pair engaging in different interactive poses. By sourcing data from both images and videos, they created approximately 1 million samples covering a wide array of interaction categories and object types. This rich dataset is specifically designed to help models learn how to disentangle identity features from interaction-specific pose features for both humans and objects.

2. The Interact-Custom Model: This innovative model operates in two stages:

  • Interaction-Aware Mask Generation (IAMG): In the first stage, a diffusion model is used to generate a foreground mask. This mask explicitly outlines the spatial configuration of the human and object as they interact, guided by a text prompt describing the desired action. This ensures that the generated interaction has an appropriate and semantically correct spatial layout. The model can also optionally incorporate a background image and a bounding box to specify where the interaction should occur within a scene.
  • Mask Guided Image Generation (MGIG): The second stage takes the mask generated by IAMG and uses it as a guide to synthesize the final image. It extracts identity features from the input human and object images to ensure their appearance is faithfully preserved. Simultaneously, the generated mask directs the human and object to adopt the correct poses and spatial configuration for the specified interaction. This stage also supports the optional integration of a custom background and precise location for the interaction, allowing for high content controllability.
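The data flow between the two stages can be sketched as follows. This is a toy outline under stated assumptions, not the authors' implementation: every function, class, and parameter name here is hypothetical, the "mask" is a plain grid of 0s and 1s, and both stages are placeholders where the real system would run trained diffusion models.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Mask = List[List[int]]            # 1 = foreground (human or object), 0 = background
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive


@dataclass
class GeneratedImage:
    """Placeholder for the final output; a real system would return pixels."""
    mask: Mask
    human_identity: str
    object_identity: str
    background: Optional[str]


def generate_interaction_mask(prompt: str, size: Tuple[int, int],
                              box: Optional[Box] = None) -> Mask:
    """Stage 1 stand-in (IAMG).

    A real model would produce a mask conditioned on the prompt. This
    placeholder ignores the prompt and simply fills the requested box.
    """
    width, height = size
    left, top, right, bottom = box if box is not None else (0, 0, width, height)
    return [
        [1 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


def generate_image(mask: Mask, human_image: str, object_image: str,
                   background: Optional[str] = None) -> GeneratedImage:
    """Stage 2 stand-in (MGIG).

    A real model would extract identity features from both input images and
    synthesize pixels guided by the mask. Here the "features" are just labels.
    """
    return GeneratedImage(
        mask=mask,
        human_identity=f"id({human_image})",
        object_identity=f"id({object_image})",
        background=background,
    )


def interact_custom_sketch(prompt: str, human_image: str, object_image: str,
                           size: Tuple[int, int] = (8, 8),
                           box: Optional[Box] = None,
                           background: Optional[str] = None) -> GeneratedImage:
    """Chain the two stages: the text prompt shapes the mask, the mask guides the image."""
    mask = generate_interaction_mask(prompt, size, box)
    return generate_image(mask, human_image, object_image, background)


result = interact_custom_sketch(
    "a person feeding a dog", "person.png", "dog.png", box=(2, 2, 6, 6)
)
```

The point of the split is that layout and appearance are decided separately: the first stage only answers where and in what pose the subjects appear, and the second stage only has to render the given identities into that layout.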

Demonstrated Effectiveness

Extensive experiments were conducted using specially designed metrics to evaluate Interact-Custom’s performance. The results show significant improvements over existing compositional customization and interaction control approaches. For instance, Interact-Custom achieved the highest scores in preserving both human and object identities, indicating that the generated images closely resemble the original subjects. It also demonstrated superior ability in controlling interaction semantics, accurately depicting the desired actions and spatial relationships between humans and objects.
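The article does not name the metrics used. As an illustration only, identity preservation is often approximated by comparing feature embeddings of the reference subject and the generated subject with cosine similarity. The sketch below assumes such embeddings already exist; the vectors are made-up numbers, and this is not claimed to be the metric used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: a reference subject, a faithful generation,
# and a generation whose appearance has drifted.
reference = [0.9, 0.1, 0.3]
faithful = [0.8, 0.2, 0.3]
drifted = [0.1, 0.9, 0.2]

faithful_score = cosine_similarity(reference, faithful)
drifted_score = cosine_similarity(reference, drifted)
```

Under this kind of score, a higher value for the faithful generation than for the drifted one would indicate better identity preservation.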

A user study further validated these quantitative findings, with participants consistently rating Interact-Custom’s generated images higher across all metrics, including human and object appearance, background quality, and the accuracy of interaction semantics. The model’s ability to seamlessly integrate interacting subjects into specified backgrounds and locations was also highly praised.

Looking Ahead

The introduction of the CHOI task and the Interact-Custom model marks a significant step forward in generative AI. By providing a robust framework for customized human-object interaction image generation, this research opens doors for more realistic and controllable content creation in various applications, from advertising to virtual reality. While there’s always room for further refinement, Interact-Custom sets a new benchmark for generating images that not only look real but also accurately convey complex human-object interactions.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
