TLDR: This paper introduces a tool that lets artists visualize attention maps in video diffusion models, built on the open-source Wan model. By showing how the words in a text prompt influence regions of the generated video, it offers an interpretable window into the model’s generation process. The authors argue that artists who understand these internal mechanisms can move beyond traditional prompt engineering and intervene directly in generation, a practice termed “network bending.”
Artists are increasingly not just users of AI tools but active explorers of their inner workings. A recent paper, “Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts,” examines how generative AI models create video content and how artists can use that understanding for creative expression.
The research, conducted by Adam Cole and Mick Grierson from the University of the Arts London, draws inspiration from early video artists who manipulated analog signals to craft novel visual aesthetics. Today, with the rise of sophisticated AI video models, Cole and Grierson ask whether a similar approach can be applied, one that uses technical insight to expand the creative possibilities of these new digital systems.
At the heart of their investigation are “attention maps” within video diffusion transformers. These maps are essentially a window into the AI’s “thought process,” revealing which parts of a text prompt (like a word or phrase) influence specific regions of the generated video over time. Imagine telling an AI to create a video of “a cat playing with a soccer ball,” and then being able to see exactly how the word “cat” directs the AI to form the feline, and “soccer ball” guides the creation of the ball. This level of transparency allows artists to “see what the model sees.”
The researchers built a specialized tool based on the open-source Wan video model. This tool has two main components: extraction and visualization. During video generation, it intercepts and stores the cross-attention computations. The stored values are then reshaped and upscaled to match the video’s dimensions and rendered as heatmaps in which brighter colors indicate stronger attention. Users can examine attention at several levels, from individual attention heads and model layers to overall averages across the generation process.
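The extraction and visualization steps described above can be illustrated with a small, self-contained sketch. It is not the authors' code and does not use the Wan model's API: every function name here is hypothetical, the frame-major layout of latent positions is an assumption, and nearest-neighbor upscaling stands in for whatever interpolation the actual tool uses. It only shows the general idea: compute softmax cross-attention weights between video positions and text tokens, pick one token's weights, reshape them into a frame grid, average across heads, and scale the result up for display.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_weights(queries, keys):
    """queries: one vector per video latent position (assumed layout).
    keys: one vector per text token.
    Returns weights[position][token]; each row sums to 1."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]

def token_map(weights, token_index, frames, height, width):
    """Select one token's weights and reshape them to [frame][row][col].
    Assumes positions are stored frame by frame, row by row."""
    flat = [row[token_index] for row in weights]
    if len(flat) != frames * height * width:
        raise ValueError("position count does not match frames*height*width")
    grid = []
    for f in range(frames):
        frame = []
        for r in range(height):
            start = (f * height + r) * width
            frame.append(flat[start:start + width])
        grid.append(frame)
    return grid

def average_maps(maps):
    """Element-wise mean of several [frame][row][col] maps (e.g. one per head)."""
    n = len(maps)
    return [
        [
            [sum(m[f][r][c] for m in maps) / n for c in range(len(maps[0][f][r]))]
            for r in range(len(maps[0][f]))
        ]
        for f in range(len(maps[0]))
    ]

def normalize(frame):
    """Rescale one frame to the range 0..1 so it can be shown as a heatmap."""
    lo = min(min(row) for row in frame)
    hi = max(max(row) for row in frame)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in frame]

def upscale_nearest(frame, factor):
    """Nearest-neighbor upscaling: repeat every cell factor x factor times."""
    return [
        [v for v in row for _ in range(factor)]
        for row in frame
        for _ in range(factor)
    ]

# Toy data: one frame with a 2x2 latent grid and two text tokens.
# Only the first position resembles token 0 (say, "cat").
queries = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
keys = [[4.0, 0.0], [0.0, 4.0]]

weights = cross_attention_weights(queries, keys)
head_a = token_map(weights, 0, frames=1, height=2, width=2)
head_b = token_map(weights, 0, frames=1, height=2, width=2)  # stand-in for a second head
averaged = average_maps([head_a, head_b])
heat = upscale_nearest(normalize(averaged[0]), factor=2)

for row in heat:
    print(row)
```

In this toy run, the upscaled heatmap is brightest in its top-left 2x2 block, the region whose query vector matches token 0. In a real pipeline the queries and keys would come from the model's cross-attention layers during generation, and the normalized values would be passed to a colormap rather than printed.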
Through a series of “exploratory probes,” the team checked that their visualization method behaves as expected. For instance, when prompted with “a cat,” the attention maps clearly highlighted the cat’s region in the video. In a more complex scenario involving “cat,” “soccer ball,” and “Eiffel Tower,” each token’s attention localized accurately on its corresponding object. For a more abstract concept like a “classic Hollywood kiss,” the “kiss” token’s attention maps clustered meaningfully around the subjects’ lips, showing that nuanced ideas also manifest within the model’s internal space.
A key artistic outcome of this research is the video study titled “Attention of a Kiss.” This piece visualizes the evolving attention map of the “kiss” token throughout the video generation timeline. It begins abstractly and gradually gains structure, mirroring both the AI’s diffusion process and the development of emotional intimacy. This metaphorical alignment suggests new narrative forms that are deeply rooted in the mechanics of AI.
For artists, visualizing these attention maps offers a powerful way to understand how their textual prompts translate into visual outcomes. It helps them develop an intuitive grasp of how language is interpreted by the AI, identify recurring visual motifs, and ultimately craft prompts with greater intentionality. This feedback loop between creative intent and model behavior opens up new avenues for deliberate experimentation.
While attention maps provide valuable insights, the paper also acknowledges limitations. Some maps can be noisy or inconsistent, especially for abstract prompts, and with longer prompts, multiple tokens can overlap, making isolation difficult. Furthermore, generating and analyzing these maps is resource-intensive. Future work aims to streamline this process and develop higher-level visualizations.
The paper concludes by drawing a parallel to early video artists who “network bent” analog systems. Today, artists can similarly engage in “network bending” by exploring the internal logic of AI models. By treating the neural network itself as a malleable medium, artists can move beyond mere prompt engineering to creatively intervene in the generation process, producing outputs that transcend the model’s intended domain. This approach extends the legacy of experimental video art into the realm of generative AI, where the artwork emerges not just from what is seen, but from how the network sees.


