Unveiling AI's Creative Gaze: How Artists Explore Video Diffusion's Inner Workings

TLDR: This paper introduces a tool that lets artists visualize attention maps in video diffusion models, built on the open-source Wan model. By showing which regions of a generated video each part of a text prompt influences, it offers an interpretable window into the model’s generation process. The authors argue that artists can use this insight to move beyond prompt engineering and intervene directly in generation, a practice termed “network bending,” to create new forms of video art.

Artists are increasingly not just users of AI tools but active explorers of their inner workings. A recent paper, “Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts,” examines how generative AI models create video content and how artists can use that understanding for creative expression.

The research, conducted by Adam Cole and Mick Grierson from the University of the Arts London, draws inspiration from early video artists who manipulated analog signals to craft novel visual aesthetics. Today, with the rise of sophisticated AI video models, Cole and Grierson ask if a similar approach can be applied – one that uses technical insight to expand the creative possibilities of these new digital systems.

At the heart of their investigation are “attention maps” within video diffusion transformers. These maps are essentially a window into the AI’s “thought process,” revealing which parts of a text prompt (like a word or phrase) influence specific regions of the generated video over time. Imagine telling an AI to create a video of “a cat playing with a soccer ball,” and then being able to see exactly how the word “cat” directs the AI to form the feline, and “soccer ball” guides the creation of the ball. This level of transparency allows artists to “see what the model sees.”
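To make the idea concrete, here is a minimal sketch of how such a map can arise, assuming standard scaled dot-product cross-attention in which video positions act as queries and prompt tokens act as keys. The function name, the toy dimensions, and the random vectors are illustrative assumptions, not details of the Wan model or of the authors' tool.

```python
import numpy as np

def cross_attention_weights(video_queries, text_keys):
    """Return softmax(Q K^T / sqrt(d)), shape (num_video_positions, num_text_tokens)."""
    d = video_queries.shape[-1]
    scores = video_queries @ text_keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy setup (assumed sizes): 2 frames of a 3x4 latent grid, 8-dimensional features.
rng = np.random.default_rng(0)
frames, height, width, dim = 2, 3, 4, 8
prompt_tokens = ["a", "cat", "playing"]
video_queries = rng.normal(size=(frames * height * width, dim))
text_keys = rng.normal(size=(len(prompt_tokens), dim))

weights = cross_attention_weights(video_queries, text_keys)

# The column for "cat" says how strongly each video position attends to that token.
cat_map = weights[:, prompt_tokens.index("cat")].reshape(frames, height, width)
```

Each row of `weights` sums to 1, so a position that attends strongly to “cat” necessarily attends less to the other tokens. Reshaping one token's column back onto the frame grid is what turns the numbers into a map.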

The researchers built a specialized tool based on the open-source Wan video model. The tool has two main components: extraction and visualization. During video generation, it intercepts and stores the cross-attention computations. The raw values are then reshaped and upscaled to match the video’s dimensions and rendered as heatmaps in which brighter colors indicate stronger attention. Users can examine attention at several levels, from individual attention heads and model layers to averages across the whole generation process.
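The visualization step can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the stored attention has the layout (layers, heads, video positions, prompt tokens), and it uses nearest-neighbor upscaling and min-max normalization as simple stand-ins for whatever the tool actually does. The name `token_heatmap` is hypothetical.

```python
import numpy as np

def token_heatmap(stored, token_index, grid, scale):
    """Turn stored attention into a per-frame heatmap for one prompt token.

    stored: array of shape (layers, heads, positions, tokens)  [assumed layout]
    grid:   (frames, height, width) of the latent video, frames*height*width == positions
    scale:  integer upscaling factor toward the output video resolution
    """
    frames, height, width = grid
    # Average over layers and heads to get one value per video position.
    per_position = stored[..., token_index].mean(axis=(0, 1))
    latent_map = per_position.reshape(frames, height, width)
    # Nearest-neighbor upscaling along the two spatial axes.
    upscaled = latent_map.repeat(scale, axis=1).repeat(scale, axis=2)
    # Normalize to [0, 1] so brighter means stronger attention.
    lo, hi = upscaled.min(), upscaled.max()
    return (upscaled - lo) / (hi - lo) if hi > lo else np.zeros_like(upscaled)

# Random values standing in for attention captured during generation.
rng = np.random.default_rng(0)
stored = rng.random((4, 2, 2 * 3 * 4, 5))  # 4 layers, 2 heads, 24 positions, 5 tokens
heatmap = token_heatmap(stored, token_index=1, grid=(2, 3, 4), scale=8)
```

To inspect a single head or layer instead of the average, one could index `stored[layer, head]` before reshaping. Each frame of `heatmap` can then be drawn with any plotting library that maps values to colors.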

Through a series of “exploratory probes,” the team confirmed the effectiveness of their visualization method. For instance, when prompted with “a cat,” the attention maps clearly highlighted the cat’s region in the video. In a more complex scenario involving “cat,” “soccer ball,” and “Eiffel Tower,” each token’s attention localized accurately on its corresponding object. Even for abstract concepts like a “classic Hollywood kiss,” the “kiss” token’s attention maps clustered meaningfully around the subjects’ lips, demonstrating how nuanced ideas manifest within the model’s internal space.
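One simple way to check this kind of localization is to find where each token's map peaks and compare that with where the object appears in the frame. The sketch below uses small synthetic maps in place of real extracted attention; the peak positions and the helper name `peak_location` are invented for illustration and do not come from the paper.

```python
import numpy as np

def peak_location(token_map):
    """Return (row, col) of the strongest attention value in a 2D map."""
    flat_index = np.argmax(token_map)
    return tuple(int(i) for i in np.unravel_index(flat_index, token_map.shape))

# Synthetic single-frame maps: a faint uniform background plus one bright spot per token.
height, width = 6, 8
maps = {name: np.full((height, width), 0.1) for name in ("cat", "soccer ball", "Eiffel Tower")}
maps["cat"][4, 1] = 1.0
maps["soccer ball"][5, 4] = 1.0
maps["Eiffel Tower"][1, 6] = 1.0

peaks = {name: peak_location(token_map) for name, token_map in maps.items()}
```

With real attention maps the peak is only a rough indicator, since attention for one token is usually spread over a region rather than concentrated in a single cell.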

A key artistic outcome of this research is the video study titled “Attention of a Kiss.” This piece visualizes the evolving attention map of the “kiss” token throughout the video generation timeline. It begins abstractly and gradually gains structure, mirroring both the AI’s diffusion process and the development of emotional intimacy. This metaphorical alignment suggests new narrative forms that are deeply rooted in the mechanics of AI.

For artists, visualizing these attention maps offers a powerful way to understand how their textual prompts translate into visual outcomes. It helps them develop an intuitive grasp of how language is interpreted by the AI, identify recurring visual motifs, and ultimately craft prompts with greater intentionality. This feedback loop between creative intent and model behavior opens up new avenues for deliberate experimentation.

While attention maps provide valuable insights, the paper also acknowledges limitations. Some maps can be noisy or inconsistent, especially for abstract prompts, and with longer prompts, multiple tokens can overlap, making isolation difficult. Furthermore, generating and analyzing these maps is resource-intensive. Future work aims to streamline this process and develop higher-level visualizations.

The paper concludes by drawing a parallel to early video artists who “network bent” analog systems. Today, artists can similarly engage in “network bending” by exploring the internal logic of AI models. By treating the neural network itself as a malleable medium, artists can move beyond mere prompt engineering to creatively intervene in the generation process, producing outputs that transcend the model’s intended domain. This approach extends the legacy of experimental video art into the realm of generative AI, where the artwork emerges not just from what is seen, but from how the network sees.
