TLDR: This paper introduces Group Intention Forecasting (GIF), a novel task to predict when a group will achieve a shared goal, such as a basketball team shooting. It presents SHOT, the first large-scale dataset for GIF, featuring 1,979 basketball video clips from five camera views with detailed individual and group annotations. Additionally, the paper proposes GIFT (Group Intention ForecasTer), a framework utilizing spatio-temporal graph convolutional networks to analyze player interactions and forecast group intention emergence earlier and more accurately than previous methods, laying a foundation for future research in collective behavior prediction.
Understanding what individuals intend to do has been a significant area of research in artificial intelligence, with applications ranging from human-computer interaction to security systems. However, the complexities of collective intentions within groups have largely been overlooked. This narrow focus limits real-world applicability, where group-level dynamics and multi-agent interactions are crucial for a complete understanding.
To bridge this gap, researchers have introduced a novel concept: group intention. This refers to shared goals that emerge through the coordinated actions of multiple individuals. Building on this, they propose a new task called Group Intention Forecasting (GIF). The goal of GIF is to predict when these group intentions will occur by analyzing individual actions and interactions at an early stage, before the collective goal becomes fully apparent.
Consider a basketball game: individual players might be running, defending, or holding the ball. Each action reflects an individual intention. However, their coordinated efforts to score a point embody the team’s group intention. For the defending team, anticipating when the offense will shoot is critical for disrupting the play. GIF aims to provide this early foresight, enabling timely decisions in domains such as sports strategy, public safety, and intelligent systems.
Introducing the SHOT Dataset
A major challenge in developing GIF has been the lack of suitable datasets. Existing datasets for intention recognition focus on individuals, while group activity datasets emphasize explicit, already-happening actions rather than emerging intentions. To address this, the researchers developed SHOT, the first large-scale dataset specifically designed for GIF. SHOT comprises 1,979 basketball video clips, captured from five different camera views, and is extensively annotated with six types of individual attributes.
SHOT’s design incorporates three key characteristics essential for studying emerging group intentions:
- Multi-Individual Information: It captures fine-grained cues such as bounding boxes, poses, gaze directions, head orientations, roles, and velocities for early-stage analysis. These detailed features are crucial because group intentions are not fully revealed in their initial stages.
- Multi-View Adaptability: Multiple camera views help overcome occlusion issues common in single-view datasets, ensuring that critical player actions are observable from different angles. This leads to more robust and accurate analysis of player intentions.
- Multi-Level Intention: The dataset captures both individual behaviors (like running or holding) and their coordination towards a group goal (like shooting), represented through detailed role annotations.
The SHOT dataset totals over two hours of footage and includes approximately 2.1 million player positions, pose instances, and velocity records, along with millions of other annotations. This scale gives researchers a broad basis for studying complex group dynamics.
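To make the six attribute types listed above concrete, here is a minimal sketch of how one annotated frame might be represented in Python. The class names, field names, and value shapes (for example, gaze as a 2D vector) are assumptions made for illustration; the article does not describe SHOT's actual file format or schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PlayerAnnotation:
    """One player in one frame. Field names are hypothetical, not SHOT's schema."""
    bbox: Tuple[float, float, float, float]  # assumed (x, y, width, height) in pixels
    pose: List[Tuple[float, float]]          # assumed list of 2D keypoints
    gaze: Tuple[float, float]                # assumed 2D gaze direction vector
    head_orientation: Tuple[float, float]    # assumed 2D head direction vector
    role: str                                # e.g. "running", "holding", "shooting"
    velocity: Tuple[float, float]            # assumed displacement per frame


@dataclass
class FrameAnnotation:
    """All annotated players seen in one frame from one camera view."""
    frame_index: int
    view_id: int  # which of the five camera views (numbering is an assumption)
    players: List[PlayerAnnotation]


def roles_in_frame(frame: FrameAnnotation) -> List[str]:
    """Collect the individual-level roles present in a frame."""
    return [player.role for player in frame.players]
```

Keeping the role next to the low-level cues mirrors the multi-level idea: individual behavior is stored per player, and the group-level signal is read off the set of roles in a frame.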
The GIFT Method for Forecasting
To effectively forecast the timing of group intentions, the researchers also propose GIFT (Group Intention ForecasTer), a novel framework. GIFT employs an encoder-decoder architecture based on Spatio-Temporal Graph Convolutional Networks (STGCNs). This framework is designed to extract heterogeneous player features from observed video frames and model the evolving group dynamics.
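For readers unfamiliar with spatio-temporal graph convolutions, the following NumPy sketch shows the general idea of a single such layer: features are first mixed across players through a normalized adjacency matrix, then mixed across neighboring frames. This is a generic textbook-style layer, not GIFT's actual architecture; the function names, tensor layout, and the fixed temporal kernel are all assumptions.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a player graph."""
    a = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def st_graph_conv(x: np.ndarray, adj: np.ndarray,
                  w_spatial: np.ndarray, w_temporal: np.ndarray) -> np.ndarray:
    """One illustrative spatio-temporal graph convolution.

    x:          (T, N, C_in)  features for N players over T frames
    adj:        (N, N)        player interaction graph (hypothetical)
    w_spatial:  (C_in, C_out) feature projection
    w_temporal: (K,)          temporal kernel with odd length K
    """
    a_hat = normalize_adjacency(adj)
    # Spatial step: aggregate neighbor features per frame, project, apply ReLU.
    h = np.einsum("ij,tjc,cd->tid", a_hat, x, w_spatial)
    h = np.maximum(h, 0.0)
    # Temporal step: zero-padded 1D filter along the frame axis.
    k = w_temporal.shape[0]
    pad = k // 2
    h_pad = np.pad(h, ((pad, pad), (0, 0), (0, 0)))
    n_frames = x.shape[0]
    return sum(w_temporal[i] * h_pad[i:i + n_frames] for i in range(k))
```

In an encoder-decoder setup, layers like this would be stacked so that the encoder summarizes the observed frames and the decoder produces features for future ones; how GIFT arranges them is described in the paper itself.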
GIFT works by analyzing a sequence of early frames to forecast each player’s features in future, unseen frames. By identifying when a specific role, such as ‘shooting,’ is predicted to occur among the players, the system can determine the precise timing of the group intention. The model is trained to reconstruct observed frames and accurately predict future ones, balancing these objectives to achieve high precision.
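The timing step described above can be sketched in a few lines of Python. The example assumes the forecaster has already produced one role label per player for each future frame; `first_shooting_frame` and `combined_loss` are hypothetical helper names, and the weighted sum with `alpha` is only one plausible way to balance reconstruction against forecasting, not necessarily the one used in the paper.

```python
from typing import Optional, Sequence


def first_shooting_frame(predicted_roles: Sequence[Sequence[str]],
                         start_frame: int = 0,
                         target_role: str = "shooting") -> Optional[int]:
    """Return the first future frame in which any player has the target role.

    predicted_roles[t] holds the forecast role of every player at future step t.
    start_frame is the absolute index of the first forecast frame.
    Returns None if the target role never appears in the forecast window.
    """
    for offset, roles in enumerate(predicted_roles):
        if target_role in roles:
            return start_frame + offset
    return None


def combined_loss(recon_error: float, forecast_error: float,
                  alpha: float = 0.5) -> float:
    """Assumed weighted sum of reconstruction and forecasting errors."""
    return alpha * recon_error + (1.0 - alpha) * forecast_error


forecast = [
    ["running", "holding"],
    ["running", "holding"],
    ["running", "shooting"],
]
shot_frame = first_shooting_frame(forecast, start_frame=10)
```

Here `shot_frame` is 12, because the "shooting" role first appears at the third forecast step, counted from frame 10.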
Experimental results support both the SHOT dataset and the GIFT framework. Compared with traditional temporal action localization models, GIFT predicts shooting times more accurately, achieving a lower Mean Absolute Error (MAE). Forecasting group intentions early is inherently harder than detecting them once they are apparent, but early intervention is especially valuable in group contexts, where decision-making costs grow nonlinearly with time.
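MAE here is the average absolute gap between predicted and actual shooting times. A minimal Python sketch follows; the function name and the sample values are made up for illustration, and the unit (frames or seconds) is whatever the timestamps are expressed in.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        actual: Sequence[float]) -> float:
    """Average of |predicted - actual| over all clips."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical shooting times for three clips (not results from the paper).
example_mae = mean_absolute_error([10, 20, 30], [12, 20, 27])
```

A lower value means the forecast shooting times sit closer to the annotated ones on average.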
This research marks a significant step forward in understanding and predicting collective behaviors. By introducing the GIF task, the SHOT dataset, and the GIFT framework, the authors provide a strong foundation for future research in group intention forecasting, extending beyond individual actions to the complex, coordinated goals of groups. You can find more details about this work in the research paper.


