
BF-PIP: A Zero-Shot Approach to Pedestrian Intention Prediction Using Gemini 2.5 Pro

TLDR: BF-PIP is a new zero-shot AI framework that predicts pedestrian crossing intentions for autonomous vehicles. Unlike previous methods that rely on static images, BF-PIP uses Gemini 2.5 Pro to analyze short, continuous video clips combined with structured data like bounding boxes and vehicle speed. It achieves 73% accuracy without any additional training, outperforming existing models by better capturing subtle motion cues and contextual information.

Understanding when a pedestrian intends to cross the road is a critical challenge for autonomous vehicles navigating busy urban environments. Traditional methods often struggle, requiring extensive training data and frequent retraining to adapt to new situations. This is where a new approach, BF-PIP (Beyond Frames Pedestrian Intention Prediction), steps in, offering a zero-shot solution that promises more agile and reliable predictions.

BF-PIP, built upon Google’s Gemini 2.5 Pro, represents a significant leap forward because it processes short, continuous video clips directly, rather than relying on discrete, static image frames. This allows the system to capture subtle, continuous motion cues like hesitation, body shifts, and gaze changes, which are often missed by frame-based systems. The model also incorporates crucial contextual information, such as bounding-box annotations (which pinpoint the pedestrian’s location) and the ego-vehicle’s speed, all fed into the system via specialized multimodal prompts.

The core innovation lies in Gemini 2.5 Pro’s native ability to handle raw video input, enabling a deeper, temporally grounded understanding of pedestrian movement and the surrounding scene. The researchers formulated pedestrian crossing intention prediction as a binary classification task: determining whether a pedestrian will cross or not cross within a fixed future time horizon, specifically one second (30 frames) ahead, based on a 0.5-second (16-frame) observation window.
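The windowing setup described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the frame counts (16 observed frames, a 30-frame horizon at an assumed 30 fps) come from the article, while the function name and label convention are hypothetical.

```python
# Sketch of the clip/label windowing described above (assumed 30 fps video).
OBS_FRAMES = 16      # ~0.5 s observation window
HORIZON_FRAMES = 30  # ~1 s prediction horizon

def make_sample(frames, labels, t_end):
    """Pair an observation clip ending at frame t_end with the
    cross / not-cross label one horizon ahead."""
    t_start = t_end - OBS_FRAMES + 1
    if t_start < 0 or t_end + HORIZON_FRAMES >= len(labels):
        return None  # not enough past context or future frames
    clip = frames[t_start : t_end + 1]          # 16 consecutive frames
    target = labels[t_end + HORIZON_FRAMES]     # 1 = crossing, 0 = not crossing
    return clip, target

# Toy usage: 60 dummy frames, pedestrian starts crossing at frame 45.
frames = list(range(60))
labels = [0] * 45 + [1] * 15
clip, target = make_sample(frames, labels, t_end=20)
print(len(clip), target)  # 16 1
```

Framing the problem this way keeps it a plain binary classification: each 16-frame clip maps to a single future cross/not-cross label, which is what the model is prompted to output.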

The system accepts three main types of input: a short, continuous video clip of the pedestrian, bounding box coordinates for precise localization, and the ego-vehicle’s speed to provide contextual reasoning. This multimodal data is embedded into a carefully structured prompt given to Gemini 2.5 Pro. The prompt guides the AI, setting up its role as an autonomous vehicle observer and defining the task. It even incorporates explicit reasoning steps, like analyzing posture and movement patterns, and uses a ‘role-play prompting’ strategy to enhance the model’s contextual awareness and decision-making, mimicking human-like observation.
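A prompt of the kind described might be assembled as below. This is a hypothetical sketch for illustration only: the wording, field names, and answer format are assumptions, not the paper's actual prompt, and the video clip itself would be attached as a separate multimodal part alongside this text when calling the model.

```python
# Hypothetical role-play prompt combining the three inputs the article lists:
# the video clip (attached separately), the bounding box, and the ego speed.
def build_prompt(bbox, ego_speed_kmh):
    """Compose a structured text prompt with bounding-box and speed context."""
    return "\n".join([
        "You are the perception system of an autonomous vehicle.",
        "You will see a short continuous video clip of one pedestrian.",
        f"Pedestrian bounding box (x1, y1, x2, y2): {bbox}",
        f"Ego-vehicle speed: {ego_speed_kmh} km/h",
        "Reason step by step about the pedestrian's posture, gaze,",
        "and movement pattern, then answer with exactly one word:",
        "'cross' if the pedestrian will cross within the next second,",
        "otherwise 'not_cross'.",
    ])

prompt = build_prompt(bbox=(412, 188, 470, 330), ego_speed_kmh=28)
print(prompt)
```

The role-play framing ("You are the perception system...") mirrors the prompting strategy the researchers describe, and the explicit reasoning instruction nudges the model to attend to posture and motion cues before committing to a one-word answer.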

Evaluated on the JAADbeh dataset, a widely recognized benchmark for autonomous driving research, BF-PIP achieved an impressive 73% prediction accuracy in a zero-shot setting. This performance significantly outperforms a GPT-4V baseline by 18% and even surpasses OmniPredict, a leading MLLM-based approach, by 6%. Notably, BF-PIP achieved these results without any additional training, demonstrating its strong generalization capabilities across diverse traffic scenes.

Qualitative analysis revealed that Gemini 2.5 Pro effectively interprets complex scenes, prioritizing pedestrians closer to the roadway and focusing on subtle behavioral cues such as a forward lean, gaze direction, and decisive micro-movements onto the crosswalk. An ablation study further confirmed the importance of structured visual guidance (annotations) and ego-vehicle context (speed) in enhancing prediction accuracy.

In conclusion, BF-PIP marks a significant advancement in pedestrian intention prediction for autonomous driving. By directly analyzing continuous video streams combined with structured metadata within a prompt-driven, zero-shot framework, it reduces the need for extensive preprocessing and retraining. This capability paves the way for more efficient and safer autonomous driving operations. For more details, you can refer to the full research paper here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
