
BF-PIP: A Zero-Shot Approach to Pedestrian Intention Prediction Using Gemini 2.5 Pro

TLDR: BF-PIP is a new zero-shot AI framework that predicts pedestrian crossing intentions for autonomous vehicles. Unlike previous methods that rely on static images, BF-PIP uses Gemini 2.5 Pro to analyze short, continuous video clips combined with structured data like bounding boxes and vehicle speed. It achieves 73% accuracy without any additional training, outperforming existing models by better capturing subtle motion cues and contextual information.

Understanding when a pedestrian intends to cross the road is a critical challenge for autonomous vehicles navigating busy urban environments. Traditional methods often struggle, requiring extensive training data and frequent retraining to adapt to new situations. This is where a new approach, BF-PIP (Beyond Frames Pedestrian Intention Prediction), steps in, offering a zero-shot solution that promises more agile and reliable predictions.

BF-PIP, built upon Google’s Gemini 2.5 Pro, represents a significant leap forward because it processes short, continuous video clips directly, rather than relying on discrete, static image frames. This allows the system to capture subtle, continuous motion cues like hesitation, body shifts, and gaze changes, which are often missed by frame-based systems. The model also incorporates crucial contextual information, such as bounding-box annotations (which pinpoint the pedestrian’s location) and the ego-vehicle’s speed, all fed into the system via specialized multimodal prompts.

The core innovation lies in Gemini 2.5 Pro’s native ability to handle raw video input, enabling a deeper, temporally grounded understanding of pedestrian movement and the surrounding scene. The researchers formulated pedestrian crossing intention prediction as a binary classification task: determining whether a pedestrian will cross or not cross within a fixed future time horizon, specifically one second (30 frames) ahead, based on a 0.5-second (16-frame) observation window.
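The windowing setup described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the frame counts (16 observed frames, a 30-frame horizon at an assumed 30 fps) come from the article, while the function name and label convention are hypothetical.

```python
# Sketch of the clip/label windowing described above (assumed 30 fps video).
OBS_FRAMES = 16      # ~0.5 s observation window
HORIZON_FRAMES = 30  # ~1 s prediction horizon

def make_sample(frames, labels, t_end):
    """Pair an observation clip ending at frame t_end with the
    cross / not-cross label one horizon ahead."""
    t_start = t_end - OBS_FRAMES + 1
    if t_start < 0 or t_end + HORIZON_FRAMES >= len(labels):
        return None  # not enough past context or future frames
    clip = frames[t_start : t_end + 1]          # 16 consecutive frames
    target = labels[t_end + HORIZON_FRAMES]     # 1 = crossing, 0 = not crossing
    return clip, target

# Toy usage: 60 dummy frames, pedestrian starts crossing at frame 45.
frames = list(range(60))
labels = [0] * 45 + [1] * 15
clip, target = make_sample(frames, labels, t_end=20)
print(len(clip), target)  # 16 1
```

Framing the problem this way keeps it a plain binary classification: each 16-frame clip maps to a single future cross/not-cross label, which is what the model is prompted to output.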

The system accepts three main types of input: a short, continuous video clip of the pedestrian, bounding box coordinates for precise localization, and the ego-vehicle’s speed to provide contextual reasoning. This multimodal data is embedded into a carefully structured prompt given to Gemini 2.5 Pro. The prompt guides the AI, setting up its role as an autonomous vehicle observer and defining the task. It even incorporates explicit reasoning steps, like analyzing posture and movement patterns, and uses a ‘role-play prompting’ strategy to enhance the model’s contextual awareness and decision-making, mimicking human-like observation.
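A prompt of the kind described might be assembled as below. This is a hypothetical sketch for illustration only: the wording, field names, and answer format are assumptions, not the paper's actual prompt, and the video clip itself would be attached as a separate multimodal part alongside this text when calling the model.

```python
# Hypothetical role-play prompt combining the three inputs the article lists:
# the video clip (attached separately), the bounding box, and the ego speed.
def build_prompt(bbox, ego_speed_kmh):
    """Compose a structured text prompt with bounding-box and speed context."""
    return "\n".join([
        "You are the perception system of an autonomous vehicle.",
        "You will see a short continuous video clip of one pedestrian.",
        f"Pedestrian bounding box (x1, y1, x2, y2): {bbox}",
        f"Ego-vehicle speed: {ego_speed_kmh} km/h",
        "Reason step by step about the pedestrian's posture, gaze,",
        "and movement pattern, then answer with exactly one word:",
        "'cross' if the pedestrian will cross within the next second,",
        "otherwise 'not_cross'.",
    ])

prompt = build_prompt(bbox=(412, 188, 470, 330), ego_speed_kmh=28)
print(prompt)
```

The role-play framing ("You are the perception system...") mirrors the prompting strategy the researchers describe, and the explicit reasoning instruction nudges the model to attend to posture and motion cues before committing to a one-word answer.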

Evaluated on the JAADbeh dataset, a widely recognized benchmark for autonomous driving research, BF-PIP achieved an impressive 73% prediction accuracy in a zero-shot setting. This performance significantly outperforms a GPT-4V baseline by 18% and even surpasses OmniPredict, a leading MLLM-based approach, by 6%. Notably, BF-PIP achieved these results without any additional training, demonstrating its strong generalization capabilities across diverse traffic scenes.

Qualitative analysis revealed that Gemini 2.5 Pro effectively interprets complex scenes, prioritizing pedestrians closer to the roadway and focusing on subtle behavioral cues such as a forward lean, gaze direction, and decisive micro-movements onto the crosswalk. An ablation study further confirmed the importance of structured visual guidance (annotations) and ego-vehicle context (speed) in enhancing prediction accuracy.

In conclusion, BF-PIP marks a significant advancement in pedestrian intention prediction for autonomous driving. By directly analyzing continuous video streams combined with structured metadata within a prompt-driven, zero-shot framework, it reduces the need for extensive preprocessing and retraining. This capability paves the way for more efficient and safer autonomous driving operations. For more details, you can refer to the full research paper here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
