Improving Coordinate Prediction in Multimodal AI Through Positional Encoding Guidance

TLDR: Multimodal AI models struggle with precise coordinate prediction in high-resolution images because of weaknesses in how they process spatial information (positional encodings), leading to systematic biases. Researchers developed Vision-PE Shuffle Guidance (VPSG), a training-free method that uses ‘negative evidence’ from shuffled positional encodings to correct these biases at inference time, improving coordinate prediction accuracy on challenging benchmarks like ScreenSpot-Pro without any model retraining.

Multimodal large language models (MLLMs) have shown remarkable capabilities in understanding and generating language, especially when combined with visual information for tasks like visual question answering and document analysis. However, a significant hurdle remains: precisely predicting coordinates, such as identifying a specific point or bounding box on a screen. This challenge becomes even more pronounced with high-resolution images, which can lead to errors in how the model understands spatial relationships.

A recent research paper, titled “Mitigating Coordinate Prediction Bias from Positional Encoding Failures,” by Xingjian Tao, Yiwei Wang, Yujun Cai, Yihong Luo, and Jing Tang, delves into this problem. The authors investigate how MLLMs behave when their visual positional encodings (VPEs) – the internal signals that tell the model where things are in an image – are intentionally disrupted. Their findings reveal that these disruptions don’t cause random errors; instead, they induce predictable, directional biases in the coordinate predictions. This suggests that when clear spatial cues are missing or degraded, models tend to fall back on their own internal assumptions or ‘priors’ about where things should be.
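
As a rough illustration of this kind of disruption, one could randomly permute the position ids assigned to the visual tokens before they enter the model. The tensor layout and function below are illustrative assumptions, not the authors’ code:

```python
import torch

def shuffle_visual_pos_ids(pos_ids: torch.Tensor, seed: int) -> torch.Tensor:
    """Randomly permute per-token visual position ids (shape: (n_tokens,)).

    Each seed yields a different shuffled route; several such routes can be
    run and aggregated, as described later in this article.
    """
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(pos_ids.shape[0], generator=g)
    return pos_ids[perm]
```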

Crucially, the researchers observed similar directional error patterns in natural high-resolution datasets, indicating that the weakening of positional encodings is a major bottleneck for accurate coordinate prediction at scale. Imagine trying to click a tiny button on a very large screen – if the AI’s sense of ‘where’ is fuzzy, it will often miss, but not randomly; it might consistently miss in a particular direction.

Introducing Vision-PE Shuffle Guidance (VPSG)

To tackle this issue, the paper proposes a novel method called Vision-PE Shuffle Guidance (VPSG). This is a training-free, test-time approach, meaning it can be applied to existing, already-trained MLLMs without needing to retrain them or change their core architecture. VPSG works by leveraging the directional nature of these biases for correction.

Here’s a simplified breakdown of how VPSG operates:

  • It runs the MLLM normally to get a ‘position-conditioned’ prediction, which is the model’s best guess given all the visual information.
  • In parallel, it runs auxiliary decodings where the visual positional encodings are deliberately shuffled. This creates a ‘position-unconditioned’ reference – essentially, what the model would predict if it had no reliable spatial information.
  • VPSG then uses the discrepancy between these two outputs as ‘negative evidence.’ It guides the model’s digit prediction (for the x and y coordinates) by suppressing the tendencies that persist even when positional cues are removed. This helps to amplify the information that is consistent with correct positions.
  • A lightweight finite-state machine ensures that the coordinate output format is preserved throughout the process (see the sketches after this list).
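
To make the guidance step concrete, here is a minimal PyTorch sketch of how the position-conditioned and shuffled (position-unconditioned) log-probabilities could be combined at a single digit-decoding step. The function name, the log-mean-exp aggregation, and the exact guidance form are assumptions based on the description above, not the authors’ released code:

```python
import torch

def vpsg_digit_logprobs(logp_cond, logp_shuffled_runs, w=1.0):
    """Combine one decoding step's log-probs under VPSG-style guidance.

    logp_cond:          (vocab,) log-probs from the normal run
    logp_shuffled_runs: list of (vocab,) log-probs from runs whose visual
                        positional encodings were shuffled
    w:                  guidance strength for this digit position
    """
    # Aggregate the shuffled routes in log space (log-mean-exp) to form a
    # robust position-unconditioned reference.
    stacked = torch.stack(logp_shuffled_runs)              # (n_runs, vocab)
    logp_uncond = torch.logsumexp(stacked, dim=0) - torch.log(
        torch.tensor(float(len(logp_shuffled_runs)))
    )
    # Use the reference as negative evidence: push down digit tendencies
    # that survive without positional cues, amplifying position-consistent
    # information.
    guided = logp_cond + w * (logp_cond - logp_uncond)
    # Renormalize so the result is again a valid log-distribution.
    return guided - torch.logsumexp(guided, dim=0)
```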

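The format constraint from the last bullet can be pictured as a tiny state machine that only permits characters consistent with a coordinate string. The “(x, y)” template and state names below are illustrative assumptions; the paper only states that a lightweight FSM preserves the coordinate format:

```python
# States for decoding a coordinate shaped like "(x, y)"; the exact template
# is an assumption for illustration.
OPEN, X_DIGITS, SEP, Y_DIGITS, DONE = range(5)

def allowed(state):
    """Characters the decoder may emit in each state."""
    digits = set("0123456789")
    return {
        OPEN: {"("},
        X_DIGITS: digits | {","},   # x digits, then the separator
        SEP: {" "} | digits,        # optional space, then the first y digit
        Y_DIGITS: digits | {")"},   # y digits, then the closing paren
        DONE: set(),                # nothing more may be emitted
    }[state]

def step(state, ch):
    """Advance the FSM after emitting character ch."""
    if state == OPEN:
        return X_DIGITS
    if state == X_DIGITS and ch == ",":
        return SEP
    if state == SEP and ch.isdigit():
        return Y_DIGITS
    if state == Y_DIGITS and ch == ")":
        return DONE
    return state
```
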
Two key design choices make VPSG precise and stable. First, instead of relying on a single shuffled run, VPSG aggregates multiple shuffled routes in log space to get a more robust estimate of the position-unconditioned bias. Second, it applies a ‘position-aware coefficient schedule,’ meaning the guidance strength is adjusted as decoding proceeds: it starts strong for the most influential digit of the x-coordinate, decays for subsequent digits, resets for the first digit of the y-coordinate, and then decays again. This targeted approach prevents over-correction on less certain digits and maintains natural numeric formatting.
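
That decay-and-reset schedule can be sketched as follows; the initial strength and decay rate are illustrative values, not the paper’s hyperparameters:

```python
def coefficient_schedule(n_x_digits, n_y_digits, w0=1.0, decay=0.5):
    """Guidance weight per digit: strong on the leading digit of each
    coordinate, decaying for later digits, and resetting at the y
    coordinate. w0 and decay are illustrative, not the paper's values."""
    weights = []
    for n_digits in (n_x_digits, n_y_digits):   # reset at each coordinate
        weights += [w0 * decay ** i for i in range(n_digits)]
    return weights

# Example: 4-digit x and 4-digit y coordinates.
print(coefficient_schedule(4, 4))
# -> [1.0, 0.5, 0.25, 0.125, 1.0, 0.5, 0.25, 0.125]
```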

Experimental Validation

The effectiveness of VPSG was demonstrated through experiments on the ScreenSpot-Pro dataset, a challenging benchmark for GUI grounding that features real high-resolution desktop screenshots. The method showed reliable improvements when applied to the Qwen2.5-VL models (3B and 7B parameters). For instance, on the Qwen2.5-VL-3B model, the overall percentage of correct predictions increased from 11.6% to 13.3%, with notable gains in various text-oriented and icon-oriented categories.

These results highlight that VPSG provides consistent benefits across different model scales and interaction types, enhancing both text- and icon-based behaviors by mitigating the spurious effects that arise when positional signals are unreliable. The ablation studies further confirmed the importance of both the multi-seed aggregation and the coefficient decay components for VPSG’s success.

Conclusion

The research underscores the critical role of robust positional encoding for fine-grained spatial reasoning in MLLMs. VPSG offers a practical, plug-in solution that can significantly improve coordinate prediction accuracy without requiring any changes to the model’s training or architecture. This advancement is crucial for a wide range of applications requiring precise spatial grounding, from object manipulation to GUI automation.

For more detailed information, you can read the full research paper here.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
