TLDR: A new study reveals a significant positional bias in Large Language Models (LLMs), termed the Demos’ Position in Prompt (DPP) bias: where in-context learning demonstrations are placed within a prompt drastically affects model accuracy and prediction stability. Placing demos at the start of the prompt generally yields better and more stable results, while placing them at the end of the user message can severely degrade performance. The optimal position varies by model and task, underscoring the need for careful prompt engineering.
Large Language Models (LLMs) have transformed how we approach machine learning, particularly through a powerful technique called In-Context Learning (ICL). ICL allows these models to learn new tasks with just a few examples, or ‘demonstrations,’ included directly within the prompt. This capability enables rapid adaptation without the need for extensive retraining.
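As a concrete illustration, the following sketch builds a few-shot sentiment prompt in which two demonstrations precede the actual query; the task and examples are invented for illustration:

```python
# A minimal in-context learning prompt: the task is conveyed entirely
# through demonstrations embedded in the prompt, with no retraining.
# The task and examples here are hypothetical.
demos = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
query = "The plot dragged, but the acting was superb."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in demos:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"
print(prompt)
```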
However, recent research has highlighted that ICL’s performance can be surprisingly sensitive to how these demonstrations are presented. Factors like the choice of examples and their order have been shown to influence results. Now, a new study uncovers another critical, previously unexplored factor: the position of the demonstration block within the prompt itself. This newly identified phenomenon is termed the Demos’ Position in Prompt (DPP) bias.
The research, titled “Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning” by Kwesi Cobbina and Tianyi Zhou from the University of Maryland, College Park, reveals that simply changing where demonstrations sit relative to the system prompt and user message can drastically alter an LLM’s predictions and accuracy. Imagine moving a block of examples from the beginning of your instructions to the very end of your question: this seemingly minor change can lead to significant shifts in how the model responds.
Understanding the Four Demo Positions

To systematically investigate this bias, the researchers designed an evaluation framework that considers four distinct positions for demonstrations within a prompt (a construction sketch follows the list):

- Start of System Prompt (ssp): Demos are placed at the very beginning of the system’s instructions, before any other guidance.
- End of System Prompt (esp): Demos are positioned at the end of the system’s instructions, but still before the user’s actual question.
- Start of User Message (sum): Demos are inserted at the beginning of the user’s message, just before the query text.
- End of User Message (eum): Demos are appended at the very end of the user’s message, after the query.
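To make these placements concrete, here is a minimal sketch of how the four variants might be assembled in a chat-style message format. The system, query, and demos strings are placeholders, and the paper’s exact templates may differ:

```python
# Sketch of the four demo placements using a chat-style message list.
# `system`, `query`, and `demos` are placeholder strings; the paper's
# exact prompt templates may differ.
def build_messages(position, system, query, demos):
    if position == "ssp":  # start of system prompt
        return [{"role": "system", "content": demos + "\n" + system},
                {"role": "user", "content": query}]
    if position == "esp":  # end of system prompt
        return [{"role": "system", "content": system + "\n" + demos},
                {"role": "user", "content": query}]
    if position == "sum":  # start of user message
        return [{"role": "system", "content": system},
                {"role": "user", "content": demos + "\n" + query}]
    if position == "eum":  # end of user message
        return [{"role": "system", "content": system},
                {"role": "user", "content": query + "\n" + demos}]
    raise ValueError(f"unknown position: {position}")
```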
The study introduced two new metrics, Accuracy-Change and Prediction-Change, to precisely measure the gains in correctness and the volatility of outputs caused by these positional shifts. Extensive experiments were conducted across ten LLMs from four popular open-source families (QWEN, LLAMA 3, MISTRAL, COHERE) and various tasks, including classification, question answering, summarization, and reasoning.
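While the paper defines these metrics precisely, a minimal sketch of the idea, assuming each placement is scored against the predictions from a baseline configuration, might look like this:

```python
# Minimal sketch of the two metrics, assuming both are computed relative
# to a baseline configuration's predictions. The paper's exact baseline
# and normalization may differ.
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def accuracy_change(baseline_preds, new_preds, labels):
    # Net gain (or loss) in correctness from moving the demos.
    return accuracy(new_preds, labels) - accuracy(baseline_preds, labels)

def prediction_change(baseline_preds, new_preds):
    # Volatility: the fraction of answers that flip, correct or not.
    return sum(p != q for p, q in zip(baseline_preds, new_preds)) / len(baseline_preds)
```

Under this reading, a placement can be highly volatile without being helpful: a large prediction_change paired with an accuracy_change near zero, which is exactly the failure mode reported below for end-of-user-message demos.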
Key Findings on Positional Bias
The results consistently showed that placing demonstrations at the beginning of the prompt (ssp or esp) generally leads to more stable and accurate outputs, sometimes yielding accuracy gains of up to +6 points. In stark contrast, placing demonstrations at the end of the user message (eum) proved to be highly detrimental, flipping over 30% of predictions in some QA tasks without improving correctness. This position often led to significant performance degradation.
Smaller LLMs were found to be most susceptible to this positional sensitivity: their performance is more heavily impacted by where demos are placed. Larger models were more robust, but still showed a measurable, if smaller, sensitivity on more complex tasks.
A crucial insight from the study is that there is no single, universally optimal position for demonstrations. The best placement can vary significantly depending on the specific model architecture and the type of task being performed. For instance, while early positions often dominated for classification, some larger models on arithmetic or summarization tasks occasionally benefited from demos placed closer to the user query.
The researchers hypothesize that this DPP bias stems from two main factors: the architecture of causal-decoder LLMs, in which every later token can attend to earlier ones, so content at the start of the prompt can influence the entire generation; and the positional regularities present in the instruction-tuning datasets used to train these models. In other words, LLMs may have learned a preference for certain demo placements during training.
Looking ahead, the paper suggests potential mitigation strategies for this bias. These include test-time calibration, where the optimal demo position is dynamically selected for each input, and post-training on datasets where demonstration positions are randomly permuted. Such approaches could help LLMs develop more position-invariant representations, leading to more reliable ICL performance.
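As a sketch of what the first idea could look like in practice, one might try every placement for a given input and keep the answer the model scores as most likely. The confidence criterion and the generate_with_logprob helper are assumptions here; the paper proposes dynamic position selection without prescribing an implementation:

```python
# Hypothetical test-time calibration: try every demo placement and keep
# the answer the model assigns the highest log-probability. Reuses
# build_messages from the earlier sketch; `generate_with_logprob` is a
# stand-in for whatever inference API is available.
POSITIONS = ["ssp", "esp", "sum", "eum"]

def calibrated_answer(model, system, query, demos):
    best_answer, best_score = None, float("-inf")
    for pos in POSITIONS:
        messages = build_messages(pos, system, query, demos)
        answer, logprob = generate_with_logprob(model, messages)  # hypothetical helper
        if logprob > best_score:
            best_answer, best_score = answer, logprob
    return best_answer
```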
In conclusion, this research highlights that prompt formatting is not merely a stylistic choice but a functionally critical aspect of interacting with LLMs. For anyone working with instruction-tuned LLMs, it underscores the importance of explicitly evaluating demonstration placement rather than relying on default settings. Understanding and addressing DPP bias is essential for designing more robust and dependable in-context learning systems.