TLDR: A new study reveals a significant positional bias in Large Language Models (LLMs), termed the Demos’ Position in Prompt (DPP) bias: where in-context learning demonstrations are placed within a prompt drastically affects model accuracy and prediction stability. Placing demos at the start of the prompt generally yields better and more stable results, while placing them at the end of the user message can severely degrade performance. The optimal position varies by model and task, underscoring the need for careful prompt engineering.
Large Language Models (LLMs) have transformed how we approach machine learning, particularly through a powerful technique called In-Context Learning (ICL). ICL allows these models to learn new tasks with just a few examples, or ‘demonstrations,’ included directly within the prompt. This capability enables rapid adaptation without the need for extensive retraining.
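As a concrete illustration, the following sketch builds a few-shot sentiment prompt in which two demonstrations precede the actual query; the task and examples are invented for illustration:

```python
# A minimal in-context learning prompt: the task is conveyed entirely
# through demonstrations embedded in the prompt, with no retraining.
# The task and examples here are hypothetical.
demos = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
query = "The plot dragged, but the acting was superb."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in demos:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"
print(prompt)
```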
However, recent research has highlighted that ICL’s performance can be surprisingly sensitive to how these demonstrations are presented. Factors like the choice of examples and their order have been shown to influence results. Now, a new study uncovers another critical, previously unexplored factor: the position of the demonstration block within the prompt itself. This newly identified phenomenon is termed the Demos’ Position in Prompt (DPP) bias.
The research, titled “Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning” by Kwesi Cobbina and Tianyi Zhou from the University of Maryland, College Park, reveals that simply changing where demonstrations sit relative to the system prompt and user message can drastically alter an LLM’s predictions and accuracy. Imagine moving a block of examples from the beginning of your instructions to the very end of your question: this seemingly minor change can lead to significant shifts in how the model responds.
Understanding the Four Demo Positions

To systematically investigate this bias, the researchers designed an evaluation framework that considers four distinct positions for demonstrations within a prompt (a construction sketch follows the list):

- Start of System Prompt (ssp): Demos are placed at the very beginning of the system’s instructions, before any other guidance.
- End of System Prompt (esp): Demos are positioned at the end of the system’s instructions, but still before the user’s actual question.
- Start of User Message (sum): Demos are inserted at the beginning of the user’s message, just before the query text.
- End of User Message (eum): Demos are appended at the very end of the user’s message, after the query.
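To make these placements concrete, here is a minimal sketch of how the four variants might be assembled in a chat-style message format. The system, query, and demos strings are placeholders, and the paper’s exact templates may differ:

```python
# Sketch of the four demo placements using a chat-style message list.
# `system`, `query`, and `demos` are placeholder strings; the paper's
# exact prompt templates may differ.
def build_messages(position, system, query, demos):
    if position == "ssp":  # start of system prompt
        return [{"role": "system", "content": demos + "\n" + system},
                {"role": "user", "content": query}]
    if position == "esp":  # end of system prompt
        return [{"role": "system", "content": system + "\n" + demos},
                {"role": "user", "content": query}]
    if position == "sum":  # start of user message
        return [{"role": "system", "content": system},
                {"role": "user", "content": demos + "\n" + query}]
    if position == "eum":  # end of user message
        return [{"role": "system", "content": system},
                {"role": "user", "content": query + "\n" + demos}]
    raise ValueError(f"unknown position: {position}")
```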
The study introduced two new metrics, Accuracy-Change and Prediction-Change, to precisely measure the gains in correctness and the volatility of outputs caused by these positional shifts. Extensive experiments were conducted across ten LLMs from four popular open-source families (QWEN, LLAMA 3, MISTRAL, COHERE) and various tasks, including classification, question answering, summarization, and reasoning.
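While the paper defines these metrics precisely, a minimal sketch of the idea, assuming each placement is scored against the predictions from a baseline configuration, might look like this:

```python
# Minimal sketch of the two metrics, assuming both are computed relative
# to a baseline configuration's predictions. The paper's exact baseline
# and normalization may differ.
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def accuracy_change(baseline_preds, new_preds, labels):
    # Net gain (or loss) in correctness from moving the demos.
    return accuracy(new_preds, labels) - accuracy(baseline_preds, labels)

def prediction_change(baseline_preds, new_preds):
    # Volatility: the fraction of answers that flip, correct or not.
    return sum(p != q for p, q in zip(baseline_preds, new_preds)) / len(baseline_preds)
```

Under this reading, a placement can be highly volatile without being helpful: a large prediction_change paired with an accuracy_change near zero, which is exactly the failure mode reported below for end-of-user-message demos.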
Key Findings on Positional Bias
The results consistently showed that placing demonstrations at the beginning of the prompt (ssp or esp) generally leads to more stable and accurate outputs, sometimes yielding accuracy gains of up to +6 points. In stark contrast, placing demonstrations at the end of the user message (eum) proved to be highly detrimental, flipping over 30% of predictions in some QA tasks without improving correctness. This position often led to significant performance degradation.
Smaller LLMs were found to be most susceptible to this positional sensitivity: their performance is more heavily impacted by where demos are placed. Larger models were more robust, but still showed a measurable, if smaller, sensitivity on more complex tasks.
A crucial insight from the study is that there is no single, universally optimal position for demonstrations. The best placement can vary significantly depending on the specific model architecture and the type of task being performed. For instance, while early positions often dominated for classification, some larger models on arithmetic or summarization tasks occasionally benefited from demos placed closer to the user query.
The researchers hypothesize that this DPP bias stems from two main factors: the architecture of causal-decoder LLMs, in which every later token can attend to earlier ones, so content at the start of the prompt can influence the entire generation; and the positional regularities present in the instruction-tuning datasets used to train these models. In other words, LLMs may have learned a preference for certain demo placements during training.
Looking ahead, the paper suggests potential mitigation strategies for this bias. These include test-time calibration, where the optimal demo position is dynamically selected for each input, and post-training on datasets where demonstration positions are randomly permuted. Such approaches could help LLMs develop more position-invariant representations, leading to more reliable ICL performance.
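As a sketch of what the first idea could look like in practice, one might try every placement for a given input and keep the answer the model scores as most likely. The confidence criterion and the generate_with_logprob helper are assumptions here; the paper proposes dynamic position selection without prescribing an implementation:

```python
# Hypothetical test-time calibration: try every demo placement and keep
# the answer the model assigns the highest log-probability. Reuses
# build_messages from the earlier sketch; `generate_with_logprob` is a
# stand-in for whatever inference API is available.
POSITIONS = ["ssp", "esp", "sum", "eum"]

def calibrated_answer(model, system, query, demos):
    best_answer, best_score = None, float("-inf")
    for pos in POSITIONS:
        messages = build_messages(pos, system, query, demos)
        answer, logprob = generate_with_logprob(model, messages)  # hypothetical helper
        if logprob > best_score:
            best_answer, best_score = answer, logprob
    return best_answer
```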
In conclusion, this research highlights that prompt formatting is not merely a stylistic choice but a functionally critical aspect of interacting with LLMs. For anyone working with instruction-tuned LLMs, it underscores the importance of explicitly evaluating demonstration placement rather than relying on default settings. Understanding and addressing DPP bias is essential for designing more robust and dependable in-context learning systems.