TLDR: VLM-SlideEval is a new framework to test how well Vision-Language Models (VLMs) understand presentation slides. It evaluates them on extracting elements, handling changes (perturbations), and understanding the story flow across multiple slides. The study found that while newer VLMs are better at extracting content from single slides, all models struggle with precise details like pixel-accurate styling and understanding the narrative order of a full presentation. This highlights the need for more refined AI evaluators for slide generation and analysis tools.
Vision-language models, or VLMs, are becoming increasingly important for evaluating various types of digital content, including presentation slides. However, how well these models actually understand the nuances of presentation slides has been largely unexplored, despite their growing role in automated content creation and critique.
A new research framework, VLM-SlideEval, aims to shed light on this very question. Developed by Hyeonsu B. Kang, Emily Bao, and Anjan Goswami from PowerPoint AI at Microsoft Inc., this framework systematically evaluates VLMs across three critical dimensions:
Understanding Slide Elements
The first dimension focuses on the VLM’s ability to accurately extract individual elements from slide images. This includes identifying text, shapes, and images, along with their precise positions, sizes, and styles. The extracted elements are then compared against a known ‘ground truth’ derived from PowerPoint’s underlying data.
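As a rough illustration of what comparing extracted elements against ground truth can look like, the sketch below matches predicted elements to ground-truth elements by bounding-box overlap and reports the fraction recovered. The `Element` fields, the greedy matching, and the 0.5 overlap threshold are all assumptions made for this example; the article does not describe the paper's actual schema or scoring rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical slide element; field names are illustrative only."""
    kind: str          # e.g. "text", "shape", "image"
    x: float           # left edge
    y: float           # top edge
    w: float           # width
    h: float           # height
    text: str = ""


def iou(a: Element, b: Element) -> float:
    """Intersection-over-union of two axis-aligned bounding boxes."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def match_elements(truth, predicted, threshold=0.5):
    """Greedily pair each ground-truth element with its best-overlapping
    unused prediction of the same kind (assumed matching rule)."""
    unused = list(predicted)
    matches = []
    for t in truth:
        candidates = [p for p in unused if p.kind == t.kind]
        best = max(candidates, key=lambda p: iou(t, p), default=None)
        if best is not None and iou(t, best) >= threshold:
            matches.append((t, best))
            unused.remove(best)
    return matches


def extraction_recall(truth, predicted, threshold=0.5):
    """Fraction of ground-truth elements that found a match."""
    if not truth:
        return 1.0
    return len(match_elements(truth, predicted, threshold)) / len(truth)


truth = [Element("text", 0, 0, 10, 10, "Title"), Element("image", 20, 20, 10, 10)]
predicted = [Element("text", 1, 1, 10, 10, "Title"), Element("image", 50, 50, 10, 10)]
recall = extraction_recall(truth, predicted)
```

Here the text box is recovered with a small offset, while the image is placed far from its true position, so only one of the two ground-truth elements counts as matched. A stricter threshold would move the check closer to the pixel-accurate standard the article mentions.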
Robustness to Changes
The second dimension tests how robust VLMs are to controlled alterations or ‘perturbations’ in slide design. Imagine slight changes to an element’s position (geometry), its visual style (like font or color), or even the text content itself. VLM-SlideEval introduces these changes systematically to see if the models can still accurately interpret the slide despite the modifications.
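A controlled perturbation of this kind can be sketched as changing exactly one field of an element and then checking whether a model's extraction changes in that field and nowhere else. The dictionary keys (`x`, `font_color`, `text`), the axis names, and the `detected` check are hypothetical choices for this example, not the framework's actual interface.

```python
import copy

# Assumed mapping from perturbation axis to the single field it alters.
FIELD_BY_AXIS = {"geometry": "x", "style": "font_color", "text": "text"}


def perturb(element: dict, axis: str, value):
    """Return a modified copy of the element and the name of the changed field.
    The original element is left untouched."""
    field = FIELD_BY_AXIS[axis]
    changed = copy.deepcopy(element)
    changed[field] = value
    return changed, field


def detected(before: dict, after: dict, field: str) -> bool:
    """True if two extractions differ on the perturbed field and only on it."""
    keys = before.keys() | after.keys()
    diff = {k for k in keys if before.get(k) != after.get(k)}
    return diff == {field}


element = {"x": 100, "font_color": "#000000", "text": "Q3 results"}
changed, field = perturb(element, "style", "#FF0000")
```

In a real evaluation, `before` and `after` would be the model's extractions from the rendered original and perturbed slides. The example passes the ground-truth dictionaries directly only to show the shape of the check.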
Higher-Level Comprehension
Finally, the framework probes the VLM’s capacity for more complex understanding, such as piecing together the narrative flow of an entire presentation. This involves tasks like reordering a shuffled set of slides back into their original, logical sequence, which requires a deeper grasp of the content’s context and progression.
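One common way to score a reordering task like this is a rank correlation between the predicted and original slide order. The sketch below uses Kendall's tau as an illustrative choice; the article does not say which metric the paper uses, and the slide identifiers are made up.

```python
from itertools import combinations


def kendall_tau(original, predicted):
    """Kendall rank correlation between two orderings of the same slides.
    Returns 1.0 for identical order and -1.0 for fully reversed order."""
    if sorted(original) != sorted(predicted):
        raise ValueError("both orderings must contain the same slides")
    position = {slide: i for i, slide in enumerate(predicted)}
    pairs = list(combinations(original, 2))
    if not pairs:
        return 1.0
    # A pair is concordant when the prediction keeps its original relative order.
    concordant = sum(1 for a, b in pairs if position[a] < position[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


original = ["s1", "s2", "s3"]
score_swapped = kendall_tau(original, ["s2", "s1", "s3"])
```

A pairwise measure like this gives partial credit when most slides are in roughly the right order, which an exact-match check on the whole sequence would not.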
To build this evaluation, the researchers used publicly available presentation decks and meticulously extracted ground-truth data from PowerPoint XML and live renderings. This allowed them to create a standardized and verifiable schema for comparison.
Key Findings and Implications
The empirical results from VLM-SlideEval offer valuable insights. While newer VLMs, such as o3 and GPT-5 variants, generally performed better than older models like GPT-4.1 and GPT-4o, all models showed some significant limitations. They struggled with pixel-accurate extraction of elements and exhibited inconsistent behavior when faced with controlled perturbations. More notably, VLMs did not reliably capture the narrative structure across multiple slides, indicating a challenge in understanding the broader story a presentation aims to tell.
However, the models did show competence in understanding content on a single-slide basis. This suggests that while they can interpret individual pieces of information, connecting those pieces into a coherent narrative remains a hurdle.
These findings are crucial for the future of AI-driven content creation. They highlight that current VLMs have limits when it comes to fine-grained slide evaluation. The research advocates for the development of more ‘calibrated, critic-in-the-loop’ evaluators. These would be AI systems that can provide precise and verifiable feedback, guiding iterative improvements and selections in automated presentation generation pipelines.
The paper, titled “VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT,” provides a comprehensive look at these challenges and opportunities.


