TLDR: This paper introduces a two-stage system for generating scientific figure captions that are both accurate and stylistically consistent with the author’s writing. It uses the LaMP-Cap dataset, combining context filtering and category-specific prompt optimization in the first stage, and then applies few-shot prompting with author profile figures for stylistic refinement in the second stage. Experiments show significant improvements in ROUGE-1 recall and BLEU scores, demonstrating the effectiveness of integrating contextual understanding with author-specific stylistic adaptation.
Scientific figures are essential for conveying complex information, but writing accurate and stylistically consistent captions for them can be a time-consuming and challenging task for researchers. This is where automated caption generation systems come into play, offering a promising solution to enhance scientific communication efficiency.
A recent research paper, “Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge,” by Watcharapong Timklaypachara, Monrada Chiewhawan, Nopporn Lekuthai, and Titipat Achakulvisut, introduces an innovative system designed to generate high-quality scientific figure captions. The system focuses on integrating figure-related textual context with the unique writing styles of individual authors.
The core of their approach is a two-stage pipeline. The first stage generates content-grounded captions, keeping them relevant to the figure while filtering out extraneous information. It involves several key steps:
Stage 1: Content-Grounded Caption Generation
First, the system employs a sentence-based filtering mechanism to identify and retain only the most relevant information from input paragraphs. This helps in reducing noise and focusing on the core content related to the figure.
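The paper does not include its filtering code, so the sketch below is a hypothetical illustration of what sentence-level filtering can look like: it keeps only sentences that mention the figure or share vocabulary with a set of query terms. All function names and heuristics here are invented for illustration, not the authors' implementation.

```python
import re

def split_sentences(text):
    # Naive splitter on sentence-ending punctuation; a real system
    # would use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def filter_relevant(paragraphs, figure_ref, query_terms):
    """Keep sentences that mention the figure or overlap with query terms."""
    kept = []
    for para in paragraphs:
        for sent in split_sentences(para):
            words = {w.lower() for w in re.findall(r"[A-Za-z]+", sent)}
            if figure_ref.lower() in sent.lower() or words & set(query_terms):
                kept.append(sent)
    return kept
```

Even a crude filter like this removes most off-topic sentences before they reach the caption-generation prompt, which is the point of this stage.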
Next, it utilizes category-level prompt optimization. Scientific papers often fall into specific domains like Computer Science, Mathematics, or Biology. The researchers developed category-focused prompt templates using advanced tools like MIPROv2 and SIMBA from the DSPy Toolkit. These tools help create instruction-example pairs and apply feedback-driven optimization to generate more precise and relevant captions for each paper category.
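MIPROv2 and SIMBA search over instructions and examples automatically. As a rough illustration of what a category-keyed prompt might look like, here is a hand-written sketch; the templates and the `build_prompt` helper are hypothetical, not the paper's optimized prompts.

```python
# Hypothetical category-specific templates; the paper derives these
# automatically with DSPy's MIPROv2/SIMBA optimizers rather than by hand.
CATEGORY_PROMPTS = {
    "cs": ("Write a one-sentence caption for the figure, naming the "
           "method, dataset, and metric shown."),
    "math": ("Write a precise caption stating what quantity is plotted "
             "and the result the figure demonstrates."),
    "default": "Write a concise, accurate caption for the figure.",
}

def build_prompt(category, figure_context):
    # Fall back to a general template for unrecognized categories.
    template = CATEGORY_PROMPTS.get(category, CATEGORY_PROMPTS["default"])
    return f"{template}\n\nContext:\n{figure_context}\n\nCaption:"
```

The optimizers' job is essentially to fill in and refine the strings in such a table from data, guided by a caption-quality metric.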
Finally, for papers that might span multiple categories, a caption candidate selection process is used. An advanced language model, Gemini-2.5 Flash, acts as a reranker, evaluating multiple caption candidates generated for different categories and selecting the single best one based on clarity, relevance, accuracy, and tone.
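Candidate selection reduces to scoring each category's caption and keeping the best one. In the paper the scorer is Gemini-2.5 Flash judging clarity, relevance, accuracy, and tone; the sketch below replaces that LLM call with a pluggable scoring function, with hypothetical names throughout.

```python
def select_best_caption(candidates, score_fn):
    """Return the highest-scoring caption.

    `score_fn` stands in for the LLM reranker: in the real pipeline it
    would ask Gemini-2.5 Flash to rate clarity, relevance, accuracy,
    and tone; here any callable returning a number will do.
    """
    if not candidates:
        raise ValueError("no caption candidates to rerank")
    return max(candidates, key=score_fn)
```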
Stage 2: Profile-Informed Stylistic Refinement
The second stage addresses personalization. While the first stage ensures content accuracy, this stage refines the captions to match the author’s specific writing style. This is achieved through few-shot prompting with ‘profile figures’ from the same paper: the profile captions serve as structural references, helping the system match the author’s consistent writing style while enforcing a caption length limit for conciseness.
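In code, profile-informed refinement amounts to assembling a few-shot prompt from the paper's other figure–caption pairs. A minimal sketch, with hypothetical names and an assumed word-count limit:

```python
def build_fewshot_prompt(profile_pairs, target_context, max_words=30):
    """Assemble a few-shot prompt from (context, caption) profile pairs.

    The profile captions come from other figures in the same paper, so
    the model sees the author's own style before writing the new caption.
    """
    blocks = [f"Figure context: {ctx}\nCaption: {cap}" for ctx, cap in profile_pairs]
    blocks.append(
        f"Figure context: {target_context}\n"
        f"Caption (at most {max_words} words):"
    )
    return "\n\n".join(blocks)
```

Because the examples and the target share an author, the model's usual few-shot imitation behavior doubles as style transfer.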
The researchers evaluated their system using metrics such as BLEU and ROUGE, which measure how closely generated captions match reference captions in lexical overlap and structure. Category-specific prompts outperformed a general-purpose prompt, boosting ROUGE-1 recall by 8.3%.
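ROUGE-1 recall, for example, is simply the fraction of reference unigrams (counted with multiplicity) that also appear in the generated caption. A minimal sketch:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams, with multiplicity, found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clipped overlap: each candidate token can match at most its count.
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / sum(ref.values())
```

For instance, the candidate "the cat sat" against the reference "the cat sat on the mat" recalls 3 of 6 reference tokens, giving 0.5.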
Even more impressively, the profile-informed stylistic refinement led to substantial gains, with BLEU scores improving by 40–48% and ROUGE precision by 25–27%. This indicates that the system can generate captions that are not only scientifically accurate but also stylistically faithful to the source paper, making them more consistent with the author’s overall manuscript.
This work highlights the power of combining a deep understanding of contextual information with an adaptation to author-specific writing styles. It represents a significant step forward in automating scientific communication, potentially reducing the manual burden on researchers and improving the consistency of scientific manuscripts. You can read the full paper here.