TLDR: A new AI system addresses challenges in e-commerce listing generation by combining multimodal and multitask learning with a hierarchical generation process. It uses a single vision model to predict product attributes and price, then feeds these predicted attributes into a text generator to create factually consistent descriptions, significantly reducing “hallucinations” and improving efficiency compared to existing methods.
Creating compelling and accurate product descriptions for e-commerce listings is a crucial yet time-consuming task for retailers. While generative AI, particularly vision-language models (VLMs), offers automation, current systems often struggle with factual inaccuracies, commonly known as “hallucinations.” Additionally, using separate single-task models for different aspects, such as attribute tagging, description writing, and price estimation, can be inefficient and fail to capture a holistic understanding of the product.
A new research paper introduces an innovative end-to-end, multi-task system designed to generate factually grounded textual listings directly from a single product image. This system aims to overcome the challenges of architectural fragmentation and factual inconsistency prevalent in existing solutions.
The Core Innovations
The proposed system is built upon two key contributions:
First, it applies a multi-task learning approach to fine-tune a vision encoder. This means a single vision backbone is trained simultaneously on multiple objectives: predicting product attributes (such as color, hemline, and neck style) and estimating the product’s price. This joint training forces the model to learn a rich, shared visual representation that benefits all related tasks, leading to a more comprehensive understanding of the product.
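To make the setup concrete, here is a minimal PyTorch sketch of such a shared encoder. The class name, attribute set, and head dimensions are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MultiTaskProductEncoder(nn.Module):
    """Shared vision backbone with per-attribute classification heads
    and a price-regression head (illustrative sketch)."""
    def __init__(self, vision_backbone, hidden_dim, attribute_classes):
        super().__init__()
        self.backbone = vision_backbone  # assumed to return a pooled embedding, e.g. a ViT
        # One linear classifier per attribute (color, hemline, neck style, ...),
        # all trained on top of the same shared representation.
        self.attribute_heads = nn.ModuleDict({
            name: nn.Linear(hidden_dim, n_classes)
            for name, n_classes in attribute_classes.items()
        })
        self.price_head = nn.Linear(hidden_dim, 1)  # scalar price estimate

    def forward(self, images):
        features = self.backbone(images)               # (batch, hidden_dim)
        attr_logits = {name: head(features)
                       for name, head in self.attribute_heads.items()}
        price = self.price_head(features).squeeze(-1)  # (batch,)
        return features, attr_logits, price
```

Because every head backpropagates into the same backbone, gradients from price regression and attribute classification jointly shape the shared visual representation.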
Second, the system introduces a hierarchical generation process to significantly reduce factual hallucinations. Instead of directly generating text from the image, the model first predicts a set of structured attributes. These predicted attributes are then embedded into a prompt, which is fed to the text decoder. This approach constrains the language model, ensuring that the generated text remains faithful to the visual evidence and the model’s own factual predictions.
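As a rough illustration, the prompt-construction step might look like the following. The `build_prompt` helper and its template wording are hypothetical, since the exact format is not reproduced here:

```python
# Hypothetical sketch: serialize predicted attributes into a factual prompt
# that constrains the downstream text decoder.
def build_prompt(attributes: dict) -> str:
    facts = ", ".join(f"{name}: {value}" for name, value in attributes.items())
    return f"Write a product name and description for an item with {facts}."

print(build_prompt({"color": "navy blue", "hemline": "asymmetric", "neck style": "V-neck"}))
# Write a product name and description for an item with color: navy blue, ...
```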
How the System Works
At its heart, the system uses a Vision Transformer (ViT) as its encoder, which processes the product image to extract both fine-grained local features (like fabric texture) and holistic context (like overall style). This visual information is then passed to task-specific heads for attribute prediction and price estimation.
For text generation, a T5-Small model acts as the decoder. During the inference phase, the classification head first predicts the product’s attributes. These predicted attributes are then used to construct a detailed, factual prompt. This prompt, combined with the visual embedding from the encoder, guides the T5 decoder to generate the final product description and name. This two-step “predict then generate” strategy is crucial for maintaining factual consistency.
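Putting the two steps together, an inference loop in the spirit of this design could look like the sketch below. It builds on the encoder sketch above, omits the visual-embedding conditioning of the decoder for brevity, and the `t5-small` checkpoint and prompt template are assumptions:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
decoder = T5ForConditionalGeneration.from_pretrained("t5-small")

@torch.no_grad()
def generate_listing(encoder, image, attribute_vocab):
    # Step 1: predict structured attributes from the image.
    _, attr_logits, _ = encoder(image.unsqueeze(0))
    predicted = {name: attribute_vocab[name][logits.argmax(dim=-1).item()]
                 for name, logits in attr_logits.items()}
    # Step 2: embed the predictions in a factual prompt and decode the listing.
    facts = ", ".join(f"{k}: {v}" for k, v in predicted.items())
    prompt = f"Write a product name and description for an item with {facts}."
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = decoder.generate(**inputs, max_new_tokens=96)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```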
The entire system is optimized using a joint loss function, combining losses from attribute classification, price regression, and text generation. This ensures that the model learns the intricate relationships between visual cues, attributes, and market value.
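A plausible form of that objective is a weighted sum of the three task losses, sketched below; the weights `w_price` and `w_text` are hypothetical hyperparameters, not values from the paper:

```python
import torch.nn.functional as F

def joint_loss(attr_logits, attr_labels, price_pred, price_true,
               text_logits, text_labels, w_price=0.5, w_text=1.0):
    # Cross-entropy summed over the per-attribute classification heads.
    attr_loss = sum(F.cross_entropy(logits, attr_labels[name])
                    for name, logits in attr_logits.items())
    # Mean squared error for the price-regression head.
    price_loss = F.mse_loss(price_pred, price_true)
    # Token-level cross-entropy for the generated description
    # (text_logits: batch x seq x vocab, text_labels: batch x seq).
    text_loss = F.cross_entropy(text_logits.transpose(1, 2), text_labels)
    return attr_loss + w_price * price_loss + w_text * text_loss
```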
Key Findings and Performance
Experiments demonstrated the advantages of this architecture. The multi-task approach outperformed independent single-task models, achieving a 3.6% higher R² in price prediction and a 6.6% improvement in F1 score for attribute classification. This indicates that learning multiple tasks together helps the model capture subtle visual cues that influence product value.
Critically, the hierarchical generation process proved highly effective in combating factual hallucinations. It slashed the hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non-hierarchical approach. While this improvement in factual consistency came with a minor trade-off in text fluency (a 3.5% lower ROUGE-L score compared to direct vision-to-language models), the paper argues that factual accuracy is paramount in commercial settings.
Furthermore, the hierarchical approach also reduced the latency of the autoregressive text generation process by a factor of 3.5, making the system more efficient.
Looking Ahead
Despite its advancements, the paper acknowledges several limitations and outlines future research directions. The visual-only price prediction still faces challenges due to non-visual factors like brand identity. Future work could involve fusing the learned image embeddings with other data modalities, such as brand or seller IDs, to improve price accuracy.
The current model operates on a single “hero” image and is trained on a specific domain (women’s clothing in India). Expanding the vision encoder to handle multiple images or videos, and adapting the model to other product categories, are important next steps. Integrating Retrieval-Augmented Generation (RAG) pipelines could also provide real-time market context, and scaling the language module to larger LLMs could enhance text quality and persuasiveness.
This research presents a robust framework for generating accurate and efficient e-commerce listings, highlighting the power of integrated multimodal and multitask learning. You can read the full paper here: A Multimodal, Multitask System for Generating E-Commerce Text Listings from Images.