TLDR: Pinterest has developed a highly scalable and cost-effective system for extracting structured product data from e-commerce websites. Their solution, called Visual Page Representation (VPR), combines structural, visual, and text modalities of a webpage into a compact form. This allows simpler machine learning models like XGBoost to achieve high accuracy, outperforming more complex and expensive Large Language Models (LLMs) like GPT. The system processes over 1,000 URLs per second at a significantly lower cost, demonstrating a practical and efficient method for large-scale web information extraction.
The internet is a vast ocean of information, but much of it is unstructured, making it difficult for applications to understand and utilize. For a platform like Pinterest, which helps users discover and save ideas, extracting structured product data from e-commerce websites is crucial. This data enhances user experiences, improves content distribution, ensures content quality, and drives website traffic. With over 500 million monthly active users and 500 billion ‘Pins’, Pinterest faced a significant challenge: how to accurately and scalably extract this data at a manageable cost.
Traditional methods for web data extraction often fall short. Simple approaches like using metadata embedded in HTML (schema.org, Open Graph) are frequently unreliable or incomplete. Older machine learning methods, like Wrapper Induction, require a separate model for each website, which becomes unmanageable at Pinterest’s scale. More modern deep learning models, while powerful, are often too expensive and computationally intensive to run on every webpage.
Introducing Visual Page Representation (VPR)
To overcome these challenges, Pinterest developed an approach centered on what they call Visual Page Representation (VPR). VPR is a compact yet expressive way to represent a webpage, combining its structural (HTML), visual (layout, styles), and textual information. Imagine a snapshot of a webpage that captures not only what you see but also the underlying HTML structure and how elements are visually arranged. This includes details like text size, colors, and whether text is struck through, which is crucial for telling a crossed-out list price apart from the current sale price.
VPR is generated by a Pinterest-developed rendering service based on the Chromium browser. This service processes a webpage URL and captures all visible HTML nodes, their text, images, and important attributes like links and styles. This rich representation allows simpler machine learning models to understand complex web page layouts and extract information accurately.
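To make the idea concrete, here is a minimal sketch of what a single VPR node and its numeric features could look like. The `VprNode` class, its field names, and the `node_features` function are hypothetical and chosen for illustration; the article does not describe Pinterest's actual schema or feature set.

```python
from dataclasses import dataclass, field


@dataclass
class VprNode:
    """One visible HTML node in a hypothetical VPR-like record."""
    tag: str                     # structural: HTML tag name
    text: str                    # textual: visible text of the node
    x: float                     # visual: bounding box in page pixels
    y: float
    width: float
    height: float
    font_size: float = 0.0       # visual: rendered text size
    color: str = ""              # visual: rendered text color
    strikethrough: bool = False  # visual: often marks an old list price
    attributes: dict = field(default_factory=dict)  # e.g. href, src


def node_features(node: VprNode, page_width: float, page_height: float) -> list[float]:
    """Flatten a node into numeric features that a tree model could consume."""
    return [
        node.x / page_width,                                      # relative horizontal position
        node.y / page_height,                                     # relative vertical position
        (node.width * node.height) / (page_width * page_height),  # share of page area
        node.font_size,
        1.0 if node.strikethrough else 0.0,
        1.0 if any(ch.isdigit() for ch in node.text) else 0.0,    # text contains a digit
        float(len(node.text)),
    ]


list_price = VprNode("span", "$120.00", x=640, y=300, width=80, height=20,
                     font_size=14, strikethrough=True)
features = node_features(list_price, page_width=1280, page_height=2000)
```

Because each node carries structure, position, style, and text together, a flat feature vector like this is enough input for a gradient-boosted tree model, with no need for a model that reads raw HTML or pixels.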
How Pinterest’s System Works
The system operates through three main workflows: rendering, training, and extraction.
- Rendering: A webpage URL is fed into a renderer, which generates the VPR of the page.
- Training: Human annotators use a custom labeling tool to mark specific attributes (like price or title) directly on the VPR. This labeled data is then used to train machine learning models, specifically eXtreme Gradient Boosting (XGBoost) models. These models learn to classify the type of page (e.g., product page, error page) and then extract specific attributes from product pages.
- Extraction: When a new webpage comes in, the system first determines its type using a ‘Page Type Classifier’. If it’s a product page, a ‘Product Attributes Extractor’ then identifies and pulls out key information like the product title, currency, sale price, list price, and main image.
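The two-stage extraction flow can be sketched as follows. This is only an illustration under stated assumptions: the hand-written scoring functions stand in for the trained XGBoost models, and the node dictionaries, `classify_page`, `SCORERS`, and `extract` are hypothetical names, not Pinterest's API. Scoring every node per attribute and keeping the top-scoring one is one plausible framing, not a confirmed detail of the system.

```python
from typing import Optional

# Hypothetical VPR node: {"text": str, "font_size": float, "strikethrough": bool}
Node = dict


def classify_page(nodes: list[Node]) -> str:
    """Stand-in for the Page Type Classifier (a trained model in the real system)."""
    has_price = any("$" in n["text"] for n in nodes)
    return "product" if has_price else "other"


def score_title(node: Node) -> float:
    # Toy rule: large text that is not a price looks like a title.
    return node["font_size"] if "$" not in node["text"] else 0.0


def score_sale_price(node: Node) -> float:
    return 1.0 if "$" in node["text"] and not node["strikethrough"] else 0.0


def score_list_price(node: Node) -> float:
    # A struck-through price is treated as the original list price.
    return 1.0 if "$" in node["text"] and node["strikethrough"] else 0.0


# Stand-ins for per-attribute models in the Product Attributes Extractor.
SCORERS = {
    "title": score_title,
    "sale_price": score_sale_price,
    "list_price": score_list_price,
}


def extract(nodes: list[Node]) -> Optional[dict]:
    """Run the page type gate first, then pick the best node per attribute."""
    if classify_page(nodes) != "product":
        return None
    result = {}
    for attribute, scorer in SCORERS.items():
        best = max(nodes, key=scorer)
        result[attribute] = best["text"] if scorer(best) > 0 else None
    return result


page = [
    {"text": "Linen Throw Pillow", "font_size": 28.0, "strikethrough": False},
    {"text": "$40.00", "font_size": 14.0, "strikethrough": True},
    {"text": "$29.99", "font_size": 18.0, "strikethrough": False},
]
product = extract(page)
error_page = extract([{"text": "Page not found", "font_size": 28.0, "strikethrough": False}])
```

Gating on page type first means the attribute models only ever run on pages that are likely to contain a product, which keeps error pages and category listings from producing spurious extractions.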
The power of VPR lies in its ability to provide comprehensive contextual understanding. Unlike pure HTML, it captures visual relationships. Unlike simple screenshots, it retains underlying semantic information like image URLs. This dual-layered information allows for highly accurate mapping of visual elements to their functional roles.
Efficiency and Cost-Effectiveness
One of the most significant achievements of this system is its cost-effectiveness. Large Language Models (LLMs) like GPT can perform similar extraction tasks, but at a much higher price. Pinterest’s research found that the VPR + XGBoost solution is about 1000 times more cost-effective than even the cheapest GPT alternatives.
To further reduce costs, Pinterest also implemented a clever ‘distillation’ process. Once the accurate VPR-based XGBoost models are trained, they are used to automatically label data for simpler, HTML-only ‘Wrapper Induction’ models for specific domains. These HTML-only models are much cheaper to run because they don’t require the more computationally intensive visual rendering step. This allowed Pinterest to transition approximately 60% of their domains to this more cost-effective approach without sacrificing accuracy.
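A minimal sketch of this distillation idea, assuming the teacher's labels arrive as DOM paths and that a per-domain wrapper is simply the most frequently labeled path for each attribute. The majority-vote rule, the path strings, and the function names are illustrative assumptions; the article does not specify how Pinterest's Wrapper Induction models are built.

```python
from collections import Counter


def induce_wrapper(teacher_labeled_pages: list[dict]) -> dict:
    """Pick, per attribute, the DOM path the teacher model labeled most often.

    Each page is a dict of attribute -> DOM path, as labeled by the
    VPR-based teacher on pages from a single domain.
    """
    votes: dict = {}
    for page in teacher_labeled_pages:
        for attribute, path in page.items():
            votes.setdefault(attribute, Counter())[path] += 1
    return {attribute: counter.most_common(1)[0][0]
            for attribute, counter in votes.items()}


def apply_wrapper(wrapper: dict, html_nodes: dict) -> dict:
    """Extract attributes from plain HTML (DOM path -> text); no rendering needed."""
    return {attribute: html_nodes.get(path) for attribute, path in wrapper.items()}


# Teacher labels for three pages of one hypothetical domain; the third is noisy.
labels = [
    {"title": "html/body/h1", "sale_price": "html/body/div[2]/span"},
    {"title": "html/body/h1", "sale_price": "html/body/div[2]/span"},
    {"title": "html/body/h1", "sale_price": "html/body/div[3]/span"},
]
wrapper = induce_wrapper(labels)

new_page = {"html/body/h1": "Ceramic Mug", "html/body/div[2]/span": "$12.00"}
extracted = apply_wrapper(wrapper, new_page)
```

The cheap model only has to memorize where a given domain puts each attribute, so it needs no human labels of its own and can run on raw HTML.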
Real-World Impact
The system has been deployed in production across more than 8,000 websites. It achieves 98% average precision across key attributes like main image, title, availability, and prices. It can also process over 1,000 URLs per second, and the average cost to process 1,000 URLs, including rendering and extraction, is just $0.0079.
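To put these figures in perspective, the short calculation below works out a daily volume and cost. It assumes the system runs at 1,000 URLs per second for a full day, which the article does not claim, and it applies the roughly 1000x cost ratio quoted earlier as a rough multiplier for an LLM-based alternative.

```python
COST_PER_1000_URLS_USD = 0.0079  # from the article, rendering plus extraction
URLS_PER_SECOND = 1_000          # from the article; assumed sustained all day
SECONDS_PER_DAY = 86_400
LLM_COST_MULTIPLIER = 1_000      # the article's "about 1000 times" figure

urls_per_day = URLS_PER_SECOND * SECONDS_PER_DAY
daily_cost = urls_per_day / 1_000 * COST_PER_1000_URLS_USD
llm_daily_cost = daily_cost * LLM_COST_MULTIPLIER

print(f"URLs per day:          {urls_per_day:,}")
print(f"VPR + XGBoost per day: ${daily_cost:,.2f}")
print(f"LLM estimate per day:  ${llm_daily_cost:,.2f}")
```

Under these assumptions the system would handle about 86.4 million URLs per day for roughly $683, while the same volume at 1000 times the unit cost would run to roughly $683,000 per day.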
In conclusion, by combining Visual Page Representation with cost-effective XGBoost models, Pinterest has built a scalable, accurate, and affordable system for extracting structured product data from the vast and varied landscape of the internet. The approach shows how integrating visual and structural information can lead to practical solutions for web data extraction, as detailed in their research paper, “Cross-Domain Web Information Extraction at Pinterest”.


