TLDR: A new research paper introduces ATLAS, the first benchmark for Harmonized Tariff Schedule (HTS) code classification, together with a fine-tuned LLaMA-3.3-70B model of the same name. The ATLAS model significantly outperforms leading LLMs such as GPT-5-Thinking and Gemini-2.5-Pro-Thinking in both 10-digit and 6-digit HTS classification accuracy. It is also substantially more cost-effective and can be self-hosted for privacy. By releasing an open-source dataset and model, the work targets a critical bottleneck in global trade: accurate product classification and compliance.
Accurately classifying products under the Harmonized Tariff Schedule (HTS) is a fundamental yet challenging aspect of global trade. Misclassifications can lead to severe consequences, including shipment delays or even complete suspensions of deliveries, as seen with major postal operators halting services to the U.S. due to incomplete customs documentation. This critical bottleneck has, until now, received limited attention from the machine learning community.
A recent research paper introduces a significant advancement in this area: ATLAS. This work presents the first benchmark for HTS code classification, meticulously derived from the U.S. Customs Rulings Online Search System (CROSS). Beyond just a benchmark, the researchers have developed a specialized model, also named ATLAS, which is a fine-tuned version of LLaMA-3.3-70B.
The Challenge of HTS Classification
Every product entering the global market must be assigned a tariff classification code. In the U.S. HTS these codes have ten digits. The first six digits follow the Harmonized System (HS) maintained by the World Customs Organization (WCO) and are shared internationally, while the last four digits are U.S.-specific extensions. The HTS is deeply hierarchical, spanning 22 sections, 99 chapters, and thousands of subheadings. Accurate classification therefore often hinges on nuanced distinctions that are difficult for humans to apply consistently at scale.
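To make the hierarchy concrete, here is a minimal sketch that splits a 10-digit U.S. HTS code into its nested levels. It assumes the common U.S. layout: a 2-digit chapter, a 4-digit heading, a 6-digit HS subheading (the internationally harmonized part), an 8-digit tariff line, and a 2-digit statistical suffix completing the 10 digits. The class and function names are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HTSCode:
    """Nested prefixes of a U.S. HTS code (hypothetical helper type)."""
    chapter: str      # digits 1-2: HS chapter
    heading: str      # digits 1-4: HS heading
    subheading: str   # digits 1-6: internationally harmonized HS subheading
    tariff_line: str  # digits 1-8: U.S. tariff (rate-of-duty) line
    full: str         # digits 1-10: full U.S. statistical reporting number


def parse_hts(code: str) -> HTSCode:
    """Parse a code such as '8471.30.01.00'; dots and spaces are ignored."""
    digits = code.replace(".", "").replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError(f"expected a 10-digit HTS code, got {code!r}")
    # Each level is a prefix of the next, mirroring the tree structure.
    return HTSCode(digits[:2], digits[:4], digits[:6], digits[:8], digits)


def same_hs_subheading(a: str, b: str) -> bool:
    """True if two codes agree on the six internationally shared digits."""
    return parse_hts(a).subheading == parse_hts(b).subheading
```

For example, `parse_hts("8471.30.01.00").subheading` returns `"847130"` (the code string here is used only to illustrate the format). Because every level is a prefix of the next, two codes can agree internationally (same six digits) while still differing in the U.S.-specific extension.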
The HTS spans over 17,000 pages of PDF documents, which makes manual code assignment impractical at scale. Recent trade policy changes, such as modifications to the de minimis exemption, add urgency to the search for automated solutions, because more imported goods now require valid HTS codes.
Introducing ATLAS: A Specialized AI Solution
The ATLAS model represents a significant leap forward. It was developed by fine-tuning LLaMA-3.3-70B using a supervised fine-tuning (SFT) approach on the newly created CROSS dataset. This dataset, a key contribution of the research, was built by systematically scraping legally binding decisions from the U.S. Customs and Border Protection (CBP), then transforming these lengthy, unstructured rulings into a machine-learning-ready prompt-response format.
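The transformation from a ruling into a training example can be sketched as follows. This is a minimal illustration, assuming each ruling has already been reduced to a product description, an assigned code, and a short rationale. The `Ruling` fields, the prompt wording, and the `"HTS code:"` response suffix are hypothetical choices, not the paper's actual schema or prompt template.

```python
import json
from dataclasses import dataclass


@dataclass
class Ruling:
    """Hypothetical structured view of one CROSS ruling."""
    ruling_id: str    # ruling number, e.g. a placeholder like "N000000"
    description: str  # product description extracted from the ruling text
    hts_code: str     # the 10-digit code the ruling assigns
    rationale: str    # condensed reasoning from the ruling


def to_sft_example(r: Ruling) -> dict:
    """Turn one ruling into a prompt-response pair for supervised fine-tuning."""
    prompt = (
        "Classify the following product under the U.S. Harmonized Tariff "
        "Schedule and give the 10-digit HTS code.\n\n"
        f"Product: {r.description}"
    )
    # Put the final code last so it is easy to extract at evaluation time.
    response = f"{r.rationale}\n\nHTS code: {r.hts_code}"
    return {"id": r.ruling_id, "prompt": prompt, "response": response}


def to_jsonl_lines(rulings: list[Ruling]) -> list[str]:
    """Serialize examples as JSON Lines, a common format for SFT data."""
    return [json.dumps(to_sft_example(r)) for r in rulings]
```

Keeping the rationale before the code mirrors how rulings justify a classification before stating it; whether ATLAS was trained with rationales in its responses is not specified here.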
The researchers benchmarked ATLAS against leading proprietary and open-source models, including GPT-5-Thinking and Gemini-2.5-Pro-Thinking. The results demonstrate ATLAS’s superior performance:
- For fully correct 10-digit classifications (U.S.-specific), ATLAS achieved 40% accuracy, 15 percentage points higher than GPT-5-Thinking and 27.5 points higher than Gemini-2.5-Pro-Thinking.
- For partially correct 6-digit classifications (globally harmonized), ATLAS reached 57.5% accuracy, 2 percentage points higher than GPT-5-Thinking.
- On average, ATLAS correctly predicted 6.3 digits out of 10, indicating a finer-grained understanding of tariff codes.
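The three metrics above can be computed from pairs of predicted and reference codes. The sketch below assumes that "digits correct" means the length of the matching prefix: in a hierarchy, a wrong chapter makes every deeper digit wrong. The paper's exact definition may differ (for example, counting position-wise matches), and the function names are hypothetical.

```python
def digits_only(code: str) -> str:
    """Strip dots, spaces and other separators from a code string."""
    return "".join(ch for ch in code if ch.isdigit())


def matched_prefix_len(pred: str, gold: str) -> int:
    """Number of leading digits on which prediction and reference agree."""
    n = 0
    for p, g in zip(digits_only(pred), digits_only(gold)):
        if p != g:
            break
        n += 1
    return n


def evaluate(preds: list[str], golds: list[str]) -> dict:
    """10-digit accuracy, 6-digit accuracy and average correct digits."""
    if not golds or len(preds) != len(golds):
        raise ValueError("need equally sized, non-empty prediction/gold lists")
    prefixes = [matched_prefix_len(p, g) for p, g in zip(preds, golds)]
    n = len(golds)
    return {
        "acc_10": sum(k >= 10 for k in prefixes) / n,  # fully correct code
        "acc_6": sum(k >= 6 for k in prefixes) / n,    # correct HS subheading
        "avg_digits": sum(prefixes) / n,               # mean matched prefix
    }
```

For instance, with four predictions against the reference `"8471300100"` that match on 10, 7, 4 and 0 leading digits (`"8471300100"`, `"8471300200"`, `"8471410150"`, `"9403608081"`), `evaluate` returns `{'acc_10': 0.25, 'acc_6': 0.5, 'avg_digits': 5.25}`. Under this definition, 6-digit accuracy is always at least 10-digit accuracy, consistent with the ordering of the reported numbers.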
Beyond Accuracy: Cost-Efficiency and Data Privacy
In addition to its accuracy, ATLAS offers significant practical advantages. Its estimated inference cost is nearly 5 times lower than that of GPT-5-Thinking and 8 times lower than that of Gemini-2.5-Pro-Thinking. This cost-efficiency is crucial for large-scale deployment in global trade operations.
Furthermore, ATLAS can be self-hosted, a vital feature for industries that handle sensitive trade and compliance data, such as automotive, industrials, and semiconductors. Self-hosting preserves data privacy by keeping sensitive information within secure, controlled environments.
A New Frontier for LLM Research
By releasing both the dataset and the fine-tuned model, the researchers aim to establish HTS classification as a new community benchmark task. Despite ATLAS setting a strong baseline, the benchmark remains highly challenging, with 10-digit accuracy still at 40%. This underscores the need for continued innovation in areas like retrieval augmentation, reasoning, and alignment methods to further advance progress on this high-impact global trade problem.
This research, detailed in the paper ATLAS: Benchmarking and Adapting LLMs for Global Trade via Harmonized Tariff Code Classification, paves the way for more efficient, accurate, and secure product classification, directly contributing to the resilience of global trade and supply chains.


