A New Milestone in Vietnamese Legal AI: Introducing the VLQA Dataset

TLDR: The VLQA dataset is the first comprehensive, large, and high-quality Vietnamese dataset for legal question answering. It comprises over 3,000 real-world legal questions from citizens, meticulously annotated by legal professionals with references to nearly 60,000 statutory articles. This dataset addresses the critical scarcity of resources for legal Natural Language Processing (NLP) in low-resource languages like Vietnamese, enabling the development and evaluation of more reliable legal AI systems. Experiments with state-of-the-art models show its utility, while also highlighting current limitations of large language models (LLMs) in legal reasoning, such as factual inaccuracies and hallucinations.

The field of Artificial Intelligence (AI) and Natural Language Processing (NLP) has seen remarkable advancements, particularly with the rise of large language models (LLMs). These powerful models are increasingly being explored for complex tasks, including those within the legal domain. However, despite their impressive capabilities, there’s a significant gap between their current performance and the ultimate goal of fully automating legal tasks. This challenge is even more pronounced in countries with distinct legal systems and languages, especially those considered ‘low-resource’ in terms of available digital data, such as Vietnam.

Legal NLP in Vietnamese faces a major hurdle: the scarcity of high-quality, annotated data. Labeled legal corpora are essential for training, validating, and fine-tuning AI models for legal applications, and their absence has held the field back. To address this need, a new research paper introduces the VLQA dataset.

What is VLQA?

VLQA, which stands for Vietnamese Legal Question Answering, is introduced as the first comprehensive, large, and high-quality dataset specifically designed for the Vietnamese legal domain. It aims to bridge the gap between complex legal knowledge and public understanding by providing a robust foundation for developing advanced legal AI systems.

The dataset stands out for several reasons. Firstly, it comprises over 3,000 real-world legal questions posed by Vietnamese citizens, so the data reflects genuine concerns encountered by everyday people. Secondly, these questions are meticulously annotated by legal professionals, with references to relevant statutory articles drawn from an expansive corpus of nearly 60,000 legal provisions. The authors present VLQA as the largest expert-verified legal question answering dataset covering any statutory domain to date. Lastly, it provides both the relevant legal articles and detailed, long-form answers, supporting two fundamental legal NLP tasks: information retrieval and question answering.

How VLQA Was Built

The creation of the VLQA dataset involved a meticulous four-phase process to ensure its quality and comprehensiveness. It began with the collection of an expansive article-based legal corpus, encompassing 2,162 legal documents across 27 common domains within the Vietnamese legal framework. This corpus, consisting of 59,636 articles, captures the structural hierarchy of Vietnamese legislation, providing rich context for legal queries.

Following this, question-answer-article triplets were collected from well-known online legal consultation platforms in Vietnam. These platforms feature user-generated legal concerns, ensuring the real-world relevance of the questions. A rigorous filtering process was applied to remove irrelevant or duplicated content and to align explicit legal references within the answers to the collected articles.
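To make the filtering step concrete, here is a minimal Python sketch of how duplicated questions might be dropped and cited articles aligned to the collected corpus. It is an illustration only, not the authors' pipeline: the `Triplet` fields, the article ID format, and the whitespace-and-case normalization are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One question-answer-article record (field names are hypothetical)."""
    question: str
    answer: str
    article_ids: tuple[str, ...]  # IDs of the statutory articles cited in the answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions compare equal."""
    return " ".join(text.lower().split())


def filter_triplets(triplets: list[Triplet], corpus_ids: set[str]) -> list[Triplet]:
    """Drop duplicated questions and keep only citations found in the corpus."""
    seen: set[str] = set()
    kept: list[Triplet] = []
    for t in triplets:
        key = normalize(t.question)
        if key in seen:
            continue  # an equivalent question was already kept
        aligned = tuple(a for a in t.article_ids if a in corpus_ids)
        if not aligned:
            continue  # no citation could be aligned to a collected article
        seen.add(key)
        kept.append(Triplet(t.question, t.answer, aligned))
    return kept
```

A real pipeline would likely need fuzzier duplicate detection and a parser for legal citations in free text; this sketch assumes the citations have already been extracted as IDs.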

A crucial phase was expert validation. A professional annotation team, comprising senior law students supervised by an experienced legal expert, meticulously reviewed and refined the collected answers and their associated legal articles. Each question-answer pair was independently annotated by two annotators to minimize bias, and a legal expert conducted the final quality assurance, checking for clarity, validity (ensuring references to current statutory articles), and fluency of the answers.

Finally, the verified data, consisting of 3,129 high-quality triplets, was partitioned into training, validation, and testing subsets to facilitate model development and evaluation.
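The partitioning step can be sketched as a seeded shuffle followed by two cuts. The article does not state the split ratios, so the 80/10/10 proportions and the seed below are assumptions chosen purely for illustration.

```python
import random


def split_dataset(items, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle a copy of items and cut it into train/validation/test lists.

    The fractions are hypothetical defaults, not the paper's actual split.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains after the first two cuts
    return train, valid, test


# Stand-in for the 3,129 verified triplets: integer indices instead of real records.
train, valid, test = split_dataset(range(3129))
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three subsets stable across runs, which matters when different models are compared on the same test set.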

Key Findings from Experiments

The researchers conducted extensive experiments to establish strong baselines on the VLQA dataset, evaluating various state-of-the-art retrieval and legal question-answering methods. For legal article retrieval, models like BGE-m3 showed strong performance in zero-shot settings, while fine-tuned models like mBERT significantly outperformed zero-shot baselines, underscoring the importance of domain adaptation for effective legal information retrieval.
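Retrieval quality of this kind is commonly summarized with recall@k: the share of a question's relevant articles that show up in the top k results. The article does not name the metrics used, so the sketch below assumes recall@k as a representative choice; the function names and article IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant articles that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per question."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores)


# Two toy questions with made-up article IDs.
runs = [
    (["art-3", "art-7", "art-1"], {"art-3", "art-1"}),  # one of two relevant in top 2
    (["art-9", "art-2", "art-4"], {"art-2"}),           # the only relevant one in top 2
]
print(mean_recall_at_k(runs, k=2))
```

Comparing a zero-shot retriever with a fine-tuned one then amounts to running both over the same questions and comparing the averaged scores.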

In legal question answering, both extractive and generative models were benchmarked. Extractive models, which pull answers directly from text, generally performed well on lexical overlap metrics. Generative models, which create new answers, excelled in contextual semantic similarity. Interestingly, a smaller, fine-tuned generative model like BARTpho achieved performance comparable to much larger LLMs, further highlighting the benefits of domain-specific training.
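To see what a lexical overlap metric measures, consider token-level F1 between a predicted answer and a reference answer. This is one common overlap measure and is used here as an assumption; the article does not say which metrics the paper reports. The sketch splits on whitespace, which is a simplification for Vietnamese, where a single word can span several space-separated syllables.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, using whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a perfect match; one empty counts as no match.
        return float(pred == ref)
    # Multiset intersection: each token counts at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A score like this rewards copying the reference wording, which helps explain why extractive models tend to do well on it, while a fluent paraphrase can score low despite being correct. That is the gap semantic similarity metrics are meant to cover.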

The study also evaluated the in-context learning capabilities of recent LLMs, including open-weight models like Qwen2.5 and commercial offerings like GPT-4o and DeepSeek-V3. GPT-4o-mini consistently achieved high scores, but the research revealed a significant caveat: although LLMs can generate fluent and well-structured responses, human evaluation often uncovered factual inaccuracies, incompleteness, or even ‘hallucinated’ elements not present in the source text. This disparity between superficial language generation and robust legal reasoning indicates a substantial area for future improvement in AI for legal applications.

Looking Ahead

The introduction of the VLQA dataset marks a significant step forward for legal NLP in Vietnam and for low-resource languages globally. By making this dataset publicly available, the researchers aim to foster further innovation in legal AI. The findings from their comprehensive evaluations highlight both the potential and the current limitations of state-of-the-art models in handling the nuances of legal information. Future work will focus on developing robust, end-to-end frameworks capable of processing lengthy legal texts, addressing complex legal queries, and meeting real-world application needs, ultimately contributing to more accessible and trustworthy legal assistance tools. You can find more details about this research in the paper available at arXiv:2507.19995.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
