TLDR: The ‘Align-then-Slide’ framework addresses the challenges of evaluating document-level machine translation (doc-MT) from large language models (LLMs). It first ‘Aligns’ source and target documents by inferring sentence-level correspondences and reconstructing the target to match the source sentence count, handling omissions and complex mappings. Then, it uses an ‘n-Chunk Sliding Evaluate’ stage to calculate multi-granularity metric scores, effectively assessing quality across different chunk sizes. Experiments show high correlation with human judgments and demonstrate its utility in guiding reinforcement learning for improved translation quality.
Large language models (LLMs) have brought about significant advancements in document-level machine translation (doc-MT), allowing for translations that are not only fluent but also maintain the overall meaning and coherence of an entire document. However, evaluating these whole-document translations has posed a challenge for existing methods. Traditional evaluation metrics often assume a perfect sentence-by-sentence alignment between the source and translated texts, a scenario that rarely holds true in real-world translations.
In practice, document translations can omit sentences entirely, merge multiple source sentences into a single target sentence (many-to-one mapping), or expand a single source sentence into several target sentences (one-to-many mapping). Different translation systems may also produce different numbers of target sentences for the same source document, which makes direct comparison between systems difficult.
Introducing Align-then-Slide
To address these challenges, researchers from Huawei Translation Services Center have introduced a novel evaluation framework called Align-then-Slide. This framework provides a comprehensive and robust way to assess the quality of ultra-long document-level machine translations. The approach unfolds in two distinct stages:
Stage 1: Align
The first stage, ‘Align’, focuses on establishing a precise sentence-level correspondence between the source document and its translation. It begins by independently segmenting both the original and translated texts into individual sentences. Next, a similarity matrix is built by calculating sentence-level alignment scores using reference-free metrics like COMET-Kiwi or LaBSE. This matrix helps identify how well each source sentence matches each target sentence.
A dynamic programming (DP) algorithm is then employed to find the optimal alignment path. This path effectively maps source sentences to their best-matched target sentences. Crucially, the target sequence is then reconstructed to exactly match the number of sentences in the source document. This reconstruction involves inserting placeholder sentences for any source sentences that were omitted in the translation and concatenating target sentences where a single source sentence resulted in multiple translated sentences. This process ensures a one-to-one conceptual alignment, neutralizing length discrepancies across different translation systems.
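To make the Align stage more concrete, here is a minimal sketch of a monotone DP alignment followed by target reconstruction. It is an illustration under stated assumptions, not the authors' implementation: the similarity matrix is taken as given (in the framework it would come from a metric such as COMET-Kiwi or LaBSE), every target sentence is assigned to exactly one source sentence, and the names `align`, `reconstruct`, and `PLACEHOLDER` are hypothetical.

```python
from typing import List

# Assumption: an empty string stands in for an omitted translation.
PLACEHOLDER = ""


def align(sim: List[List[float]]) -> List[List[int]]:
    """Assign each target sentence to one source sentence, in order.

    sim[i][j] is the similarity between source i and target j.
    Returns, for each source index, the (possibly empty) list of
    target indices assigned to it.
    """
    m = len(sim)
    n = len(sim[0]) if m else 0
    neg_inf = float("-inf")
    # dp[i][j]: best total similarity for targets 0..j-1 using sources 0..i-1.
    dp = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    take = [[False] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            skip = dp[i - 1][j]                     # source i-1 receives none of these targets
            use = dp[i][j - 1] + sim[i - 1][j - 1]  # target j-1 is assigned to source i-1
            if use >= skip:
                dp[i][j], take[i][j] = use, True
            else:
                dp[i][j] = skip
    # Backtrace from the bottom-right corner to recover the assignment.
    groups: List[List[int]] = [[] for _ in range(m)]
    i, j = m, n
    while i > 0 and j > 0:
        if take[i][j]:
            groups[i - 1].append(j - 1)
            j -= 1
        else:
            i -= 1
    for g in groups:
        g.reverse()
    return groups


def reconstruct(targets: List[str], groups: List[List[int]]) -> List[str]:
    """Rebuild the target so it has exactly one entry per source sentence."""
    return [" ".join(targets[j] for j in g) if g else PLACEHOLDER for g in groups]


# Toy case: source 1 is omitted, source 2 is translated as two sentences.
sim = [
    [0.9, 0.1, 0.1],
    [0.1, 0.2, 0.1],
    [0.1, 0.8, 0.7],
]
groups = align(sim)
print(groups)                                   # [[0], [], [1, 2]]
print(reconstruct(["t0", "t1", "t2"], groups))  # ['t0', '', 't1 t2']
```

In this sketch a many-to-one translation ends up attached to only one of its source sentences, and the other source sentence receives the placeholder. That matches the behaviour the second stage is designed to compensate for.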
Stage 2: n-Chunk Sliding Evaluate
Once the sentence-level alignment is established, the second stage, ‘n-Chunk Sliding Evaluate’, performs a multi-granularity quality assessment. This stage extends the concept of n-grams (common in text analysis) to document chunks. It calculates averaged metric scores for spans of 1, 2, 3, and 4 consecutive sentences (chunks). A sliding window, with a fixed stride of one, moves across the document, evaluating each chunk.
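The sliding evaluation can be sketched in a few lines. This is a hedged illustration, not the paper's code: `metric` is a hypothetical callable that scores a source chunk against a target chunk (in practice a reference-free metric would fill this role), the function name `n_chunk_scores` is made up for the example, and chunk sizes larger than the document are simply skipped, which is an assumption.

```python
from typing import Callable, Dict, List


def n_chunk_scores(
    src: List[str],
    tgt: List[str],
    metric: Callable[[str, str], float],
    max_n: int = 4,
) -> Dict[int, float]:
    """Average metric score per chunk size n, using a sliding window with stride 1.

    src and tgt must already be aligned one-to-one (same length);
    empty strings in tgt are treated as placeholders for omissions.
    """
    if len(src) != len(tgt):
        raise ValueError("src and tgt must have the same number of sentences")
    scores: Dict[int, float] = {}
    for n in range(1, max_n + 1):
        window_scores = [
            metric(
                " ".join(src[k:k + n]),
                " ".join(t for t in tgt[k:k + n] if t),  # drop placeholders when joining
            )
            for k in range(len(src) - n + 1)
        ]
        if window_scores:  # skip chunk sizes longer than the document
            scores[n] = sum(window_scores) / len(window_scores)
    return scores


def toy_metric(src_chunk: str, tgt_chunk: str) -> float:
    """Toy stand-in for a real metric: 1.0 if any translation is present, else 0.0."""
    return 1.0 if tgt_chunk else 0.0


src = ["s0", "s1", "s2"]
tgt = ["t0", "", "t2"]  # the middle sentence was marked as missing by the Align stage
print(n_chunk_scores(src, tgt, toy_metric))
```

With the toy metric, the 1-chunk score is penalized by the placeholder, while the 2-chunk and 3-chunk windows each contain at least one real translation and score 1.0. How the per-size scores are combined into a single number is not specified here; averaging them would be one simple choice.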
The multi-chunk approach is particularly important for handling many-to-one mappings. While the ‘Align’ stage might initially mark a source sentence as missing if it shares a translation with another, larger chunks (e.g., 2-chunk, 3-chunk) can group adjacent source sentences. This allows for a more accurate re-matching with the translated text, ensuring that such complexities do not unfairly penalize the translation quality. This hierarchical, chunk-based evaluation provides a comprehensive assessment, sensitive to both fine-grained omissions and broader contextual coherence.
Validation and Impact
The authors validate Align-then-Slide on two test sets. On the WMT 2020 Chinese→English benchmark, the framework’s system-level rankings reached a Pearson correlation of 0.929 with expert-based MQM (Multidimensional Quality Metrics) scores. On a newly curated real-world test set covering various LLMs, it again agreed strongly with human judgments, with a Pearson correlation of 0.943.
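For readers unfamiliar with the statistic, system-level Pearson correlation compares one metric score per system with one human score per system. The sketch below shows the standard formula. The function name `pearson` and the input lists are illustrative only and contain no data from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for three hypothetical systems (not results from the paper).
metric_scores = [1.0, 2.0, 3.0]
human_scores = [2.0, 4.0, 6.0]
print(pearson(metric_scores, human_scores))
```

A value close to 1 means the metric ranks and spaces the systems much as the human scores do.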
Beyond evaluation, the framework also proves actionable in improving translation systems. The preference data generated by Align-then-Slide can be used directly to train models with preference-optimization and reinforcement learning methods such as CPO (Contrastive Preference Optimization) and GRPO (Group Relative Policy Optimization). Human evaluations confirmed that systems trained with the guidance of Align-then-Slide significantly outperformed traditional supervised fine-tuning (SFT) baselines, yielding translations that humans preferred. This highlights its utility not just for assessment but also for steering the development of higher-quality document-level MT systems.
The research also demonstrated the framework’s robustness, showing consistent results regardless of the specific sentence pre-segmentation tools (like spaCy or ersatz) or alignment models (like COMET-Kiwi or LaBSE) used within its ‘Align’ stage. While the computational cost for very long documents remains a consideration, the Align-then-Slide framework represents a significant step forward in accurately and comprehensively evaluating the complex outputs of modern document-level machine translation systems.


