TL;DR: This research introduces a new framework to evaluate how large language models (LLMs) understand logical relationships across different languages, including mixed-language (code-switched) scenarios. Using synthetic, logic-based data translated into diverse languages, the study found that code-switching surprisingly does not harm, and can even improve, LLM performance, suggesting that translation variations might act as a beneficial signal for robustness.
Large language models (LLMs) are becoming increasingly common in diverse language settings. However, a crucial question remains: how well do these models maintain consistent, logically sound understanding across different languages? A new research paper tackles this underexplored area by introducing a controlled evaluation framework for multilingual natural language inference (NLI).
A New Approach to Evaluating LLMs
Natural Language Inference (NLI) is a fundamental task in natural language understanding, where a model determines if a hypothesis is entailed by, contradicts, or is neutral with respect to a given premise. This task is excellent for testing a model’s deep reasoning capabilities. While NLI has been widely used to assess LLMs, evaluations have largely focused on high-resource languages like English, often within broader tasks like question answering, which limits insights into how inference capabilities transfer across languages under controlled conditions.
To address this gap, the researchers developed a synthetic multilingual NLI framework. This framework stress-tests cross-lingual semantic alignment using deterministic, logic-based templates that encode entailment, contradiction, and neutrality. This unique approach separates the logical structure from linguistic and cultural biases, avoiding annotation noise and enabling direct, large-scale evaluation. The core contributions include a logic-driven method for generating synthetic multilingual NLI datasets, an automated evaluation protocol for measuring cross-lingual consistency, and empirical evidence of systematic weaknesses in multilingual alignment across various models and languages.
How the Study Was Conducted
The methodology involved several key steps. First, a synthetic English NLI dataset was created using hand-crafted templates based on abstract quantifier patterns. These templates were populated with semantically coherent noun phrases to ensure plausibility. This design allowed for precise control over compositional structure and minimized linguistic noise, isolating reasoning ability from lexical variation.
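To make the template idea concrete, here is a minimal sketch of logic-driven generation from abstract quantifier patterns. The specific templates and noun-phrase pairs below are illustrative assumptions; the paper's actual templates are not reproduced here.

```python
# Sketch of logic-driven NLI generation from quantifier templates.
# TEMPLATES and NOUN_PAIRS are hypothetical examples, not the paper's own.
import itertools

# Each template deterministically encodes a logical relation between a
# quantified premise and a hypothesis (entailment assumes existential import).
TEMPLATES = [
    # (premise pattern, hypothesis pattern, gold label)
    ("All {A} are {B}.", "Some {A} are {B}.", "entailment"),
    ("No {A} are {B}.", "Some {A} are {B}.", "contradiction"),
    ("Some {A} are {B}.", "All {A} are {B}.", "neutral"),
]

# Semantically coherent noun-phrase pairs keep the sentences plausible.
NOUN_PAIRS = [("dogs", "animals"), ("roses", "flowers"), ("cars", "vehicles")]

def generate_examples():
    """Populate every template with every noun pair, yielding labeled NLI triples."""
    examples = []
    for (prem, hyp, label), (a, b) in itertools.product(TEMPLATES, NOUN_PAIRS):
        examples.append({
            "premise": prem.format(A=a, B=b),
            "hypothesis": hyp.format(A=a, B=b),
            "label": label,
        })
    return examples

examples = generate_examples()
print(len(examples))  # 3 templates x 3 noun pairs = 9 examples
```

Because the gold label is fixed by the template, no human annotation is needed and the label is guaranteed to be correct by construction.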
Next, this English dataset was automatically translated into a typologically and script-diverse set of target languages using high-performance neural machine translation systems. The selected languages included Arabic (ar), German (de), French (fr), Hindi (hi), and Swahili (sw), covering both high- and low-resource settings and various language families and scripts. This diversity helped uncover weaknesses that might be hidden in more homogeneous evaluations.
A crucial aspect of the study was the introduction of a “code-switching” condition. In this setup, the premise and hypothesis were presented in different languages. For example, a premise in English might be paired with a hypothesis in Hindi. This allowed the researchers to evaluate whether models could maintain semantic accuracy under mixed-lingual input, a common but rarely systematically assessed phenomenon in multilingual communication.
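With six languages (English plus the five translation targets), the study's 36 language pairings can be read as the full grid of ordered premise/hypothesis language combinations. The enumeration below is a sketch of that grid; the pairing logic is an inference from the reported counts, not code from the paper.

```python
# Enumerating premise/hypothesis language pairings for the evaluation grid.
# Language codes are those named in the study; the grid construction itself
# is an assumption based on the reported 36 pairings.
from itertools import product

LANGS = ["en", "ar", "de", "fr", "hi", "sw"]

# Every ordered (premise_lang, hypothesis_lang) combination: 6 x 6 = 36.
# Pairs with equal languages are the monolingual baselines; the remaining
# 30 mixed pairs are the code-switched conditions.
pairings = list(product(LANGS, LANGS))
code_switched = [(p, h) for p, h in pairings if p != h]

print(len(pairings))       # 36
print(len(code_switched))  # 30
```

An example such as an English premise paired with a Hindi hypothesis corresponds to the `("en", "hi")` entry of this grid.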
Six multilingual instruction-tuned LLMs were evaluated: Fanar-9B, Gemma-7B, LLaMA-3-8B, Mistral-7B-v0.3, Phi-4, and Qwen3-7B. These models were chosen for their diversity in architecture, size, and training data. All models were tested in a zero-shot setting, meaning they received no task-specific fine-tuning. The evaluation covered 36 language pairings, with 1,000 examples per pairing, balanced across the three NLI labels.
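A zero-shot evaluation loop of this kind might look like the following sketch. `query_model` stands in for whatever inference API serves each LLM, and the prompt wording is an assumption, not the paper's exact prompt.

```python
# Minimal zero-shot NLI evaluation sketch. `query_model` is a placeholder
# callable (prompt -> response string); the prompt text is hypothetical.
LABELS = ("entailment", "contradiction", "neutral")

def build_prompt(premise: str, hypothesis: str) -> str:
    """Zero-shot NLI prompt: no in-context examples, no fine-tuning."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail, contradict, or stay neutral toward the "
        "hypothesis? Answer with one word: entailment, contradiction, or neutral."
    )

def parse_label(response: str) -> str:
    """Map a free-form model response onto one of the three NLI labels."""
    lowered = response.lower()
    for label in LABELS:
        if label in lowered:
            return label
    return "neutral"  # fallback when no label is recognized

def accuracy(dataset, query_model) -> float:
    """Fraction of examples where the parsed prediction matches the gold label."""
    correct = sum(
        parse_label(query_model(build_prompt(ex["premise"], ex["hypothesis"])))
        == ex["label"]
        for ex in dataset
    )
    return correct / len(dataset)

# Demo with a trivial stub model that always answers "entailment":
stub = lambda prompt: "entailment"
data = [{"premise": "All dogs are animals.",
         "hypothesis": "Some dogs are animals.",
         "label": "entailment"}]
print(accuracy(data, stub))  # 1.0
```

Running this loop over all 36 pairings, with 1,000 label-balanced examples each, yields the per-pairing accuracies the study compares.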
Surprising Findings: Code-Switching Can Improve Performance
The results revealed several clear patterns. In monolingual settings (where premise and hypothesis are in the same language), Fanar-9B consistently achieved the highest accuracy across all languages, while Gemma-7B generally recorded the lowest. English typically yielded the highest accuracy, followed by French and German, though some models like LLaMA-3-8B showed minimal variance across languages. Interestingly, Swahili, despite being a lower-resource language, did not consistently underperform, sometimes matching Indo-European languages in accuracy for models like Fanar-9B and Gemma-7B.
Perhaps the most surprising finding came from the code-switching conditions. Several models actually outperformed their monolingual baselines in specific code-switched configurations. For instance, Gemma-7B showed markedly higher accuracy on many bilingual pairs (e.g., English-Hindi) compared to its English-English performance. Similarly, Mistral-7B-v0.3 performed better on some cross-lingual inputs (e.g., Arabic-English) than on its corresponding monolingual Arabic. These patterns challenge the common assumption that semantic alignment necessarily degrades when models reason across linguistic boundaries.
The study suggests that translation-induced lexical or syntactic variation might act as a “regularization signal,” potentially improving alignment within the multilingual representation space. Accuracy gains from code-switching were unevenly distributed, with Hindi, Swahili, or Arabic as the hypothesis language sometimes yielding higher performance than English. This could be due to morphologically richer or syntactically simpler constructions in those translations, potentially benefiting models that might overfit statistical artifacts in high-resource languages.
Ensuring Data Quality
To ensure the reliability of their findings, the researchers conducted a cross-lingual analysis to verify the semantic consistency of their translated data. They visualized sentence embeddings using UMAP, showing that translations of the same sentence formed tight clusters, even across typologically distant languages. This indicated high semantic consistency, meaning the encoder mapped them to similar representations despite variations in word order, morphology, or script.
Furthermore, they assessed translation quality by computing cosine similarity scores between English sentences and their translated counterparts. Scores were consistently high across all languages, with French and German showing the strongest alignment, and even lower-resource languages like Swahili maintaining average similarities above 0.8. These results confirm that the multilingual dataset preserved logical structure and meaning, establishing a reliable basis for cross-lingual inference evaluation.
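The similarity check above amounts to computing the cosine of the angle between a multilingual encoder's embedding of an English sentence and the embedding of its translation. A self-contained sketch follows; real runs would use an actual sentence encoder, so the vectors here are toy stand-ins.

```python
# Sketch of the translation-quality check via cosine similarity.
# The embedding vectors below are toy placeholders; in practice they would
# come from a multilingual sentence encoder.
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A well-preserved translation pair should map to nearby vectors.
en_vec = [0.9, 0.4, 0.1]  # toy embedding of an English sentence
sw_vec = [0.8, 0.5, 0.2]  # toy embedding of its Swahili translation
score = cosine_similarity(en_vec, sw_vec)
print(round(score, 3))
```

In the study, average similarities above 0.8 (even for lower-resource languages like Swahili) were taken as evidence that meaning survived translation.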
Looking Ahead
This research provides a controlled and insightful evaluation of multilingual semantic alignment in LLMs. It highlights that reasoning performance in code-switched settings can surprisingly match or even exceed monolingual performance, suggesting a greater robustness in cross-lingual representations than previously recognized. The findings open new avenues for exploring code-switching as a deliberate strategy to improve reasoning performance in multilingual applications.
While the synthetic nature of the dataset allowed for precise control, future work could explore supplementing this data with more linguistically diverse or naturally occurring sentences. Additionally, while high-quality machine translation was used and assessed, future extensions might involve human verification of translations or direct generation of language-native examples by multilingual LLMs to further reduce potential translation noise. For more technical details, refer to the full research paper.