TLDR: This research introduces SDG polarity detection, a new task to determine if news text indicates positive, neutral, or negative progress towards specific Sustainable Development Goals (SDGs). The study presents SDG-POD, a benchmark dataset combining human and LLM-generated annotations. It evaluates six LLMs, finding that fine-tuned models, especially QWQ-32B, perform better than zero-shot counterparts, particularly when augmented with synthetic data. The task remains challenging but fine-tuning significantly reduces critical misclassifications, offering valuable tools for sustainability monitoring.
The United Nations’ Sustainable Development Goals (SDGs) provide a crucial framework for addressing global challenges related to society, environment, and economy. While natural language processing (NLP) and large language models (LLMs) have made it easier to classify text based on its relevance to specific SDGs, understanding the direction of this relevance—whether the impact described is positive, neutral, or negative—has remained a significant challenge.
This research introduces a new task called SDG polarity detection. This task aims to determine if a text segment indicates progress towards a specific SDG or conveys an intention to achieve such progress. For example, a text discussing a new policy to reduce hunger would be positive for SDG 2 (“Zero Hunger”), while one describing an emerging famine crisis would be negative, even though both relate to the same SDG.
To support research in this area, the authors developed SDG-POD, a new benchmark dataset. This dataset combines original and synthetically generated data, specifically designed for the SDG polarity detection task. It includes 6,400 texts, each annotated with a polarity label (positive, neutral, or negative) for a given SDG. The training set of SDG-POD was automatically labeled using a majority voting system from five different LLMs, while the test set was meticulously annotated by human experts.
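The majority-voting aggregation over the five LLM annotators can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes a strict majority is required and that texts without one are left unlabeled, details the summary above does not specify.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate polarity labels from multiple LLM annotators.

    Returns the label chosen by a strict majority of annotators,
    or None when no label wins more than half the votes
    (assumed tie-breaking behaviour, not confirmed by the paper).
    """
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Five hypothetical LLM annotations for one text/SDG pair:
print(majority_vote(["positive", "positive", "neutral",
                     "positive", "negative"]))   # -> positive
print(majority_vote(["positive", "negative", "neutral",
                     "positive", "negative"]))   # -> None (no majority)
```

Requiring a strict majority rather than a simple plurality trades coverage for label reliability, which matters when the aggregated labels become training data.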
The study performed a comprehensive evaluation of six state-of-the-art LLMs, testing them in both zero-shot (without task-specific training) and fine-tuned configurations. The results indicate that the task remains challenging for current LLMs. However, some fine-tuned models, particularly QWQ-32B, achieved strong performance, especially on specific SDGs such as SDG-9 (Industry, Innovation and Infrastructure), SDG-12 (Responsible Consumption and Production), and SDG-15 (Life on Land).
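A zero-shot setup for this task might be prompted along the following lines. The template below is purely illustrative; the paper's actual prompt wording, model interface, and output parsing are not described in this summary.

```python
# Hypothetical zero-shot prompt template for SDG polarity detection.
PROMPT_TEMPLATE = """You are monitoring progress on the UN Sustainable \
Development Goals. Given the news excerpt below, decide whether it \
indicates positive, neutral, or negative progress towards {sdg}.
Answer with exactly one word: positive, neutral, or negative.

Excerpt: {text}
Answer:"""

def build_prompt(text: str, sdg: str) -> str:
    """Fill the template with a text segment and a target SDG."""
    return PROMPT_TEMPLATE.format(text=text, sdg=sdg)

prompt = build_prompt(
    "A new irrigation programme cut regional crop failures sharply.",
    "SDG 2 (Zero Hunger)",
)
print(prompt)
```

Constraining the model to a one-word answer is a common way to make free-form LLM output easy to map onto a fixed label set.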
A key finding was that augmenting the fine-tuning dataset with synthetically generated examples significantly improved model performance. This highlights the effectiveness of data enrichment techniques in domains where annotated data is scarce. The researchers used an innovative method to generate synthetic data, integrating outputs from multiple LLMs and applying a majority voting strategy to ensure reliability.
The paper emphasizes that SDG polarity detection is distinct from traditional sentiment analysis. While sentiment analysis focuses on the emotional tone of a text, polarity detection evaluates the actual impact of described actions in relation to sustainability targets. A text could have a positive sentiment but convey negative polarity regarding an SDG, or vice versa.
The evaluation also revealed that fine-tuned models not only achieved higher overall performance but also demonstrated greater robustness. They significantly reduced “critical errors,” such as incorrectly predicting negative labels as positive and vice versa, which are more severe than confusing them with a neutral class. This improvement was particularly evident when using error-weighted F1 metrics, which assign heavier penalties to these critical misclassifications.
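An error-weighted score of this kind can be sketched by assigning a higher cost to sign flips (positive predicted as negative, or vice versa) than to confusions involving the neutral class. The cost values below are illustrative assumptions; the paper's exact error-weighted F1 formulation may differ.

```python
# Misclassification costs: (gold, predicted) -> penalty.
# Sign flips are assumed to cost twice as much as neutral confusions.
COST = {
    ("negative", "positive"): 1.0,  # critical error
    ("positive", "negative"): 1.0,  # critical error
    ("negative", "neutral"):  0.5,
    ("neutral",  "negative"): 0.5,
    ("neutral",  "positive"): 0.5,
    ("positive", "neutral"):  0.5,
}

def error_weighted_score(gold, pred):
    """Return a score in [0, 1]: 1.0 for perfect predictions,
    with critical sign flips penalised more heavily than
    confusions with the neutral class."""
    total = sum(COST.get((g, p), 0.0) for g, p in zip(gold, pred))
    return 1.0 - total / len(gold)

gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "positive", "neutral", "neutral"]
print(error_weighted_score(gold, pred))  # 1 - (1.0 + 0.5)/4 = 0.625
```

Under such a weighting, a model that mistakes negatives for neutrals scores higher than one that mistakes them for positives, which matches the intuition that flipping the direction of an SDG's progress is the most damaging error for downstream monitoring.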
This work advances the methodological toolkit for monitoring sustainability efforts and offers practical insights for developing efficient, high-performing polarity detection systems. The complete codebase for the experiments has been made publicly available to ensure reproducibility and encourage further community research; full details are available in the research paper.
Future research aims to expand the benchmark to include multilingual data and explore real-world deployment settings for policy monitoring, media analysis, and decision support in the sustainability domain.


