TLDR: This research paper details the practical challenges and successes of deploying and integrating AI models in a resource-constrained humanitarian setting. Focusing on a collaboration with Insecurity Insight, an H2H organization, the authors describe how they developed and deployed NLP models to improve news article classification for humanitarian aid, expanding to new domains and languages (French, Arabic). The paper highlights the importance of understanding partner needs, managing resource constraints, addressing data quality issues, and continuously monitoring model performance post-deployment, offering key takeaways for operationalizing AI for social good.
Artificial intelligence (AI) holds immense potential for addressing global challenges, often referred to as ‘AI for Good.’ While much of the focus in this field has been on developing innovative models and conducting foundational research, there’s a significant gap in understanding how these AI solutions are actually deployed, integrated, and maintained in real-world, often resource-constrained, environments. This paper, titled Operationalizing AI for Good: Spotlight on Deployment and Integration of AI Models in Humanitarian Work, sheds light on this crucial, yet often overlooked, final stage of AI for Good projects.
The authors, Anton Abilov, Ke Zhang, Hemank Lamba, Elizabeth M. Olson, Joel Tetreault, and Alex Jaimes from Dataminr, Inc., detail their collaborative experience with Insecurity Insight, a humanitarian-to-humanitarian (H2H) organization. Insecurity Insight provides data-driven intelligence reports to aid agencies and other civil organizations, supporting resource allocation, humanitarian response, and advocacy. Before this partnership, their workflow involved collecting news articles, classifying them for relevance and category using a support vector machine (SVM) model, and then having humanitarian experts review and summarize them. However, this system was limited to existing humanitarian categories and processed only English articles.
The collaboration aimed to address three key goals: improving the existing workflow for identifying and classifying relevant news events, expanding into the new domain of food security, and extending coverage to French and Arabic articles. A significant challenge was the resource-constrained environment of Insecurity Insight, which had limited humanitarian experts for labeling, a low-compute infrastructure (Heroku Basic dyno, a VPS machine, and MongoDB), and minimal engineering staff for maintenance. This meant the AI solution needed to be robust, easily maintainable, and avoid incurring significant additional costs.
The project followed standard machine learning operations (MLOps) practices, splitting model development into three stages: offline experimentation, staging deployment calibration, and post-deployment monitoring.
Offline Experimentation
To expand data sources and languages, the team augmented existing data with GDELT, a large, real-time open-source database of multilingual news articles. Data labeling was a collaborative effort with Insecurity Insight’s humanitarian experts, who reviewed articles for relevance and assigned event categories. Given limited resources, a diverse sample of the data was annotated. For model development, two models were trained: a Relevance Model to identify relevant news articles and a Categorization Model to tag articles with humanitarian categories, including the new food security domain. Smaller multilingual transformer models from the BERT, RoBERTa, and DistilBERT families were evaluated due to compute and latency constraints. XLM-RoBERTa was ultimately selected for deployment because of its strong performance across languages and the new input domain.
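The two-stage design described above can be sketched as follows. This is a minimal illustration only: the fine-tuned XLM-RoBERTa models are replaced by toy keyword scorers, and the threshold value, category names, and function names are assumptions for the sketch, not details from the paper.

```python
# Sketch of the two-stage pipeline: a Relevance Model gates articles,
# then a Categorization Model tags the ones that pass. Toy keyword
# scorers stand in for the fine-tuned transformers.

RELEVANCE_THRESHOLD = 0.5  # calibrated later against labeling capacity

def relevance_score(text):
    """Stand-in for the relevance classifier's probability output."""
    keywords = {"aid", "attack", "famine", "hospital", "convoy"}
    hits = sum(1 for w in text.lower().split() if w in keywords)
    return min(1.0, hits / 2)

def categorize(text):
    """Stand-in for the multi-label categorization model."""
    rules = {
        "aid security": {"attack", "convoy"},
        "health care": {"hospital"},
        "food security": {"famine"},  # the new domain added in this work
    }
    words = set(text.lower().split())
    return [cat for cat, kw in rules.items() if words & kw]

def process(article):
    """Return category tags for relevant articles, None otherwise."""
    if relevance_score(article) < RELEVANCE_THRESHOLD:
        return None
    return categorize(article)

print(process("famine worsens as aid convoy blocked"))
# → ['aid security', 'food security']
```

Only articles passing the relevance gate reach the categorizer and, ultimately, human review, which is what makes the threshold choice below so consequential.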
Staging Deployment Calibration
A critical step was a pre-deployment test in a staging environment. The new system, integrating GDELT and the selected model, ran in parallel with the existing production system for two weeks. This allowed evaluation of ‘live’ performance and identification of potential issues such as content drift or mismatches between offline and online environments. A key task was tuning the relevance classification threshold. The initial baseline threshold would have led to a 20x increase in articles for human review, which was unsustainable. Through discussions with the partner, a new threshold was selected that reduced the expected increase in labeling burden to 8x, balancing recall and precision against the partner’s operational capacity. Similar analyses were performed for French and Arabic articles.
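The calibration step amounts to a constrained search: among candidate thresholds, pick the lowest one (for maximum recall) whose expected review volume fits the partner's labeling budget. A minimal sketch, where the scores, baseline volume, and capacity multiplier are illustrative rather than the paper's actual numbers:

```python
# Pick the lowest relevance threshold whose expected number of flagged
# articles stays within a multiple of the baseline review volume.

def pick_threshold(scores, baseline_volume, max_multiplier):
    """Lowest threshold (highest recall) keeping flagged articles
    within max_multiplier x baseline_volume."""
    budget = baseline_volume * max_multiplier
    for t100 in range(101):          # scan thresholds 0.00 .. 1.00
        t = t100 / 100
        flagged = sum(1 for s in scores if s >= t)
        if flagged <= budget:
            return t
    return 1.0

# 1,000 deterministic toy scores, uniform over 0.00 .. 0.99
staging_scores = [(i % 100) / 100 for i in range(1000)]

# Baseline of 25 reviewed articles, capacity for an 8x increase
print(pick_threshold(staging_scores, baseline_volume=25, max_multiplier=8))
# → 0.8
```

Raising the capacity multiplier lowers the selected threshold (more recall, more review work), which is exactly the trade-off the authors negotiated with Insecurity Insight.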
Post-Deployment Analysis
Four months after deployment, the new system demonstrated significant impact. It surfaced 3.6x more confirmed relevant articles compared to the baseline, with a 3.2x increase in manual labeling effort. The system’s precision improved, aligning with pre-deployment estimates. The GDELT source expansion led to a 23x increase in crawled articles, and the updated classifier predicted 9x more articles as relevant. Notably, a significant number of confirmed relevant articles were surfaced in French and Arabic (42% of the total baseline volume), achieving a key goal of language expansion.
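The reported multipliers also imply how review precision changed. Assuming labeling effort scales linearly with the number of articles reviewed (an assumption of this sketch, not a claim from the paper), the share of reviewed articles confirmed relevant rose by the ratio of the two figures:

```python
# Confirmed relevant articles rose 3.6x while manual review volume
# rose 3.2x, so the fraction of reviewed articles that turn out
# relevant rose by their ratio.

relevant_multiplier = 3.6   # confirmed relevant articles vs. baseline
effort_multiplier = 3.2     # articles manually reviewed vs. baseline

precision_ratio = relevant_multiplier / effort_multiplier
print(f"{precision_ratio:.3f}x review precision "
      f"({(precision_ratio - 1) * 100:.1f}% relative improvement)")
# → 1.125x review precision (12.5% relative improvement)
```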
However, challenges emerged. The food security category saw only a marginal increase in surfaced articles, and its F1 score dropped significantly between offline evaluation and production. This was attributed to missing labels and annotation inconsistencies stemming from unclear guidance. Additionally, the relevance model’s performance showed a drop over time across all languages, indicating a risk of degradation due to shifts in live data distribution. To counter this, workflows for continuous monitoring and retraining based on new labeled data were provided to the partner.
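A monitoring workflow like the one handed to the partner can be sketched as a simple check on batches of newly labeled articles: track precision per batch and flag the model for retraining when it stays below a floor. The batch data, floor value, and function names here are illustrative assumptions, not the paper's actual procedure.

```python
# Flag a deployed classifier for retraining when review precision
# stays below a floor for consecutive labeling batches.

def batch_precision(labels):
    """Precision over one review batch: labels are expert verdicts
    (True = confirmed relevant) for articles the model flagged."""
    return sum(labels) / len(labels) if labels else 0.0

def needs_retraining(batches, floor=0.6, window=2):
    """Return True if precision is below `floor` for the last
    `window` consecutive batches (a simple drift signal)."""
    recent = [batch_precision(b) for b in batches[-window:]]
    return len(recent) == window and all(p < floor for p in recent)

weekly_batches = [
    [True, True, True, False],    # week 1: precision 0.75
    [True, True, False, False],   # week 2: precision 0.50
    [True, False, False, False],  # week 3: precision 0.25
]
print(needs_retraining(weekly_batches))
# → True (weeks 2 and 3 both fall below the 0.6 floor)
```

Requiring several consecutive low batches rather than one avoids retraining on a single noisy week while still catching the sustained drops described above.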
Key Takeaways for AI for Good Projects
The paper concludes with five crucial takeaways for practitioners and NGOs involved in AI for Good initiatives:
- Understanding the Problem: Deeply understanding the problem, stakeholder needs, and operational constraints is vital before model development.
- Data Availability and Quality: Assessing data availability, reliability, and bias, and establishing clear annotation guidelines are crucial, while being mindful of domain experts’ time.
- Capacity Building: Partner organizations must be equipped to use and maintain AI solutions, requiring continuous engagement and support mechanisms.
- Model Performance Mismatch Awareness: Expect discrepancies between offline evaluations and real-world performance. Staged testing environments are essential for validation and refinement.
- Impact Assessment and Continuous Monitoring: Clear metrics are needed to measure success, and AI solutions require regular evaluation for performance drift, with automated monitoring and continuous calibration of labeling quality.
This work provides a valuable case study on the practicalities of deploying AI models in humanitarian settings, highlighting the importance of collaboration, adaptability, and continuous monitoring to ensure sustainable and impactful solutions.