TLDR: This research introduces a new framework for creating high-quality, human-annotated datasets for suicidal ideation detection, specifically developing a Turkish social media corpus. The study evaluates existing AI models across this new Turkish dataset and popular English datasets, revealing that many models struggle with reliably detecting suicidal ideation, especially when trained on automatically labeled data. It highlights the critical need for more trustworthy annotation practices and transparent model evaluation in mental health NLP to ensure AI tools are genuinely effective and ethical for suicide prevention.
Suicide remains a significant global public health concern, particularly among young adults. While artificial intelligence (AI) offers promising avenues for real-time suicide prevention, its progress is often hindered by two major challenges: a lack of diverse language coverage in datasets and unreliable data annotation practices. Most existing datasets are in English, and even within these, high-quality, human-annotated data is scarce. This often leads studies to rely on pre-labeled datasets without thoroughly examining their annotation quality or label reliability. The absence of datasets in languages other than English further limits the global impact of AI in suicide prevention.
A recent study, titled *Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation*, addresses these gaps. The researchers constructed a novel Turkish suicidal ideation corpus from social media posts and introduced a resource-efficient annotation framework involving three human annotators and two large language models (LLMs). They then evaluated label reliability and model consistency across the new Turkish dataset and three popular English suicidal ideation detection datasets, using transfer learning with eight pre-trained sentiment and emotion classifiers to assess annotation consistency and to benchmark model performance against manually labeled data.
Building a Trustworthy Turkish Dataset
The study directly tackled the scarcity of Turkish suicidal ideation corpora. The researchers collected 7,874 Turkish social media posts from Ekşi Sözlük, a prominent text-based platform in Turkey. The annotation process was designed to be both reliable and resource-efficient. Two researchers first annotated the data independently, and disagreements were resolved through a tiered decision strategy (sketched below). For non-sensitive disagreements, where the presence of suicidal ideation was not disputed, only its specific nuance, two LLMs, ChatGPT-4o and Gemini 2.5, served as tie-breakers: if an LLM’s label exactly matched one of the human annotators’ labels, it became the final label. For sensitive cases, where the annotators disagreed on the very presence of suicidal ideation, a third expert annotator made the final decision. This multi-level approach kept reliability high while using resources efficiently, and the consistency of labels across authors with multiple posts further supported the validity of the framework.
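Here is a minimal sketch of that tiered adjudication, assuming a hypothetical fine-grained label set; the label names and helper functions are illustrative, not taken from the paper:

```python
# Tiered label adjudication as described above. The label set and
# helper names are illustrative assumptions, not the paper's own.

SUICIDAL_LABELS = {"ideation", "attempt"}  # hypothetical fine-grained labels

def indicates_ideation(label: str) -> bool:
    return label in SUICIDAL_LABELS

def is_sensitive(label_a: str, label_b: str) -> bool:
    # Sensitive disagreement: the annotators differ on the very
    # *presence* of suicidal ideation, not just on its nuance.
    return indicates_ideation(label_a) != indicates_ideation(label_b)

def adjudicate(label_a, label_b, llm_labels, expert_label):
    """Resolve the final label for one post from two human annotations,
    a list of LLM tie-breaker labels, and an expert fallback."""
    if label_a == label_b:
        return label_a  # full agreement: keep the shared label

    if is_sensitive(label_a, label_b):
        return expert_label  # third expert annotator decides

    # Non-sensitive disagreement: the first LLM label that exactly
    # matches one of the human labels becomes the final label.
    for llm_label in llm_labels:
        if llm_label in (label_a, label_b):
            return llm_label

    return expert_label  # no LLM matched either annotator
```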
Evaluating English Datasets and Model Performance
To enable a cross-lingual evaluation, the researchers selected three widely used English Reddit datasets: C-SSRS, SDD, and SWMH. C-SSRS stands out because its labels were assigned by practicing psychiatrists, making it a ‘gold-standard’ dataset. In contrast, SDD and SWMH were primarily auto-labeled from the subreddit each post originated in.
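For context, subreddit-based auto-labeling of the kind behind SDD- and SWMH-style datasets can be sketched in a few lines; the subreddit-to-label mapping here is illustrative, not the datasets’ exact scheme:

```python
# Sketch of subreddit-based auto-labeling: a post's community of origin
# becomes its label, so no human ever verifies the content. The mapping
# below is illustrative, not the exact scheme used by SDD or SWMH.

SUBREDDIT_TO_LABEL = {
    "SuicideWatch": "suicidal",
    "depression": "non-suicidal",
    "teenagers": "non-suicidal",
}

def auto_label(post: dict) -> str:
    # `post` is assumed to carry the subreddit it was scraped from.
    return SUBREDDIT_TO_LABEL.get(post["subreddit"], "unknown")
```

Because the subreddit itself is the labeling signal, a model trained on such data can score well simply by recognizing community-specific phrasing.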
The evaluation covered four Turkish and four English transformer models, chosen for their diversity and popularity. The findings were revealing: the Turkish transformer models struggled to differentiate between suicidal and non-suicidal posts in the newly created Turkish dataset, indicating that they need fine-tuning on nuanced, context-sensitive Turkish mental health data.
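As an illustration of this style of transfer-learning evaluation, the sketch below runs a publicly available pre-trained Turkish sentiment classifier over annotated posts with Hugging Face Transformers; the checkpoint is one public example and may differ from the models the paper actually evaluated:

```python
# Run an off-the-shelf Turkish classifier over annotated posts, then
# compare its outputs to the human labels. The checkpoint below is a
# public example, not necessarily one of the paper's eight models.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="savasy/bert-base-turkish-sentiment-cased",
)

posts = ["örnek gönderi metni ..."]  # annotated Turkish posts
for post, pred in zip(posts, classifier(posts, truncation=True)):
    print(pred["label"], round(pred["score"], 3), post[:60])
```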
Even more striking were the results on the English datasets. Although the C-SSRS dataset was part of the fine-tuning set for one of the suicidal ideation detection models (SENTINET), both SENTINET and RoBERTa performed with near-random accuracy on it. Yet these same models achieved very high F1 and AUC scores (over 96% and 99%, respectively) on the auto-labeled SDD and SWMH datasets. This stark contrast suggests the models were not learning the underlying cues of suicidal ideation but superficial patterns, such as subreddit identifiers or community-specific phrasing, the very signals used to generate the labels in the auto-labeled datasets.
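For readers who want to reproduce this style of benchmarking, the reported metrics reduce to two scikit-learn calls against gold labels; the toy labels and scores below are illustrative only:

```python
# Compute F1 and ROC AUC from model scores against gold labels.
# The toy data below is illustrative, not from the paper.
from sklearn.metrics import f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]                   # gold labels (1 = suicidal ideation)
y_score = [0.91, 0.12, 0.78, 0.66, 0.40]   # model probability for class 1
y_pred = [int(s >= 0.5) for s in y_score]  # threshold at 0.5

print("F1 :", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
```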
Implications for Trustworthy AI in Mental Health
The study’s findings underscore the need for more rigorous, language-inclusive approaches to annotation and evaluation in mental health natural language processing (NLP). The work challenges the common practice of relying on auto-labeled datasets for suicidal ideation detection and of using off-the-shelf fine-tuned models without validating their training and fine-tuning sets. The researchers argue that models performing well on auto-labeled data may merely be identifying the source of a post rather than actual suicidal ideation.
The authors advocate for greater transparency in model training pipelines and dataset construction practices in mental health NLP. They emphasize that while AI systems for suicidal ideation detection are crucial for future prevention efforts, they are meant to encourage individuals to seek professional help and should never replace clinical judgment. The study serves as a blueprint for building high-reliability mental health datasets across languages, prioritizing data and model reliability over mere scale, speed, or convenience.


