TLDR: A new study investigates whether ChatGPT-based automated coding of communication data from collaborative tasks exhibits bias across gender and racial groups. Analyzing data from negotiation, problem-solving, and decision-making tasks, the research found no significant gender bias. While an initial racial disparity appeared in one task, further analysis revealed it was due to unusually high AI-human agreement for the White reference group, not less accurate coding for Black participants. The findings suggest ChatGPT can code communication data fairly, offering potential for scalable assessment of collaboration skills, though continuous evaluation is recommended.
The rise of large language models (LLMs) like ChatGPT has opened new avenues for automating complex tasks, including the analysis of communication data. Traditionally, coding communication data from collaborative tasks, such as identifying instances of idea sharing or negotiation, has been a labor-intensive process performed by trained human raters. While previous research has shown that ChatGPT can achieve accuracy comparable to human coders on these tasks, a crucial question remained: does this automated coding exhibit bias across demographic groups?
A recent study, *Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks* by Jiangang Hao, Wenju Cui, Patrick Kyllonen, and Emily Kerzabi, delves into this very question. The researchers investigated whether ChatGPT-based automated coding of communication data shows consistent performance across gender and racial groups, using a typical coding framework for collaborative problem-solving.
Investigating Fairness Across Demographics
The study utilized data from three distinct types of collaborative tasks: negotiation, problem-solving, and decision-making. These tasks involved teams of four participants collaborating online via text chat, generating thousands of chat turns. The researchers focused on GPT-4o, a top-performing LLM, and designed specific prompts to guide the model in coding chat messages accurately based on a predefined framework that included categories like ‘Maintaining communication,’ ‘Staying on task,’ ‘Eliciting information,’ ‘Sharing information,’ and ‘Acknowledging.’
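To make the setup concrete, here is a minimal sketch of what prompt-based coding along these lines could look like with the OpenAI Python SDK. The prompt wording, the `code_turn` helper, and the zero-temperature setting are illustrative assumptions; the category labels come from the framework described above, and the study's actual prompts are not reproduced here.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = [
    "Maintaining communication", "Staying on task",
    "Eliciting information", "Sharing information", "Acknowledging",
]

# Illustrative prompt only; not the prompt used in the study.
SYSTEM_PROMPT = (
    "You are coding chat turns from a collaborative task. "
    "Assign each turn exactly one of these categories: "
    + "; ".join(CATEGORIES)
    + ". Respond with the category name only."
)

def code_turn(chat_turn: str) -> str:
    """Ask GPT-4o to label a single chat turn with one category."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the coding as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chat_turn},
        ],
    )
    return response.choices[0].message.content.strip()

print(code_turn("Does anyone know what the budget limit is this round?"))
# A turn like this would plausibly be coded as "Eliciting information".
```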
To assess fairness, the team compared the agreement between AI coding and expert human coding across different gender and racial groups. They employed two statistical approaches: a generalized linear mixed-effects model (GLMM) to account for the nested structure of the data (multiple chat turns from the same individual within teams) and Cohen’s Kappa, a widely recognized measure of inter-rater agreement for categorical items.
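Concretely, Cohen's Kappa corrects raw agreement for chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below runs both checks on synthetic stand-in data; the column names, the simulated agreement rate, and the simplified GLMM specification (per-turn agreement with a team-level random intercept) are assumptions, not the authors' exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
codes = ["Maintaining communication", "Staying on task",
         "Eliciting information", "Sharing information", "Acknowledging"]

# Synthetic stand-in: one row per chat turn with the expert's label and
# an AI label that matches it most of the time.
n = 400
human = rng.choice(codes, size=n)
ai = np.where(rng.random(n) < 0.85, human, rng.choice(codes, size=n))
df = pd.DataFrame({
    "human_code": human,
    "ai_code": ai,
    "gender": rng.choice(["female", "male"], size=n),
    "team": rng.integers(0, 25, size=n),  # nesting: turns within teams
})

# 1) Cohen's Kappa per group: chance-corrected AI-human agreement.
for group, sub in df.groupby("gender"):
    k = cohen_kappa_score(sub["human_code"], sub["ai_code"])
    print(f"{group}: kappa = {k:.3f}")

# 2) GLMM: model per-turn agreement (0/1) with a random intercept for
#    team, testing whether agreement differs by group.
df["agree"] = (df["human_code"] == df["ai_code"]).astype(int)
glmm = BinomialBayesMixedGLM.from_formula(
    "agree ~ gender", vc_formulas={"team": "0 + C(team)"}, data=df)
print(glmm.fit_vb().summary())
```

On real data one would also include a person-level random effect to capture multiple turns from the same individual, matching the nested structure the study describes.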
Key Findings: No Significant Gender Bias, Nuance in Race
The results offer encouraging insights into the fairness of ChatGPT’s coding. For gender, the analysis revealed no significant differences in AI-human agreement between male and female participants. This suggests that ChatGPT does not systematically favor one gender over the other when coding communication data in these collaborative contexts.
When examining racial groups, the initial findings also indicated no overall evidence of racial bias in AI coding. However, a closer look at task-specific interactions uncovered an interesting nuance. In the Negotiation task, the agreement between AI and human coding for Black participants appeared statistically lower than for the White reference group. Importantly, the researchers clarified that this disparity did not stem from ChatGPT coding chats from Black participants less accurately. Instead, the AI-human agreement for chats from White participants in this specific task was unusually high, even surpassing human-human agreement. This elevated baseline for the White group created the appearance of a racial disparity, rather than a systematic bias against Black participants.
The researchers hypothesize that linguistic features or conversational styles more common among White participants might have aligned more closely with patterns in ChatGPT’s training data, leading to this higher consistency. Other possibilities include the specific distribution of responses in the White group’s Negotiation data matching coding criteria more directly, or even chance sampling variation.
Implications for Scalable Assessment
Overall, the study provides robust empirical evidence that ChatGPT can code communication data accurately and fairly across the demographic groups considered. This finding paves the way for its potential adoption in large-scale assessments of collaboration and communication skills, offering a scalable and efficient alternative to traditional labor-intensive manual coding.
While the results are promising, the authors caution that careful evaluation and benchmarking are essential before deploying such AI tools for specific purposes. Factors like evolving LLM capabilities, prompt design, and the complexity of coding frameworks require continuous assessment. The study concludes that while ChatGPT can be a powerful complement to human coding, it should be used with appropriate guardrails to ensure validity, fairness, and reliability in practice, especially in high-stakes assessment contexts.