TLDR: A new study investigates whether ChatGPT-based automated coding of communication data from collaborative tasks exhibits bias across gender and racial groups. Analyzing data from negotiation, problem-solving, and decision-making tasks, the research found no significant gender bias. While an initial racial disparity appeared in one task, further analysis revealed it was due to unusually high AI-human agreement for the White reference group, not less accurate coding for Black participants. The findings suggest ChatGPT can code communication data fairly, offering potential for scalable assessment of collaboration skills, though continuous evaluation is recommended.
The rise of large language models (LLMs) like ChatGPT has opened new avenues for automating complex tasks, including the analysis of communication data. Traditionally, coding communication data from collaborative tasks, such as identifying instances of idea sharing or negotiation, has been a labor-intensive process performed by trained human raters. While previous research has shown that ChatGPT can achieve accuracy comparable to human coders on these tasks, a crucial question remained: does this automated coding exhibit bias across demographic groups?
A recent study, *Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks* by Jiangang Hao, Wenju Cui, Patrick Kyllonen, and Emily Kerzabi, delves into this very question. The researchers investigated whether ChatGPT-based automated coding of communication data shows consistent performance across gender and racial groups, using a typical coding framework for collaborative problem-solving.
Investigating Fairness Across Demographics
The study utilized data from three distinct types of collaborative tasks: negotiation, problem-solving, and decision-making. These tasks involved teams of four participants collaborating online via text chat, generating thousands of chat turns. The researchers focused on GPT-4o, a top-performing LLM, and designed specific prompts to guide the model in coding chat messages accurately based on a predefined framework that included categories like ‘Maintaining communication,’ ‘Staying on task,’ ‘Eliciting information,’ ‘Sharing information,’ and ‘Acknowledging.’
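To make the setup concrete, here is a minimal sketch of what prompt-based coding along these lines could look like with the OpenAI Python SDK. The prompt wording, the `code_turn` helper, and the zero-temperature setting are illustrative assumptions; the category labels come from the framework described above, and the study's actual prompts are not reproduced here.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = [
    "Maintaining communication", "Staying on task",
    "Eliciting information", "Sharing information", "Acknowledging",
]

# Illustrative prompt only; not the prompt used in the study.
SYSTEM_PROMPT = (
    "You are coding chat turns from a collaborative task. "
    "Assign each turn exactly one of these categories: "
    + "; ".join(CATEGORIES)
    + ". Respond with the category name only."
)

def code_turn(chat_turn: str) -> str:
    """Ask GPT-4o to label a single chat turn with one category."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the coding as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chat_turn},
        ],
    )
    return response.choices[0].message.content.strip()

print(code_turn("Does anyone know what the budget limit is this round?"))
# A turn like this would plausibly be coded as "Eliciting information".
```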
To assess fairness, the team compared the agreement between AI coding and expert human coding across different gender and racial groups. They employed two statistical approaches: a generalized linear mixed-effects model (GLMM) to account for the nested structure of the data (multiple chat turns from the same individual within teams) and Cohen’s Kappa, a widely recognized measure of inter-rater agreement for categorical items.
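Concretely, Cohen's Kappa corrects raw agreement for chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below runs both checks on synthetic stand-in data; the column names, the simulated agreement rate, and the simplified GLMM specification (per-turn agreement with a team-level random intercept) are assumptions, not the authors' exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(0)
codes = ["Maintaining communication", "Staying on task",
         "Eliciting information", "Sharing information", "Acknowledging"]

# Synthetic stand-in: one row per chat turn with the expert's label and
# an AI label that matches it most of the time.
n = 400
human = rng.choice(codes, size=n)
ai = np.where(rng.random(n) < 0.85, human, rng.choice(codes, size=n))
df = pd.DataFrame({
    "human_code": human,
    "ai_code": ai,
    "gender": rng.choice(["female", "male"], size=n),
    "team": rng.integers(0, 25, size=n),  # nesting: turns within teams
})

# 1) Cohen's Kappa per group: chance-corrected AI-human agreement.
for group, sub in df.groupby("gender"):
    k = cohen_kappa_score(sub["human_code"], sub["ai_code"])
    print(f"{group}: kappa = {k:.3f}")

# 2) GLMM: model per-turn agreement (0/1) with a random intercept for
#    team, testing whether agreement differs by group.
df["agree"] = (df["human_code"] == df["ai_code"]).astype(int)
glmm = BinomialBayesMixedGLM.from_formula(
    "agree ~ gender", vc_formulas={"team": "0 + C(team)"}, data=df)
print(glmm.fit_vb().summary())
```

On real data one would also include a person-level random effect to capture multiple turns from the same individual, matching the nested structure the study describes.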
Key Findings: No Significant Gender Bias, Nuance in Race
The results offer encouraging insights into the fairness of ChatGPT’s coding. For gender, the analysis revealed no significant differences in AI-human agreement between male and female participants. This suggests that ChatGPT does not systematically favor one gender over the other when coding communication data in these collaborative contexts.
When examining racial groups, the initial findings also indicated no overall evidence of racial bias in AI coding. However, a closer look at task-specific interactions uncovered an interesting nuance. In the Negotiation task, the agreement between AI and human coding for Black participants appeared statistically lower than for the White reference group. Importantly, the researchers clarified that this disparity did not stem from ChatGPT coding chats from Black participants less accurately. Instead, the AI-human agreement for chats from White participants in this specific task was unusually high, even surpassing human-human agreement. This elevated baseline for the White group created the appearance of a racial disparity, rather than a systematic bias against Black participants.
The researchers hypothesize that linguistic features or conversational styles more common among White participants might have aligned more closely with patterns in ChatGPT’s training data, leading to this higher consistency. Other possibilities include the specific distribution of responses in the White group’s Negotiation data matching coding criteria more directly, or even chance sampling variation.
Implications for Scalable Assessment
Overall, the study provides robust empirical evidence that ChatGPT can code communication data accurately and fairly across the demographic groups considered. This finding paves the way for its potential adoption in large-scale assessments of collaboration and communication skills, offering a scalable and efficient alternative to traditional labor-intensive manual coding.
While the results are promising, the authors caution that careful evaluation and benchmarking are essential before deploying such AI tools for specific purposes. Factors like evolving LLM capabilities, prompt design, and the complexity of coding frameworks require continuous assessment. The study concludes that while ChatGPT can be a powerful complement to human coding, it should be used with appropriate guardrails to ensure validity, fairness, and reliability in practice, especially in high-stakes assessment contexts.