
Assessing Soft Skills: How Large Language Models Are Advancing Situational Judgment Tests

TLDR: A research paper explores using Large Language Models (LLMs) to automatically identify personal and professional skill features in open-response Situational Judgment Tests (SJTs), like the Casper SJT. The study found that LLMs, especially reasoning models, can effectively extract complex features, and their performance significantly improves with detailed prompting. This approach offers a scalable solution for assessing crucial soft skills, potentially enabling automated scoring and real-time feedback.

In today’s rapidly evolving world, academic programs and employers are increasingly recognizing that personal and professional skills are just as crucial as technical expertise for success. Skills like communication, teamwork, problem-solving, and critical thinking are vital, but measuring them accurately and at scale has always been a significant challenge.

Traditionally, tools like grade point averages (GPAs) and standardized tests have focused on “hard skills.” While reference letters, personal essays, and interviews have attempted to gauge personal attributes, they often fall short of rigorous psychometric standards and face growing concerns about authenticity, especially with the rise of generative AI.

Situational Judgment Tests (SJTs) have emerged as a more reliable and standardized way to assess these “soft skills.” SJTs present individuals with hypothetical scenarios and ask how they would react. Open-response SJTs, where participants provide written or spoken answers, are particularly effective at measuring behavioral tendencies. However, these have historically relied on trained human raters, which is a labor-intensive process that makes large-scale implementation difficult.

Past attempts to automate scoring for SJTs using Natural Language Processing (NLP) faced issues because the features typically used in NLP (like grammar or coherence) don’t directly relate to the personal and professional skills SJTs aim to measure. Unlike essays, SJTs often don’t have a single “correct” answer, focusing instead on the complexity and diversity of responses.

A Novel Approach Using Large Language Models

A recent research paper, available here, explores a groundbreaking method to overcome these challenges by using Large Language Models (LLMs) to identify and extract relevant features from open-response SJT answers. Building on previous work that identified nine key features influencing human evaluations, this study investigated how well LLMs could automatically detect seven of these complex, nuanced features.

The researchers utilized data from the Casper SJT, a widely used open-response assessment designed to measure competencies such as collaboration, communication, empathy, ethics, problem-solving, and self-awareness. Casper presents scenarios (text or video-based) and asks respondents to explain how they would handle the situation. For this study, only typed responses were analyzed.

The Study’s Design and Findings

The study involved two main experiments. In the first, five state-of-the-art LLMs (GPT-4o mini, DeepSeek-R1, Llama 4 Maverick, o4-mini, and Claude Sonnet 4) were compared. Using a “zero-shot” prompting approach, in which no examples were given to the models, each LLM was tasked with classifying features in 162 Casper responses. The goal was to see how well their classifications aligned with those of human raters.
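To make the setup concrete, the sketch below shows what a zero-shot classification call of this kind might look like: a single Casper-style response is sent to a chat-completion API with a short instruction and no examples. The feature name, level labels, prompt wording, and model identifier are assumptions for illustration only; the paper's actual prompts and pipeline are not reproduced here.

```python
# Hypothetical sketch of zero-shot feature classification (not the paper's actual pipeline).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative feature and levels; the study's real rubrics are not public here.
FEATURE = "justification"
LEVELS = ["no justification", "weak justification", "reasonable justification"]

def classify_response(response_text: str, model: str = "o4-mini") -> str:
    """Ask the model to pick exactly one level for the feature, zero-shot (no examples)."""
    prompt = (
        f"You will read a response to a situational judgment test scenario.\n"
        f"Classify the level of '{FEATURE}' shown in the response.\n"
        f"Choose exactly one of: {', '.join(LEVELS)}.\n"
        f"Reply with the level only.\n\n"
        f"Response:\n{response_text}"
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()

if __name__ == "__main__":
    print(classify_response("I would first ask my teammate why the deadline was missed..."))
```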

The results showed that reasoning models like Claude Sonnet 4 and OpenAI’s o4-mini generally performed best, demonstrating a strong ability to identify complex constructs even with minimal instructions. Interestingly, different LLMs excelled at different features. For instance, GPT-4o mini was remarkably good at identifying responses where individuals stated they lacked enough information, while DeepSeek-R1 performed well on identifying creative arguments.

While promising, the initial zero-shot results indicated that LLMs didn’t quite reach human-level agreement for most features. This often stemmed from a misalignment in how LLMs separated the different levels of a feature (e.g., “no justification” vs. “reasonable justification”), with LLMs tending to favor middle-ground classifications.
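One way to quantify that kind of misalignment is to compare the LLM's labels with human labels on the same responses using a confusion matrix and a chance-corrected agreement statistic. The sketch below uses Cohen's kappa from scikit-learn on made-up labels; the article does not specify the study's actual agreement measure or data, so this is purely illustrative.

```python
# Illustrative agreement check between human and LLM labels (toy, fabricated labels).
from sklearn.metrics import cohen_kappa_score, confusion_matrix

LEVELS = ["no justification", "weak justification", "reasonable justification"]

# Toy labels; in the study each of the 162 responses would carry real ratings.
human = ["no justification", "reasonable justification", "weak justification",
         "reasonable justification", "no justification", "weak justification"]
llm   = ["weak justification", "reasonable justification", "weak justification",
         "weak justification", "no justification", "weak justification"]

# Quadratic weighting treats the levels as ordered, penalising distant disagreements more.
kappa = cohen_kappa_score(human, llm, labels=LEVELS, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")

# Rows = human level, columns = LLM level; an over-populated middle column would
# reflect the "middle-ground" bias described above.
print(confusion_matrix(human, llm, labels=LEVELS))
```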

This led to the second study, where researchers investigated whether providing more detailed instructions, including specific inclusion and exclusion criteria for each feature level, could improve LLM performance. Focusing on o4-mini due to its balance of performance and efficiency, they found that this “prompt engineering” strategy significantly improved the LLM’s agreement with human raters. For some features, like identifying disrespect, the improvement was substantial.
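A detailed prompt of this kind might spell out, for each level of a feature, what counts toward that level and what rules it out. The rubric text and helper function below are invented for illustration and are not the criteria used in the study.

```python
# Hypothetical detailed prompt with inclusion/exclusion criteria per feature level
# (the study's real criteria are not reproduced here).
RUBRIC = {
    "no justification": {
        "include": "The response states a decision with no reason given.",
        "exclude": "Do not use this level if any reason, even a brief one, is offered.",
    },
    "reasonable justification": {
        "include": "The response explains the decision with a relevant, coherent reason.",
        "exclude": "Do not use this level for reasons that contradict the scenario.",
    },
}

def build_detailed_prompt(feature: str, response_text: str) -> str:
    """Assemble a prompt that lists inclusion and exclusion criteria for each level."""
    lines = [f"Classify the level of '{feature}' in the response below."]
    for level, criteria in RUBRIC.items():
        lines.append(f"\nLevel: {level}")
        lines.append(f"  Include when: {criteria['include']}")
        lines.append(f"  Exclude when: {criteria['exclude']}")
    lines.append("\nReply with the level name only.")
    lines.append(f"\nResponse:\n{response_text}")
    return "\n".join(lines)

print(build_detailed_prompt("justification", "I would talk to my teammate first because..."))
```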

Looking Ahead: The Future of Automated Skill Assessment

This research offers a compelling vision for the future of assessing personal and professional skills. The findings suggest that LLMs can effectively extract construct-relevant features from open-response SJTs, paving the way for more scalable and standardized evaluation systems. The study also hints that a hybrid approach, potentially using different LLMs for different features or even ensemble methods, could yield even more accurate results.
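One simple form such a hybrid could take is to route each feature to whichever model agreed best with human raters on it, falling back to a majority vote across models otherwise. The sketch below assumes per-model predictions are already available as a Python dict; the routing table and labels are hypothetical.

```python
# Hypothetical feature routing and majority-vote ensemble over several LLMs' labels.
from collections import Counter

# Invented routing table: feature -> model that (hypothetically) agreed best with humans.
BEST_MODEL_FOR = {
    "missing information": "gpt-4o-mini",
    "creative argument": "deepseek-r1",
    "disrespect": "o4-mini",
}

def route_label(feature: str, predictions: dict[str, str]) -> str:
    """Prefer the designated model for this feature; otherwise take a majority vote."""
    preferred = BEST_MODEL_FOR.get(feature)
    if preferred in predictions:
        return predictions[preferred]
    votes = Counter(predictions.values())
    return votes.most_common(1)[0][0]

predictions = {"gpt-4o-mini": "present", "deepseek-r1": "absent", "o4-mini": "present"}
print(route_label("disrespect", predictions))      # uses o4-mini's label
print(route_label("other feature", predictions))   # falls back to majority vote
```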

Future work will involve expanding the dataset, exploring additional features, and refining prompt engineering techniques. Crucially, an automated scoring system based on this approach could provide real-time evaluation and personalized feedback to respondents, transforming how individuals develop these essential skills. Beyond assessments, this method could also be applied to analyze other forms of written text, such as personal essays and reference letters, enhancing their authenticity and utility.

Ananya Rao (https://blogs.edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
