
Assessing Soft Skills: How Large Language Models Are Advancing Situational Judgment Tests

TLDR: A research paper explores using Large Language Models (LLMs) to automatically identify personal and professional skill features in open-response Situational Judgment Tests (SJTs), like the Casper SJT. The study found that LLMs, especially reasoning models, can effectively extract complex features, and their performance significantly improves with detailed prompting. This approach offers a scalable solution for assessing crucial soft skills, potentially enabling automated scoring and real-time feedback.

In today’s rapidly evolving world, academic programs and employers are increasingly recognizing that personal and professional skills are just as crucial as technical expertise for success. Skills like communication, teamwork, problem-solving, and critical thinking are vital, but measuring them accurately and at scale has always been a significant challenge.

Traditionally, tools like grade point averages (GPAs) and standardized tests have focused on “hard skills.” While reference letters, personal essays, and interviews have attempted to gauge personal attributes, they often fall short of rigorous psychometric standards and face growing concerns about authenticity, especially with the rise of generative AI.

Situational Judgment Tests (SJTs) have emerged as a more reliable and standardized way to assess these “soft skills.” SJTs present individuals with hypothetical scenarios and ask how they would react. Open-response SJTs, where participants provide written or spoken answers, are particularly effective at measuring behavioral tendencies. However, these have historically relied on trained human raters, which is a labor-intensive process that makes large-scale implementation difficult.

Past attempts to automate scoring for SJTs using Natural Language Processing (NLP) faced issues because the features typically used in NLP (like grammar or coherence) don’t directly relate to the personal and professional skills SJTs aim to measure. Unlike essays, SJTs often don’t have a single “correct” answer, focusing instead on the complexity and diversity of responses.

A Novel Approach Using Large Language Models

A recent research paper, available here, explores a groundbreaking method to overcome these challenges by using Large Language Models (LLMs) to identify and extract relevant features from open-response SJT answers. Building on previous work that identified nine key features influencing human evaluations, this study investigated how well LLMs could automatically detect seven of these complex, nuanced features.

The researchers utilized data from the Casper SJT, a widely used open-response assessment designed to measure competencies such as collaboration, communication, empathy, ethics, problem-solving, and self-awareness. Casper presents scenarios (text or video-based) and asks respondents to explain how they would handle the situation. For this study, only typed responses were analyzed.

The Study’s Design and Findings

The study involved two main experiments. In the first, five state-of-the-art LLMs (GPT-4o mini, DeepSeek-R1, Llama 4 Maverick, o4-mini, and Claude Sonnet 4) were compared. Using a “zero-shot” prompting approach, in which no examples were given to the models, each LLM was tasked with classifying features in 162 Casper responses. The goal was to see how well their classifications aligned with those of human raters.
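To make the setup concrete, the sketch below shows what a zero-shot classification call of this kind might look like: a single Casper-style response is sent to a chat-completion API with a short instruction and no examples. The feature name, level labels, prompt wording, and model identifier are assumptions for illustration only; the paper's actual prompts and pipeline are not reproduced here.

```python
# Hypothetical sketch of zero-shot feature classification (not the paper's actual pipeline).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative feature and levels; the study's real rubrics are not public here.
FEATURE = "justification"
LEVELS = ["no justification", "weak justification", "reasonable justification"]

def classify_response(response_text: str, model: str = "o4-mini") -> str:
    """Ask the model to pick exactly one level for the feature, zero-shot (no examples)."""
    prompt = (
        f"You will read a response to a situational judgment test scenario.\n"
        f"Classify the level of '{FEATURE}' shown in the response.\n"
        f"Choose exactly one of: {', '.join(LEVELS)}.\n"
        f"Reply with the level only.\n\n"
        f"Response:\n{response_text}"
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip()

if __name__ == "__main__":
    print(classify_response("I would first ask my teammate why the deadline was missed..."))
```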

The results showed that reasoning models like Claude Sonnet 4 and OpenAI’s o4-mini generally performed best, demonstrating a strong ability to identify complex constructs even with minimal instructions. Interestingly, different LLMs excelled at different features. For instance, GPT-4o mini was remarkably good at identifying responses where individuals stated they lacked enough information, while DeepSeek-R1 performed well on identifying creative arguments.

While promising, the initial zero-shot results indicated that LLMs didn’t quite reach human-level agreement for most features. This often stemmed from a misalignment in how LLMs separated the different levels of a feature (e.g., “no justification” vs. “reasonable justification”), with LLMs tending to favor middle-ground classifications.
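One way to quantify that kind of misalignment is to compare the LLM's labels with human labels on the same responses using a confusion matrix and a chance-corrected agreement statistic. The sketch below uses Cohen's kappa from scikit-learn on made-up labels; the article does not specify the study's actual agreement measure or data, so this is purely illustrative.

```python
# Illustrative agreement check between human and LLM labels (toy, fabricated labels).
from sklearn.metrics import cohen_kappa_score, confusion_matrix

LEVELS = ["no justification", "weak justification", "reasonable justification"]

# Toy labels; in the study each of the 162 responses would carry real ratings.
human = ["no justification", "reasonable justification", "weak justification",
         "reasonable justification", "no justification", "weak justification"]
llm   = ["weak justification", "reasonable justification", "weak justification",
         "weak justification", "no justification", "weak justification"]

# Quadratic weighting treats the levels as ordered, penalising distant disagreements more.
kappa = cohen_kappa_score(human, llm, labels=LEVELS, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")

# Rows = human level, columns = LLM level; an over-populated middle column would
# reflect the "middle-ground" bias described above.
print(confusion_matrix(human, llm, labels=LEVELS))
```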

This led to the second study, where researchers investigated whether providing more detailed instructions, including specific inclusion and exclusion criteria for each feature level, could improve LLM performance. Focusing on o4-mini due to its balance of performance and efficiency, they found that this “prompt engineering” strategy significantly improved the LLM’s agreement with human raters. For some features, like identifying disrespect, the improvement was substantial.
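A detailed prompt of this kind might spell out, for each level of a feature, what counts toward that level and what rules it out. The rubric text and helper function below are invented for illustration and are not the criteria used in the study.

```python
# Hypothetical detailed prompt with inclusion/exclusion criteria per feature level
# (the study's real criteria are not reproduced here).
RUBRIC = {
    "no justification": {
        "include": "The response states a decision with no reason given.",
        "exclude": "Do not use this level if any reason, even a brief one, is offered.",
    },
    "reasonable justification": {
        "include": "The response explains the decision with a relevant, coherent reason.",
        "exclude": "Do not use this level for reasons that contradict the scenario.",
    },
}

def build_detailed_prompt(feature: str, response_text: str) -> str:
    """Assemble a prompt that lists inclusion and exclusion criteria for each level."""
    lines = [f"Classify the level of '{feature}' in the response below."]
    for level, criteria in RUBRIC.items():
        lines.append(f"\nLevel: {level}")
        lines.append(f"  Include when: {criteria['include']}")
        lines.append(f"  Exclude when: {criteria['exclude']}")
    lines.append("\nReply with the level name only.")
    lines.append(f"\nResponse:\n{response_text}")
    return "\n".join(lines)

print(build_detailed_prompt("justification", "I would talk to my teammate first because..."))
```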

Looking Ahead: The Future of Automated Skill Assessment

This research offers a compelling vision for the future of assessing personal and professional skills. The findings suggest that LLMs can effectively extract construct-relevant features from open-response SJTs, paving the way for more scalable and standardized evaluation systems. The study also hints that a hybrid approach, potentially using different LLMs for different features or even ensemble methods, could yield even more accurate results.
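One simple form such a hybrid could take is to route each feature to whichever model agreed best with human raters on it, falling back to a majority vote across models otherwise. The sketch below assumes per-model predictions are already available as a Python dict; the routing table and labels are hypothetical.

```python
# Hypothetical feature routing and majority-vote ensemble over several LLMs' labels.
from collections import Counter

# Invented routing table: feature -> model that (hypothetically) agreed best with humans.
BEST_MODEL_FOR = {
    "missing information": "gpt-4o-mini",
    "creative argument": "deepseek-r1",
    "disrespect": "o4-mini",
}

def route_label(feature: str, predictions: dict[str, str]) -> str:
    """Prefer the designated model for this feature; otherwise take a majority vote."""
    preferred = BEST_MODEL_FOR.get(feature)
    if preferred in predictions:
        return predictions[preferred]
    votes = Counter(predictions.values())
    return votes.most_common(1)[0][0]

predictions = {"gpt-4o-mini": "present", "deepseek-r1": "absent", "o4-mini": "present"}
print(route_label("disrespect", predictions))      # uses o4-mini's label
print(route_label("other feature", predictions))   # falls back to majority vote
```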

Future work will involve expanding the dataset, exploring additional features, and refining prompt engineering techniques. Crucially, an automated scoring system based on this approach could provide real-time evaluation and personalized feedback to respondents, transforming how individuals develop these essential skills. Beyond assessments, this method could also be applied to analyze other forms of written text, such as personal essays and reference letters, enhancing their authenticity and utility.

Ananya Rao (https://blogs.edgentiq.com)
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
