Scaling Qualitative Research: The HICode Approach to Deep Text Analysis

TLDR: HICode is a pipeline that uses large language models (LLMs) to perform inductive coding on large text datasets. It operates in two main stages: first, it generates fine-grained labels directly from text segments based on a user’s research question; then it hierarchically clusters these labels into themes. This method lets researchers conduct nuanced qualitative analysis at scale, addressing the limitations of both manual labeling and traditional topic modeling. The approach was validated on three datasets and applied in a case study of opioid litigation documents, where it surfaced marketing strategies that would have been impractical to find by hand.

Researchers analyzing large collections of text often face a dilemma: label the data manually, which does not scale, or use statistical tools like topic modeling, which can be difficult to control and may miss the nuances of human interpretation. A new approach called HICode aims to bridge this gap by using large language models (LLMs) to perform qualitative analysis on large volumes of text.

HICode, developed by Mian Zhong, Pristina Wang, and Anjalie Field from Johns Hopkins University, is a two-part pipeline inspired by traditional qualitative research methods. Its core idea is to inductively generate labels directly from the analysis data and then hierarchically cluster these labels to reveal emergent themes. This allows for a deep, targeted analysis that was previously only feasible with manual effort on smaller datasets.

How HICode Works

The HICode pipeline consists of two main modules:

1. Label Generation: This module takes text segments (e.g., paragraphs from a document) and, guided by a user-defined research question and background information, prompts an LLM to generate clear, concise, and observational labels. The goal is to identify relevant content and assign fine-grained initial codes. For instance, if analyzing sales strategies, an LLM might label a segment as “Targeted Sales Volume Growth” or “Decile-Based Sales Performance.”

2. Hierarchical Clustering: Once initial labels are generated for all relevant text segments, the clustering module takes over. It uses repeated rounds of LLM prompting to group similar labels into higher-level themes. This process is hierarchical, meaning it can distill abstract, insightful themes from many fine-grained labels. A key benefit of this approach is the flexibility it offers: users can control the granularity of the themes, starting with high-level themes and drilling down to more detailed labels as needed.
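The two modules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`generate_labels`, `cluster_labels`), the prompt wording, the `IRRELEVANT` marker, the `Theme: label; label` response format, and the stopping rule are all hypothetical, and `toy_llm` is a keyword-based stand-in for a real LLM call so that the example runs offline.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def generate_labels(segments: List[str], question: str, background: str,
                    llm: LLM) -> Dict[int, List[str]]:
    """Stage 1: ask the LLM for fine-grained labels for each segment."""
    labels: Dict[int, List[str]] = {}
    for i, seg in enumerate(segments):
        prompt = (f"Background: {background}\n"
                  f"Research question: {question}\n"
                  f"Segment: {seg}\n"
                  "List concise observational labels, one per line, "
                  "or IRRELEVANT.")
        lines = [ln.strip() for ln in llm(prompt).splitlines() if ln.strip()]
        if lines and lines != ["IRRELEVANT"]:
            labels[i] = lines  # irrelevant segments are simply skipped
    return labels


def cluster_labels(labels: List[str], llm: LLM, max_rounds: int = 3,
                   target: int = 2) -> List[Dict[str, List[str]]]:
    """Stage 2: repeated grouping rounds; returns levels, finest first."""
    levels: List[Dict[str, List[str]]] = []
    current = sorted(set(labels))
    for _ in range(max_rounds):
        if len(current) <= target:
            break  # few enough themes; stop merging
        prompt = ("Group these labels into themes, one per line as "
                  "'Theme: label; label'.\n" + "\n".join(current))
        groups: Dict[str, List[str]] = {}
        for line in llm(prompt).splitlines():
            if ":" in line:
                theme, members = line.split(":", 1)
                groups[theme.strip()] = [m.strip() for m in members.split(";")
                                         if m.strip()]
        if not groups or len(groups) >= len(current):
            break  # the round did not reduce the number of groups
        levels.append(groups)
        current = sorted(groups)  # themes become the next round's input
    return levels


def toy_llm(prompt: str) -> str:
    """Keyword-based stand-in for a real model (assumption, demo only)."""
    if prompt.startswith("Group these labels"):
        items = prompt.splitlines()[1:]
        sales = [x for x in items if "Sales" in x]
        other = [x for x in items if "Sales" not in x]
        out = []
        if sales:
            out.append("Sales Performance: " + "; ".join(sales))
        if other:
            out.append("Other: " + "; ".join(other))
        return "\n".join(out)
    segment = prompt.split("Segment: ", 1)[1].splitlines()[0].lower()
    if "decile" in segment:
        return "Decile-Based Sales Performance"
    if "volume" in segment:
        return "Targeted Sales Volume Growth"
    if "newsletter" in segment:
        return "Internal Newsletter Messaging"
    return "IRRELEVANT"


segments = [
    "Reps in the top decile win this month's contest.",
    "Goal: grow volume 10% in the Northeast.",
    "Lunch is at noon on Friday.",
    "The newsletter notes rising public concern.",
]
labels_by_segment = generate_labels(
    segments, "What sales strategies are described?",
    "Internal sales emails", toy_llm)
flat = [lab for labs in labels_by_segment.values() for lab in labs]
hierarchy = cluster_labels(flat, toy_llm)
print(labels_by_segment)
print(hierarchy)
```

Returning every level of the hierarchy, rather than only the top one, is what would let a user start from broad themes and drill down to the fine-grained labels beneath them.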

Unlike traditional topic modeling, which is often unsupervised and focuses on general topics, HICode is designed to be targeted towards specific research questions, making it more controllable and aligned with qualitative research goals. It also differs from deductive coding, where labels are drawn from a pre-existing codebook, as HICode derives labels directly from the data itself.

Validating the Approach

The researchers validated HICode across three diverse datasets: the Media Frames Corpus, Astro Queries, and ML Values. They designed novel automated metrics to compare HICode’s generated themes with human-annotated data, measuring both theme-level and segment-level precision and recall. While automated metrics provide a useful comparison, the inherently interpretive nature of qualitative analysis means exact matches are not always expected.
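To make the idea of theme-level and segment-level precision and recall concrete, here is a rough Python sketch. It is an assumption-laden illustration, not the paper's metric definitions: the `shares_token` matching rule (two themes match if they share a lowercase word) is a simple stand-in for whatever matching procedure the authors use, the function names are hypothetical, and the theme names and assignments are made-up toy data.

```python
from typing import Callable, Dict, Set, Tuple

Match = Callable[[str, str], bool]


def shares_token(a: str, b: str) -> bool:
    """Hypothetical matching rule: themes match if they share a word."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


def theme_level_pr(predicted: Set[str], gold: Set[str],
                   match: Match) -> Tuple[float, float]:
    """Precision: share of predicted themes matching some gold theme.
    Recall: share of gold themes matched by some predicted theme."""
    hit_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    hit_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall


def segment_level_pr(pred: Dict[int, Set[str]], gold: Dict[int, Set[str]],
                     match: Match) -> Tuple[float, float]:
    """Same idea, but themes are compared within each segment."""
    hit_pred = total_pred = hit_gold = total_gold = 0
    for seg in set(pred) | set(gold):
        p, g = pred.get(seg, set()), gold.get(seg, set())
        total_pred += len(p)
        total_gold += len(g)
        hit_pred += sum(any(match(x, y) for y in g) for x in p)
        hit_gold += sum(any(match(x, y) for x in p) for y in g)
    precision = hit_pred / total_pred if total_pred else 0.0
    recall = hit_gold / total_gold if total_gold else 0.0
    return precision, recall


# Toy data (invented for illustration only).
predicted_themes = {"Economic Impact", "Legal Process", "Weather"}
gold_themes = {"Economic", "Legal", "Morality"}
theme_p, theme_r = theme_level_pr(predicted_themes, gold_themes, shares_token)

pred_by_segment = {0: {"Economic Impact"}, 1: {"Weather"}}
gold_by_segment = {0: {"Economic"}, 1: {"Legal"}, 2: {"Morality"}}
seg_p, seg_r = segment_level_pr(pred_by_segment, gold_by_segment, shares_token)

print(f"theme-level:   P={theme_p:.2f} R={theme_r:.2f}")
print(f"segment-level: P={seg_p:.2f} R={seg_r:.2f}")
```

A looser or stricter matching rule changes both numbers, which is one reason automated scores for interpretive work are best read alongside human evaluation.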

In a human evaluation study on the Astro Queries dataset, HICode significantly outperformed other methods like TopicGPT, with human annotators confirming that HICode’s themes captured similar information to their own annotations, even if the granularity sometimes differed. This suggests that HICode provides themes that are highly relevant and insightful from a human perspective.

A Case Study in Opioid Litigation

To demonstrate HICode’s real-world potential, the team conducted a case study using the UCSF-JHU Opioid Industry Documents Archive (OIDA). This archive contains millions of internal corporate documents related to opioid litigation, offering crucial insights into the crisis. Manually analyzing such a massive dataset is practically impossible.

By applying HICode to nearly 4,000 emails (parsed into over 160,000 segments) related to “sales contests” from the Mallinckrodt collection, the pipeline uncovered aggressive marketing strategies employed by pharmaceutical companies. HICode identified themes such as “Communication and Engagement,” revealing diverse sales communication techniques; “Crisis Management and Response,” showing strategies to anticipate lost revenue and identify prescriber opportunities; and “Community and Social Responsibility,” which surprisingly suggested corporate awareness of the public harm of their product through internal newsletters. This case study highlights HICode’s ability to facilitate nuanced analyses on large-scale data that were previously infeasible.

HICode gives researchers in computational social science and the digital humanities a tool for deep, nuanced exploratory corpus analysis at scale. For more details, see the full research paper, “HICode: Hierarchical Inductive Coding with LLMs.”
