Enhancing Library Subject Analysis with a Hybrid AI Approach

TLDR: A new research paper introduces a hybrid AI framework that combines machine learning models with large language models (LLMs) to improve subject analysis in library cataloging. Small ML models predict how many Library of Congress Subject Headings (LCSH) a work should receive, and that number guides the LLM’s output; a post-processing step then replaces ‘hallucinated’ terms with valid LCSH entries. Together, these steps raised recall from 43% to 63% and precision from 8% to 26%.

Libraries play a vital role in organizing information, and a key part of this is subject analysis – determining the core topics of a resource and assigning standardized terms to help users find what they need. Traditionally, this has been a manual, labor-intensive process, often using systems like the Library of Congress Subject Headings (LCSH). However, with collections growing rapidly, manual analysis struggles to keep up.

While machine learning (ML) and deep learning (DL) methods have been proposed to automate this, they face challenges like data imbalance and limited generalization. More recently, large language models (LLMs) have emerged as powerful tools for natural language processing, but their application in subject analysis is still underexplored. LLMs can sometimes generate too many terms, or even ‘hallucinate’ terms that aren’t part of the standardized vocabulary.

A New Hybrid Approach

To overcome these limitations, researchers have proposed a novel hybrid framework that combines the strengths of traditional ML models with LLMs. This approach aims to leverage the LLMs’ ability to understand context and generate natural language, while using ML models to provide precision and control over the output.

The framework operates in three main phases:

First, the researchers explored how LLMs perform on their own. They tested different LLMs using various prompt engineering techniques, including ‘zero-shot’ learning (where the model gets no examples) and ‘Chain-of-Thought’ (CoT) prompting, which guides the LLM through a step-by-step reasoning process. They also fine-tuned LLMs using existing library datasets, finding that techniques like Low-Rank Adaptation (LoRA) significantly improved performance and efficiency compared to updating the entire model.
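To see why LoRA is cheaper than updating the entire model, it helps to count trainable parameters. LoRA freezes a weight matrix of shape d x k and learns a low-rank update B @ A, where B is d x r and A is r x k. The sketch below compares the two counts; the layer sizes and rank are illustrative assumptions, not values reported in the paper.

```python
def full_update_params(d: int, k: int) -> int:
    """Trainable parameters when a d x k weight matrix is updated directly."""
    return d * k


def lora_update_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update B @ A,
    where B is d x r and A is r x k (the original matrix stays frozen)."""
    return r * (d + k)


# Illustrative sizes only (assumed, not taken from the paper).
d, k, r = 4096, 4096, 8
full = full_update_params(d, k)
lora = lora_update_params(d, k, r)
print(f"full: {full:,}  LoRA: {lora:,}  share of full: {lora / full:.4%}")
```

For a single matrix of this size, the low-rank update trains well under one percent of the parameters that a full update would touch.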

Second, to address the issue of LLMs generating an inconsistent number of subject terms, the framework introduces a guiding mechanism. Small, efficient ML models are trained to predict the optimal number of LCSH terms for a given book’s title and abstract. This predicted number then acts as a constraint, guiding the LLM to generate a more appropriate quantity of terms, preventing over-generation and improving the relevance of the output.
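A minimal sketch of this guiding step might look like the following. Everything here is hypothetical: `predict_term_count` is a length-based stand-in for the trained ML model (the article does not say which model or features the researchers used), the prompt wording is invented for illustration, and the final truncation guard is an extra safety net rather than something the paper is known to include.

```python
import re


def predict_term_count(title: str, abstract: str, lo: int = 1, hi: int = 6) -> int:
    """Hypothetical stand-in for the trained ML predictor.
    Uses text length only; a real system would use a learned model."""
    n_words = len(re.findall(r"\w+", f"{title} {abstract}"))
    return max(lo, min(hi, 1 + n_words // 40))


def build_prompt(title: str, abstract: str, n_terms: int) -> str:
    """Embed the predicted count in the instruction as a constraint (assumed wording)."""
    return (
        f"Assign exactly {n_terms} Library of Congress Subject Headings "
        f"to the following work.\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        f"Return one heading per line."
    )


def enforce_count(terms: list[str], n_terms: int) -> list[str]:
    """Deduplicate (case-insensitively, keeping order) and truncate to n_terms."""
    seen, kept = set(), []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(cleaned)
    return kept[:n_terms]
```

The prompt carries the constraint to the LLM, while `enforce_count` covers the case where the model ignores it and returns more terms than requested.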

Third, to combat the problem of LLM ‘hallucinations’ – generating terms not found in the official LCSH vocabulary – a post-processing step is implemented. This step identifies any non-standard terms generated by the LLM and replaces them with the most semantically similar valid LCSH terms. This ensures that all assigned subject headings are standardized and accurate, enhancing the usability of the system.
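The replacement step can be sketched as a nearest-neighbor lookup against the controlled vocabulary. Note the assumption: the framework matches on semantic similarity, which would normally rely on text embeddings, whereas this self-contained sketch substitutes the standard library's character-level `difflib.SequenceMatcher` so it runs without extra dependencies. The four-entry vocabulary is a toy sample, not the full LCSH list.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for embedding-based semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_to_vocabulary(terms: list[str], vocabulary: list[str]) -> list[str]:
    """Keep terms already in the vocabulary; replace the rest with the closest valid entry."""
    lookup = {v.lower(): v for v in vocabulary}
    result = []
    for term in terms:
        exact = lookup.get(term.strip().lower())
        if exact is not None:
            best = exact
        else:
            best = max(vocabulary, key=lambda v: similarity(term, v))
        if best not in result:  # avoid duplicates after mapping
            result.append(best)
    return result


# Toy vocabulary and LLM output, for illustration only.
vocabulary = [
    "Machine learning",
    "Cataloging",
    "Subject headings",
    "Natural language processing (Computer science)",
]
generated = ["Machine learning", "Library cataloguing", "Subject heading"]
print(map_to_vocabulary(generated, vocabulary))
```

Because every output term is drawn from `vocabulary`, the result is guaranteed to be standardized, although a lexical matcher like this one will miss synonyms that an embedding model would catch.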

Promising Results

Experiments using the Llama-3.1-8B model demonstrated significant improvements. The Chain-of-Thought approach enhanced the diversity and richness of the generated terms. Fine-tuning, especially with LoRA, proved highly effective, increasing the model’s understanding of domain-specific tasks. Using ML models to predict the optimal number of terms sharply reduced the average number of generated terms while improving precision and the F1-score (the harmonic mean of precision and recall).

Crucially, the post-processing step, which maps hallucinated terms to valid LCSH entries, led to a notable increase in both recall (capturing more relevant terms) and precision (reducing irrelevant terms). Overall, the hybrid framework improved the recall of LCSH term generation from a baseline of 43% to 63%, and precision from 8% to 26%.
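For readers less familiar with these metrics, the sketch below computes precision, recall, and F1 for one record by comparing the generated headings with the cataloger-assigned ones. The term lists are made up for illustration, and the paper's exact evaluation protocol (for example, how scores are averaged across records) is not described in this summary, so treat this as the textbook set-based definition rather than the authors' code.

```python
def precision_recall_f1(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one record."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical record: three gold headings.
gold = ["Cataloging", "Machine learning", "Subject headings"]

# Over-generation: ten terms, two of them correct.
verbose = ["Cataloging", "Machine learning"] + [f"Extra term {i}" for i in range(8)]
# Count-guided output: four terms, the same two correct.
guided = ["Cataloging", "Machine learning", "Extra term 0", "Extra term 1"]

print(precision_recall_f1(verbose, gold))
print(precision_recall_f1(guided, gold))
```

Both outputs recover the same two gold headings, so recall is identical, but the shorter list has far fewer wrong terms and therefore higher precision and F1. This is the effect the term-count constraint is designed to produce.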

This research offers a practical and scalable solution for automating subject analysis in libraries, providing a flexible balance between comprehensive coverage and precise term assignment. While the study acknowledges limitations, such as the subjective nature of subject analysis and resource constraints for testing more advanced LLMs, it lays a strong foundation for future work, including human evaluation and exploring techniques like retrieval-augmented generation (RAG).

For more details, you can read the full research paper: A Hybrid Framework for Subject Analysis.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
