Categorizing Book Summaries: An Analysis of Word Embedding and Machine Learning for Turkish Texts

TLDR: This research paper explores the categorical classification of Turkish book summaries using various word embedding techniques (One-Hot Encoding, TF-IDF, Word2Vec) and machine learning algorithms (SVM, Naive Bayes, Logistic Regression, etc.). The study found that stemming as a pre-processing step generally improved results. TF-IDF combined with SVM achieved high accuracy, while Word2Vec performed well with Logistic Regression. One-Hot Encoding showed strong results with Naive Bayes and Logistic Regression.

A recent study examines how well common text-representation techniques and machine learning models classify book summaries by category. The research, presented at the 6th International Conference on Data Science and Applications (ICONDATA’24), focuses specifically on Turkish book summaries and addresses the challenges posed by the structure of the Turkish language.

The core of this study revolves around the classification of book summaries into eight distinct categories: fantastic, science fiction, romantic, history, detective, philosophy, cinema, and horror-thriller. To achieve this, the researchers utilized a dataset comprising 3,200 book summaries and their corresponding categories, sourced from a popular online book-selling website, idefix.com.

A crucial step in processing textual data for machine learning is “pre-processing”: cleaning and preparing the text so that algorithms can work with it more easily. The study explored several pre-processing methods, including converting all text to lowercase, removing punctuation, numbers, and non-alphanumeric characters, stemming words (reducing them to their root form), and eliminating “stopwords” (very common words, comparable to “the,” “a,” or “is” in English, that often carry little meaning for classification). The researchers systematically tested different combinations of these techniques to observe their impact on classification accuracy.
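The four pre-processing steps can be sketched as independent switches on one function. This is a minimal illustration, not the authors' pipeline: the stopword list, the suffix list, and the function names are all hypothetical, and the "stemmer" only strips two plural suffixes, whereas a real Turkish stemmer handles far richer morphology.

```python
import string

# Hypothetical, tiny stopword list; a real Turkish list would be much longer.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "ile"}
# Toy suffix list standing in for a real Turkish stemmer (assumption).
TOY_SUFFIXES = ("lar", "ler")


def toy_stem(word):
    """Strip one plural suffix if enough of the word remains."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text, lowercase=True, strip_punct=True, stem=True,
               drop_stopwords=True):
    """Apply the selected pre-processing steps and return a token list."""
    if lowercase:
        # Caveat: str.lower() maps "I" to "i", not to the Turkish dotless "ı".
        text = text.lower()
    if strip_punct:
        # Removes ASCII punctuation and digits only.
        text = "".join(
            ch for ch in text
            if ch not in string.punctuation and not ch.isdigit()
        )
    tokens = text.split()
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in TURKISH_STOPWORDS]
    return tokens


print(preprocess("Kitaplar ve filmler, 2 kahraman!"))
# -> ['kitap', 'film', 'kahraman']
```

Keeping each step behind its own flag makes it straightforward to run the same text through every combination of steps and compare the results.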

Following pre-processing, the study employed several techniques that it groups under the term “word embedding.” In this broad sense, word embedding transforms text into numerical vectors that machine learning algorithms can process. The techniques used were One-Hot Encoding, Word2Vec, and Term Frequency-Inverse Document Frequency (TF-IDF); of the three, only Word2Vec is designed to capture semantic relationships between words.

Understanding Word Embedding Techniques

One-Hot Encoding: This is a straightforward method where each unique word is represented as a binary vector with a single 1 at that word’s position in the vocabulary. While simple, it produces very large, sparse vectors when there are many unique words, and it doesn’t capture semantic relationships between words.
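A minimal sketch of the idea, using a made-up four-word vocabulary. The helper names are hypothetical, and representing a whole summary as a binary presence vector over the vocabulary is an assumption here, since the article does not say how the study combined word vectors into document vectors.

```python
def build_vocab(docs):
    """Map each unique word to a fixed index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc})
    return {word: index for index, word in enumerate(words)}


def one_hot(word, vocab):
    """Binary vector with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


def doc_vector(tokens, vocab):
    """Assumed document representation: 1 for every vocabulary word present."""
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] = 1
    return vector


docs = [["uzay", "gemi"], ["tarih", "roman"], ["uzay", "roman"]]
vocab = build_vocab(docs)   # {'gemi': 0, 'roman': 1, 'tarih': 2, 'uzay': 3}

print(one_hot("roman", vocab))             # -> [0, 1, 0, 0]
print(doc_vector(["uzay", "gemi"], vocab))  # -> [1, 0, 0, 1]
```

The vector length equals the vocabulary size, which is why the representation grows quickly on a real corpus.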

Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF assigns a numerical weight to each word in a document, reflecting how important a word is to a document in a collection. It considers both how frequently a word appears in a specific document (Term Frequency) and how rare it is across all documents (Inverse Document Frequency). Words with high TF-IDF scores are considered more significant to the document’s content.
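The weighting can be sketched in a few lines. This version uses one common textbook definition, tf = count / document length and idf = log(N / document frequency); libraries often add smoothing or normalization, and the article does not state which variant the study used, so treat the exact formula as an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [["uzay", "gemi", "uzay"], ["tarih", "roman"], ["uzay", "roman"]]
weights = tf_idf(docs)

# "gemi" occurs in only one summary, so it outweighs "uzay" in the first
# summary even though "uzay" is more frequent there.
print(weights[0])
```

Under this definition a word that appears in every document gets a weight of zero, since log(N / N) = 0.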

Word2Vec: This is a more advanced technique that creates dense vector representations of words. It learns the meaning and relationships of words by analyzing their context within large text datasets. Word2Vec has two main architectures: Continuous Bag of Words (CBOW), which predicts a word based on its surrounding words, and Skip-Gram, which predicts surrounding words given a target word. Word2Vec is particularly effective at capturing semantic similarities, meaning words with similar meanings will have similar vector representations.
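The difference between the two architectures is easiest to see in the training examples each one derives from a sentence. The sketch below only builds those examples; it does not train a model, and the window size and function names are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-Gram: (target, context_word) pairs; predict context from target."""
    pairs = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """CBOW: (context_words, target) pairs; predict target from context."""
    pairs = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        if context:
            pairs.append((context, target))
    return pairs


tokens = ["dedektif", "gizemli", "cinayeti", "inceler"]

print(skipgram_pairs(tokens)[:3])
# -> [('dedektif', 'gizemli'), ('gizemli', 'dedektif'), ('gizemli', 'cinayeti')]
print(cbow_pairs(tokens)[1])
# -> (['dedektif', 'cinayeti'], 'gizemli')
```

Words that keep appearing with the same neighbors end up with similar training examples, which is what pushes their learned vectors close together.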

After applying these word embedding techniques, the prepared data was fed into various machine learning models for classification. The models tested included K-Nearest Neighbors (KNN), Naive Bayes (NB), Random Forest (RF), Decision Trees (DT), Support Vector Machines (SVM), Logistic Regression (LR), and AdaBoost (AB).
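As an illustration of the classification step, here is a small multinomial Naive Bayes classifier, one of the model families listed above, written from scratch. It is a sketch under assumptions rather than the study's implementation: the class name, the add-one (Laplace) smoothing, and the four-summary training set are all made up for the example.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                if token in self.vocab:
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    ["uzay", "gemi", "robot"],
    ["robot", "gezegen"],
    ["cinayet", "dedektif"],
    ["dedektif", "ipucu", "cinayet"],
]
train_labels = ["science fiction", "science fiction", "detective", "detective"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["robot", "uzay"]))  # -> science fiction
print(model.predict(["dedektif"]))       # -> detective
```

In practice each of the listed models would be trained on the vectors produced by the chosen representation and compared on held-out summaries.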

Key Findings

The research compared the effectiveness of different combinations of pre-processing, word embedding, and machine learning models. One significant observation was that the “stemming” pre-processing method consistently improved model performance across all word embedding techniques. This suggests that reducing words to their root forms helps the models by collapsing the many inflected forms of a word into a single feature, which matters in an agglutinative language such as Turkish.

For models using the TF-IDF technique, the Support Vector Machine (SVM) model achieved the highest success rate and F-Score of 0.8. SVM also generally outperformed other models in other pre-processing combinations when TF-IDF was used.
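For readers unfamiliar with the metric, the F-Score combines precision and recall as F1 = 2PR / (P + R). The sketch below computes it per class and then averages the classes equally (macro averaging). The averaging method is an assumption, since the article does not say how the study aggregated scores across the eight categories, and the labels are placeholders.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class from true-positive, false-positive, false-negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


y_true = ["history", "history", "horror", "horror"]
y_pred = ["history", "horror", "horror", "horror"]

print(round(macro_f1(y_true, y_pred), 3))  # -> 0.733
```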

When Word2Vec was employed, Logistic Regression (LR) and SVM showed the most promising results. Logistic Regression achieved the highest success rate and F-Score of 0.72 for the 0010 pre-processing combination. Each combination is written as a four-digit code with one digit per step, in the order lowercase conversion, punctuation removal, stemming, and stopword removal, where 1 means the step was applied; 0010 therefore means stemming only. This suggests that Logistic Regression made effective use of the word vectors generated by Word2Vec.

For datasets using the One-Hot Encoding method, the Naive Bayes (NB) model, along with Logistic Regression, achieved the highest success rates of 0.81 and 0.80 respectively, particularly with the 1010 and 1011 pre-processing combinations (lowercase conversion, no punctuation removal, stemming, with or without stopwords removal).
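The four-digit codes quoted in these results can be decoded mechanically. The sketch below assumes the digit order implied by the descriptions in this article (lowercase conversion, punctuation removal, stemming, stopword removal); the step names and the function name are hypothetical.

```python
from itertools import product

# Digit order inferred from the combinations described in the article.
STEP_NAMES = ("lowercase", "punctuation_removal", "stemming", "stopword_removal")


def decode_combination(code):
    """Turn a code such as '1010' into a {step: applied?} mapping."""
    if len(code) != len(STEP_NAMES) or set(code) - {"0", "1"}:
        raise ValueError(f"expected {len(STEP_NAMES)} binary digits, got {code!r}")
    return {name: bit == "1" for name, bit in zip(STEP_NAMES, code)}


# Four on/off steps give 2**4 = 16 possible combinations.
ALL_CODES = ["".join(bits) for bits in product("01", repeat=len(STEP_NAMES))]

print(decode_combination("1010"))
# -> {'lowercase': True, 'punctuation_removal': False,
#     'stemming': True, 'stopword_removal': False}
print(len(ALL_CODES))  # -> 16
```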

Overall, the study found that while pre-processing techniques generally improved performance, their specific combinations did not lead to drastic changes in success rates. The choice of word embedding technique and machine learning model had a more pronounced impact. This comparison of natural language processing techniques and their combinations provides useful guidance for future research on Turkish text classification. For more detail, refer to the full research paper.
