TLDR: This research paper explores the categorical classification of Turkish book summaries using various word embedding techniques (One-Hot Encoding, TF-IDF, Word2Vec) and machine learning algorithms (SVM, Naive Bayes, Logistic Regression, etc.). The study found that stemming as a pre-processing step generally improved results. TF-IDF combined with SVM achieved high accuracy, while Word2Vec performed well with Logistic Regression. One-Hot Encoding showed strong results with Naive Bayes and Logistic Regression.
A recent study presented at the 6th International Conference on Data Science and Applications (ICONDATA’24) examines how well machine learning can classify book summaries by category. It focuses specifically on Turkish book summaries and addresses the challenges posed by the structure of the Turkish language.
The core of this study revolves around the classification of book summaries into eight distinct categories: fantastic, science fiction, romantic, history, detective, philosophy, cinema, and horror-thriller. To achieve this, the researchers utilized a dataset comprising 3,200 book summaries and their corresponding categories, sourced from a popular online book-selling website, idefix.com.
A crucial step in processing textual data for machine learning is “pre-processing.” This involves cleaning and normalizing the text so that algorithms see fewer irrelevant variations. The study explored various pre-processing methods, including converting all text to lowercase, removing punctuation, numbers, and non-alphanumeric characters, stemming words (reducing them to their root form), and eliminating “stopwords” (common words, such as “the,” “a,” and “is” in English, that often carry little meaning for classification). The researchers systematically tested different combinations of these pre-processing techniques to observe their impact on classification accuracy.
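To make the idea of switchable pre-processing steps concrete, here is a minimal Python sketch. It assumes the four-digit codes used later in this article (such as 0010 or 1010), where each digit turns one step on or off in the order lowercase, punctuation removal, stemming, stopword removal. The stopword set, the suffix list, and the function names are hypothetical placeholders, and `toy_stem` is a stand-in for a real Turkish stemmer, not the tool the researchers used.

```python
import re
import string

# Illustrative Turkish stopwords only; a real list would be much longer.
STOPWORDS = {"ve", "bir", "bu", "ile", "için"}

# Toy suffix list standing in for a real Turkish stemmer (assumption).
SUFFIXES = ("lar", "ler", "da", "de")


def turkish_lower(text):
    """Lowercase text, mapping I to dotless ı and İ to i as Turkish requires."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def toy_stem(word):
    """Strip one known suffix; a placeholder, not a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, flags):
    """Apply the steps switched on in a four-digit code such as '1010'.

    Digit order: lowercase, punctuation/number removal, stemming,
    stopword removal.
    """
    lower, strip_punct, stem, drop_stop = (c == "1" for c in flags)
    if lower:
        text = turkish_lower(text)
    if strip_punct:
        # Replace digits and ASCII punctuation with spaces.
        text = re.sub(r"[\d" + re.escape(string.punctuation) + r"]", " ", text)
    tokens = text.split()
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    if drop_stop:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


print(preprocess("Kitaplar ve Filmler, 2024!", "1111"))  # ['kitap', 'film']
print(preprocess("Kitaplar ve Filmler, 2024!", "0010"))  # ['Kitap', 've', 'Filmler,', '2024!']
```

The second call shows why the combinations interact: with punctuation removal switched off, the trailing comma hides the suffix from the stemmer, so "Filmler," is left unchanged.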
Following pre-processing, the study applied several text representation techniques, which it groups under the label “word embedding.” These techniques transform words into numerical vectors that machine learning models can process; some of them, such as Word2Vec, also capture semantic relationships between words. The techniques used were One-Hot Encoding, Word2Vec, and Term Frequency-Inverse Document Frequency (TF-IDF).
Understanding Word Embedding Techniques
One-Hot Encoding: This is a straightforward method where each unique word is represented as a binary vector, with a single 1 at that word’s position in the vocabulary and 0 everywhere else. While simple, it produces very large, sparse vectors when there are many unique words, and it doesn’t capture semantic relationships between words.
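A small Python sketch can illustrate the encoding. The article does not say how the study combines word vectors into a vector for a whole summary, so the `encode_document` function below assumes a binary presence vector (a 1 for every vocabulary word that appears in the summary). All function names and the sample tokens are illustrative.

```python
def build_vocabulary(documents):
    """Map each unique token to a column index, in sorted order."""
    vocab = sorted({token for doc in documents for token in doc})
    return {token: i for i, token in enumerate(vocab)}


def one_hot(token, vocab):
    """Binary vector with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector


def encode_document(doc, vocab):
    """Binary presence vector (assumed aggregation, see the text above)."""
    vector = [0] * len(vocab)
    for token in doc:
        if token in vocab:
            vector[vocab[token]] = 1
    return vector


docs = [["uzay", "gemi", "savaş"], ["aşk", "savaş"]]
vocab = build_vocabulary(docs)  # columns: aşk, gemi, savaş, uzay

print(one_hot("gemi", vocab))            # [0, 1, 0, 0]
print(encode_document(docs[1], vocab))   # [1, 0, 1, 0]
```

Every vector is as long as the vocabulary, which is why this representation grows quickly on a corpus of thousands of summaries.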
Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF assigns a numerical weight to each word in a document, reflecting how important a word is to a document in a collection. It considers both how frequently a word appears in a specific document (Term Frequency) and how rare it is across all documents (Inverse Document Frequency). Words with high TF-IDF scores are considered more significant to the document’s content.
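As a rough illustration, the sketch below computes one common textbook variant of the weight: term frequency (count divided by document length) multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. The article does not state which variant or library the study used, so the exact formula and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each token.
    df = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in counts.items()
        })
    return weights


docs = [["uzay", "gemi", "uzay"], ["aşk", "gemi"]]
weights = tf_idf(docs)

print(weights[0]["gemi"])            # 0.0, because "gemi" occurs in every document
print(round(weights[0]["uzay"], 3))  # 0.462
```

A word that appears in every document gets a weight of zero under this variant, which matches the intuition that it says nothing about any one category.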
Word2Vec: This is a more advanced technique that creates dense vector representations of words. It learns the meaning and relationships of words by analyzing their context within large text datasets. Word2Vec has two main architectures: Continuous Bag of Words (CBOW), which predicts a word based on its surrounding words, and Skip-Gram, which predicts surrounding words given a target word. Word2Vec is particularly effective at capturing semantic similarities, meaning words with similar meanings will have similar vector representations.
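The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The Python sketch below only generates those examples; it does not train a neural network, and the window size, function names, and sample sentence are illustrative assumptions rather than settings from the study.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word predicts each neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context_words, target) examples: the neighbors predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        if context:
            examples.append((context, target))
    return examples


sentence = ["genç", "büyücü", "okula", "gider"]

print(skipgram_pairs(sentence, window=1)[:3])
# [('genç', 'büyücü'), ('büyücü', 'genç'), ('büyücü', 'okula')]
print(cbow_examples(sentence, window=1)[1])
# (['genç', 'okula'], 'büyücü')
```

Skip-Gram turns each word into several small prediction tasks, while CBOW produces one task per word with the whole context as input.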
After applying these word embedding techniques, the prepared data was fed into various machine learning models for classification. The models tested included K-Nearest Neighbors (KNN), Naive Bayes (NB), Random Forest (RF), Decision Trees (DT), Support Vector Machines (SVM), Logistic Regression (LR), and AdaBoost (AB).
Key Findings
The research compared the effectiveness of different combinations of pre-processing, word embedding, and machine learning models. One significant observation was that the “stemming” pre-processing method consistently improved model performance across all word embedding techniques. A likely explanation is that reducing words to their root forms merges the many inflected variants of a word into a single feature, which matters in an agglutinative language such as Turkish.
For models using the TF-IDF technique, the Support Vector Machine (SVM) model achieved the highest success rate and F-Score of 0.8. SVM also generally outperformed other models in other pre-processing combinations when TF-IDF was used.
When Word2Vec was employed, Logistic Regression (LR) and SVM models showed the most promising results. Logistic Regression achieved the highest success and F-Score of 0.72 for the 0010 pre-processing combination. Each digit in such a code switches one step on (1) or off (0), so 0010 means no lowercase conversion, no punctuation removal, stemming, and no stopwords removal. This suggests that Logistic Regression made effective use of the word vectors generated by Word2Vec.
For datasets using the One-Hot Encoding method, the Naive Bayes (NB) model, along with Logistic Regression, achieved the highest success rates of 0.81 and 0.80 respectively, particularly with the 1010 and 1011 pre-processing combinations (lowercase conversion, no punctuation removal, stemming, with or without stopwords removal).
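For readers unfamiliar with the metrics behind these figures, the Python sketch below shows how a success rate and an F-Score can be computed from predicted and true categories. It assumes that “success rate” means accuracy and that the F-Score is macro-averaged over categories; the article states neither, and the labels below are made-up examples, not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of summaries assigned to the correct category."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-category F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


y_true = ["history", "horror", "history", "romantic"]
y_pred = ["history", "history", "history", "romantic"]

print(accuracy(y_true, y_pred))            # 0.75
print(round(macro_f1(y_true, y_pred), 2))  # 0.6
```

In this toy case the accuracy is higher than the macro F-Score because the one missed category counts as much as the two that were handled well.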
Overall, the study found that while pre-processing techniques generally improved performance, the specific combination chosen did not change success rates drastically. The choice of word embedding technique and machine learning model had a more pronounced impact. This comparison of natural language processing techniques and their combinations provides useful guidance for future research on Turkish text classification. For more detailed information, refer to the full research paper.


