Unmasking AI-Generated Text: A Stylometric Approach with Boosted Trees

TLDR: This paper introduces StylOch, a system for detecting machine-generated text using gradient-boosted trees and frequency-based stylometric features. It leverages a large dataset of over 500,000 texts and linguistic annotations from spaCy to extract thousands of features. While effective and explainable, the system’s performance is notably impacted by text obfuscation, suggesting future work in data augmentation and feature engineering.

In an era where Large Language Models (LLMs) are increasingly prevalent, the ability to distinguish between human-written and machine-generated text (MGT) has become crucial. This challenge is particularly relevant in professional fields like academia, medicine, and journalism, where issues such as plagiarism and factual accuracy are paramount. Addressing this urgent need, a recent research paper titled “StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features” delves into a robust method for AI text detection. You can find the full paper here: RESEARCH_PAPER_URL.

Understanding MGT Detection Methods

The landscape of MGT detection is diverse, encompassing various approaches. Some methods rely on analyzing terms, while others use perplexity or logit statistics, which often require access to the LLM generator itself. Another technique, watermarking, involves embedding an imperceptible signature within the generated text. However, the authors of this paper opted for a “black-box” approach, focusing on methods that do not require direct access to the LLM’s internal workings. Their work builds upon previous findings that simple, non-neural classifiers, particularly those utilizing stylometric features, can be surprisingly effective, even outperforming more complex neural networks in certain scenarios.

The StylOch System: A Deep Dive

The core of the StylOch system lies in its modular design, combining gradient-boosted tree models with sophisticated feature engineering and a massive training dataset. The team did not specifically target obfuscation techniques, which are strategies used to make machine-generated text harder to detect.

Extensive Training Data

Recognizing the importance of comprehensive training, the researchers amassed a substantial corpus of over 500,000 text samples. These samples were drawn from a variety of openly accessible datasets designed for MGT detection benchmarks, including those from PAN’25, AuTexTification, CHEAT, HC3, MAGE, Multitude, and M4. This diverse collection, spanning genres like essays, news, reviews, abstracts, and Q&A, aimed to ensure the model’s robustness and generalizability across different text types.

Sophisticated Stylometric Features

Instead of relying on a fixed set of features, the StylOch system programmatically generates features based on linguistic analysis. It leverages public spaCy models for text preprocessing, which includes tasks like tokenization, named entity recognition, and part-of-speech tagging. From these linguistic annotations, the system extracts thousands of features, primarily the normalized frequencies of the following: lemmas (base forms of words, counted as single words and as sequences of up to three words); part-of-speech tags (e.g., noun, verb, adjective, counted as sequences of up to four tags, with punctuation included); dependency-based bigrams (pairs of words linked by a grammatical relation); morphological annotations (grammatical properties of words); and named-entity types such as “PERSON” or “LOCATION”. These features are designed to capture subtle stylistic patterns. For instance, redundant whitespace characters, often a human mistake or an artifact of LLM processing, can be detected through the punctuation features. Because each feature corresponds to a nameable linguistic pattern, the approach remains explainable.
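
To make the idea of normalized n-gram frequencies concrete, here is a minimal sketch in plain Python. It is not the authors' code: the hand-written (lemma, tag) pairs stand in for spaCy output, the helper name ngram_frequencies is hypothetical, and dividing each count by the number of n-grams of the same length is an assumption, since the normalization scheme is not spelled out above.

```python
from collections import Counter

def ngram_frequencies(items, max_n, prefix):
    """Return relative frequencies of all n-grams of length 1..max_n.

    Assumption: each count is divided by the number of n-grams of that
    length in the text; the paper's exact normalization may differ.
    """
    features = {}
    for n in range(1, max_n + 1):
        grams = [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]
        if not grams:
            continue  # text shorter than n tokens
        for gram, count in Counter(grams).items():
            features[f"{prefix}{n}:{' '.join(gram)}"] = count / len(grams)
    return features

# Hand-annotated stand-in for spaCy output: (lemma, part-of-speech tag).
annotated = [
    ("the", "DET"), ("cat", "NOUN"), ("sit", "VERB"), ("on", "ADP"),
    ("the", "DET"), ("mat", "NOUN"), (".", "PUNCT"),
]
lemmas = [lemma for lemma, _ in annotated]
tags = [tag for _, tag in annotated]

features = {}
features.update(ngram_frequencies(lemmas, 3, "lemma"))  # lemma 1- to 3-grams
features.update(ngram_frequencies(tags, 4, "pos"))      # tag 1- to 4-grams
```

Each key names a concrete pattern (for example "pos2:DET NOUN"), which is what lets a feature's importance in the trained model be read back as a statement about style.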

The Classifier: Light Gradient-Boosting Machine

For classification, the researchers employed the Light Gradient-Boosting Machine (LightGBM, abbreviated LGBM here), a widely used boosted-trees classifier known for its efficiency. They used scikit-learn for feature counting and for cross-validation, a technique that gives more reliable estimates of the model’s performance than a single train/validation split. The model’s capacity was varied across submissions (small, medium, big) to explore its effect on detection performance, with larger capacities generally leading to better results.
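
For readers unfamiliar with boosting, the sketch below shows the general principle on a toy problem: each round fits a one-split tree (a stump) to the current errors and adds it, scaled by a learning rate, to the running score. This is a simplified illustration in plain Python, not LightGBM and not the authors' setup; the data, function names, and hyperparameters are all made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Pick the single feature/threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, left_mean, right_mean)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]  # (feature index, threshold, left value, right value)

def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for log loss: each stump fits the residuals y - p."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, left_val, right_val = fit_stump(X, residuals)
        stumps.append((j, thr, left_val, right_val))
        scores = [s + learning_rate * (left_val if row[j] <= thr else right_val)
                  for s, row in zip(scores, X)]
    return stumps

def predict_proba(stumps, row, learning_rate=0.5):
    """Sum the scaled stump outputs and map the score to a probability."""
    score = sum(learning_rate * (lv if row[j] <= thr else rv)
                for j, thr, lv, rv in stumps)
    return sigmoid(score)

# Toy data: two made-up frequency features per text; label 1 = machine-generated.
X = [[0.10, 0.02], [0.12, 0.01], [0.30, 0.05], [0.28, 0.04]]
y = [0, 0, 1, 1]
model = fit_boosted_stumps(X, y)
probs = [predict_proba(model, row) for row in X]
```

Production libraries such as LightGBM grow deeper trees, bin feature values into histograms, and add regularization, so the number and depth of trees is one natural reading of the "capacity" varied across the small, medium, and big submissions.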

Evaluation and Results

The StylOch system was evaluated as part of Subtask 1 “AI Detection Sensitivity” at the PAN: Voight-Kampff Generative AI Detection 2025 task, using the TIRA platform for reproducible submissions. Performance was measured using several metrics, including ROC-AUC, Brier score, C@1, F1 score, and F0.5u. The evaluation showed that increasing the model’s capacity and using cross-validation generally led to higher scores on validation datasets. However, the results also highlighted a significant challenge: obfuscation, which is the intentional alteration of text to evade detection, considerably reduced the detection performance. While the StylOch model did not surpass the best baseline (a TF-IDF based classifier) in all metrics, its performance was competitive, especially considering its non-neural and explainable nature.
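
Three of these metrics are less familiar than ROC-AUC and F1, so a small sketch may help. The functions below follow the definitions as commonly used in PAN evaluations, to the best of my understanding: a score of exactly 0.5 counts as "unanswered", C@1 gives unanswered cases partial credit in proportion to accuracy on the answered ones, and F0.5u treats unanswered cases as false negatives. The function names and toy data are assumptions, and the official evaluator may differ in details (for example, it may report the complement of the Brier score so that higher is better).

```python
def brier_score(labels, scores):
    """Mean squared difference between predicted probability and true label."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)

def c_at_1(labels, scores, threshold=0.5):
    """Accuracy variant that rewards abstaining (score == threshold)."""
    n = len(labels)
    n_unanswered = sum(1 for s in scores if s == threshold)
    n_correct = sum(1 for y, s in zip(labels, scores)
                    if s != threshold and (s > threshold) == (y == 1))
    return (n_correct + n_unanswered * n_correct / n) / n

def f05u(labels, scores, threshold=0.5):
    """F0.5 variant in which unanswered cases count as false negatives."""
    tp = fp = fn = unanswered = 0
    for y, s in zip(labels, scores):
        if s == threshold:
            unanswered += 1
        elif s > threshold and y == 1:
            tp += 1
        elif s > threshold and y == 0:
            fp += 1
        elif s < threshold and y == 1:
            fn += 1
    denom = 1.25 * tp + 0.25 * (fn + unanswered) + fp
    return 1.25 * tp / denom if denom else 0.0

# Toy example: label 1 = machine-generated; the third score abstains.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.5, 0.7]

brier = brier_score(labels, scores)
c1 = c_at_1(labels, scores)
f = f05u(labels, scores)
```

On this toy input, two of the three answered cases are correct and one is left unanswered, so C@1 comes out above plain accuracy (0.625 versus 0.5), which shows how the metric rewards abstaining over guessing wrongly.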

Future Directions

The paper concludes with several avenues for future improvement. The authors suggest that boosted trees have the capacity to learn from an even larger number of features, implying benefits from incorporating TF-IDF features or standardizing feature frequencies, which have proven effective in stylometry. Augmenting the training set with obfuscated samples could also enhance robustness. Further hyperparameter optimization for both feature sets and LGBM parameters is another promising direction. The main computational cost of their method lies in feature extraction on the large dataset, while classifier training and inference remain inexpensive. The researchers view their approach as a valuable trade-off, offering lower computational cost and greater explainability compared to some neural-based systems, while still striving for better generalization.