Decoding AI's Writing Style: A Benchmark for Stylistic Variation in LLM-Generated Texts

TLDR: This research introduces a benchmark using Biber’s Multidimensional Analysis to evaluate how well LLMs mimic human writing styles in English and Czech. It finds that LLMs exhibit consistent stylistic shifts, often performing better in English than in Czech, and that base models sometimes outperform instruction-tuned ones. The study highlights the impact of prompts and the potential for a stylistic divide where English benefits more from AI advancements than underrepresented languages.

Large Language Models (LLMs) now excel at tasks like programming and problem-solving. A recent study examines a less explored area: how well these models can mimic the rich and diverse stylistic variation found in human writing. The research, titled “Benchmark of stylistic variation in LLM-generated texts,” investigates whether LLMs can reproduce the nuances of different writing styles, or whether they tend to gravitate towards a generic “AI-language.”

The study, conducted by Jiří Milička, Anna Marklová, and Václav Cvrček, addresses several key questions: how effectively LLMs generate stylistically diverse texts across genres, whether there is a consistent “AI-language stylistic attractor,” how instruction-tuned models compare to base models, what impact different prompts have, and whether stylistic features depend on sampling temperature. Crucially, it also examines whether the stylistic shift is smaller in English than in languages underrepresented in training data.

Unpacking Stylistic Differences with Multidimensional Analysis

To measure stylistic variation, the researchers employed Biber’s Multidimensional Analysis (MDA), a well-established method in corpus linguistics. MDA identifies underlying stylistic dimensions based on co-occurring linguistic features. For English, they used a framework with six dimensions (e.g., Involved vs. Informational Production, Narrative vs. Non-narrative Discourse). For Czech, an adapted model with eight dimensions was used. This approach allows for an interpretable comparison, moving beyond simply counting word frequencies to understanding broader stylistic shifts.
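As a rough illustration of how a dimension score is typically computed in MDA, the sketch below standardizes per-text feature rates across a corpus and then, for each text, sums the z-scores of features that load positively on a dimension and subtracts those that load negatively. The function names, the two features, and the toy rates are hypothetical and chosen only for illustration; the study's actual feature sets and loadings are not reproduced here.

```python
from statistics import mean, stdev


def standardize(rates):
    """Convert per-text feature rates to z-scores across the corpus."""
    mu = mean(rates)
    sigma = stdev(rates)
    # A feature with no variation carries no stylistic signal.
    return [(r - mu) / sigma if sigma else 0.0 for r in rates]


def dimension_scores(feature_rates, positive, negative):
    """Score each text on one dimension.

    feature_rates: dict mapping feature name -> list of rates (one per text).
    positive / negative: feature names loading on each pole (assumed known).
    """
    z = {name: standardize(rates) for name, rates in feature_rates.items()}
    n_texts = len(next(iter(feature_rates.values())))
    scores = []
    for i in range(n_texts):
        pos = sum(z[f][i] for f in positive)
        neg = sum(z[f][i] for f in negative)
        scores.append(pos - neg)
    return scores


# Toy data: rates per 1,000 words for three texts (illustrative only).
toy_rates = {
    "first_person_pronouns": [10, 20, 30],
    "nouns": [300, 200, 100],
}
print(dimension_scores(toy_rates, ["first_person_pronouns"], ["nouns"]))
```

In this toy corpus the first text, with few pronouns and many nouns, lands at the low end of the dimension and the third text at the high end.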

The study utilized two new LLM-generated corpora: AI-Brown for English and AI-Koditex for Czech. These were designed to be directly comparable to human-written texts. The process involved taking the first 500 words of original human texts as prompts for LLMs to continue, with the second part of the original text serving as a reference. The stylistic consistency was then measured by comparing the LLM-generated continuations to the human-written references.
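The continuation setup described above can be sketched as follows. This is a minimal sketch under stated assumptions: whitespace tokenization, the helper names, and the use of a simple per-dimension difference (generated minus reference) as the shift are all illustrative choices, not the authors' documented procedure.

```python
def split_prompt_reference(text, prompt_words=500):
    """Split a human text into a prompt (first N words) and a reference (the rest).

    Whitespace tokenization is an assumption made for this sketch.
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])
    reference = " ".join(words[prompt_words:])
    return prompt, reference


def stylistic_shift(generated_scores, reference_scores):
    """Per-dimension difference between a generated continuation and its reference.

    Both arguments map a dimension name to that text's score on the dimension.
    """
    return {
        dim: generated_scores[dim] - reference_scores[dim]
        for dim in reference_scores
    }


# Illustrative usage with a synthetic 600-word "text" and made-up scores.
text = " ".join(f"w{i}" for i in range(600))
prompt, reference = split_prompt_reference(text)
shift = stylistic_shift({"narrative": 1.5}, {"narrative": 0.5})
```

A shift near zero on a dimension would mean the model's continuation stays close to the style of the human reference on that dimension; a large positive or negative value would indicate drift.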

English Language Findings: A Mixed Bag of Stylistic Imitation

In English, the models showed a tendency to shift texts in similar directions. For example, many models moved texts towards “informational production” (less involved) and “explicit reference” (more direct, less situation-dependent). Some stylistic dimensions, like “narrative vs. non-narrative discourse” and “on-line elaboration of information,” were relatively easy for models to reproduce. However, “situation-dependent reference vs. explicit reference” proved consistently difficult, with many models showing strong shifts towards explicit reference.

Interestingly, the study found that sampling temperature (which controls randomness in text generation) did not have a significant systematic effect on stylistic variation. Models with different temperature settings often clustered together stylistically. More notably, older base models like davinci-002 and LLaMA 3.1 Completion often performed better stylistically than their instruction-tuned counterparts, suggesting that additional tuning might sometimes flatten stylistic diversity. The type of prompt also played a significant role; using the leaked ChatGPT default system prompt led to a distinct stylistic profile compared to a minimalist prompt.
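For readers unfamiliar with sampling temperature, the sketch below shows the usual formulation: logits are divided by the temperature before the softmax, so lower values concentrate probability on the top tokens and higher values flatten the distribution. The function name and the example logits are hypothetical, and the sketch describes the general mechanism rather than any specific model used in the study.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Here `sharp` puts more probability on the first token than `flat` does, which is the sense in which temperature controls randomness.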

Czech Language Findings: The Challenge of Underrepresentation

The results for Czech painted a different picture, highlighting the challenges LLMs face with languages less represented in their training data. Compared to English, there was a much higher overall stylistic shift in Czech texts generated by LLMs. This indicates that models struggled more to mimic the original human styles in Czech. Some models even failed to stylistically imitate the corpus across all eight dimensions, a phenomenon not observed in English. Furthermore, several models were unable to generate coherent Czech text at all.

As in English, the shifts in Czech tended to move in consistent directions across most models. Dimensions like “higher level of cohesion,” “general/intension,” and “prospective” generally shifted towards their positive poles, while others like “polythematic,” “higher amount of addressee coding,” and “attitudinal” shifted negatively. The study also noted that no truly functional base model exists for Czech, further emphasizing the English-centric nature of current LLMs.

Implications for the Future of Language

The research concludes by establishing a stylistic benchmark for evaluating LLMs. It confirms the existence of “AI-language stylistic attractors”: consistent patterns towards which most models gravitate. The findings underscore that while LLMs can produce stylistically diverse texts, there are substantial differences between models and genres, so users often need to experiment with different models and prompts to achieve the desired stylistic outcome.

A significant broader implication is the potential for a “stylistic divide” between English and underrepresented languages. English, already dominant in LLM training data, stands to benefit further from AI tools that perform best in it, potentially leading to even greater stylistic alignment with human writing. Conversely, smaller languages risk being sidelined, drifting towards more “AI-specific” styles due to weaker LLM support and a feedback loop of less use and less training data. This study, available at arXiv:2509.10179, calls for continued monitoring of new models and replication of the study with other underrepresented languages to ensure a more balanced future for AI-assisted language generation.
