Decoding AI’s Writing Style: A Benchmark for Stylistic Variation in LLM-Generated Texts

TLDR: This research introduces a benchmark using Biber’s Multidimensional Analysis to evaluate how well LLMs mimic human writing styles in English and Czech. It finds that LLMs exhibit consistent stylistic shifts, often performing better in English than in Czech, and that base models sometimes outperform instruction-tuned ones. The study highlights the impact of prompts and the potential for a stylistic divide where English benefits more from AI advancements than underrepresented languages.

Large Language Models (LLMs) excel at tasks like programming and problem-solving. A recent study examines a less explored question: how well these models can mimic the rich and diverse stylistic variation found in human writing. The research, titled “Benchmark of stylistic variation in LLM-generated texts,” investigates whether LLMs can reproduce the nuances of different writing styles, or whether they tend to gravitate towards a generic “AI-language.”

The study, conducted by Jiří Milička, Anna Marklová, and Václav Cvrček, addresses several key questions. These include how effectively LLMs generate stylistically diverse texts across genres, whether there is a consistent “AI-language stylistic attractor,” how instruction-tuned models compare to base models, the impact of different prompts, and whether stylistic features depend on sampling temperature. Crucially, it also examines whether the stylistic shift is smaller in English than in languages underrepresented in training data.

Unpacking Stylistic Differences with Multidimensional Analysis

To measure stylistic variation, the researchers employed Biber’s Multidimensional Analysis (MDA), a well-established method in corpus linguistics. MDA identifies underlying stylistic dimensions based on co-occurring linguistic features. For English, they used a framework with six dimensions (e.g., Involved vs. Informational Production, Narrative vs. Non-narrative Discourse). For Czech, an adapted model with eight dimensions was used. This approach allows for an interpretable comparison, moving beyond simply counting word frequencies to understanding broader stylistic shifts.
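As a rough illustration of how a single MDA dimension score can be computed, the sketch below standardizes feature rates across a corpus and then subtracts the summed z-scores of negatively loading features from those of positively loading ones. The feature names, the tiny corpus, and the function names are assumptions made for this example; they are not the feature set or the code used in the study.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of feature rates against the corpus mean and std."""
    mu = mean(values)
    sigma = pstdev(values)
    # A feature that never varies carries no stylistic signal: score it as 0.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def dimension_scores(corpus, positive, negative):
    """Return one dimension score per text.

    corpus   : list of dicts mapping feature name -> rate per 1,000 words
    positive : features loading on the positive pole of the dimension
    negative : features loading on the negative pole of the dimension
    """
    z = {f: zscores([text[f] for text in corpus]) for f in positive + negative}
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(len(corpus))
    ]


# Hypothetical feature rates for three texts (illustrative numbers only).
corpus = [
    {"first_person_pronouns": 40, "contractions": 30, "nouns": 150, "prepositions": 80},
    {"first_person_pronouns": 20, "contractions": 15, "nouns": 200, "prepositions": 110},
    {"first_person_pronouns": 0, "contractions": 0, "nouns": 250, "prepositions": 140},
]

scores = dimension_scores(
    corpus,
    positive=["first_person_pronouns", "contractions"],
    negative=["nouns", "prepositions"],
)
```

In this toy corpus the first text scores highest on the dimension and the third lowest, which mirrors how a conversational text and a dense expository text would fall on opposite poles of an involved vs. informational dimension.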

The study utilized two new LLM-generated corpora: AI-Brown for English and AI-Koditex for Czech. These were designed to be directly comparable to human-written texts. The process involved taking the first 500 words of original human texts as prompts for LLMs to continue, with the second part of the original text serving as a reference. The stylistic consistency was then measured by comparing the LLM-generated continuations to the human-written references.
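The prompt-and-reference setup described above can be sketched in a few lines. This is a minimal sketch that assumes simple whitespace tokenization; the function name and the `prompt_words` parameter are hypothetical, and the study's own word-counting rules may differ.

```python
def split_prompt_and_reference(text, prompt_words=500):
    """Split a human-written text into an LLM prompt and a held-out reference.

    The first `prompt_words` whitespace-separated words become the prompt the
    model is asked to continue; everything after them is kept as the
    human-written reference for later comparison.
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])
    reference = " ".join(words[prompt_words:])
    return prompt, reference


# Toy document of 1,200 placeholder "words".
document = " ".join(f"w{i}" for i in range(1200))
prompt, reference = split_prompt_and_reference(document)
```

The model's continuation of `prompt` would then be analyzed with the same stylistic measures as `reference`, so that both come from the same starting context.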

English Language Findings: A Mixed Bag of Stylistic Imitation

In English, the models showed a tendency to shift texts in similar directions. For example, many models moved texts towards “informational production” (less involved) and “explicit reference” (more direct, less situation-dependent). Some stylistic dimensions, like “narrative vs. non-narrative discourse” and “on-line elaboration of information,” were relatively easy for models to reproduce. However, “situation-dependent reference vs. explicit reference” proved consistently difficult, with many models showing strong shifts towards explicit reference.
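One simple way to picture a "shift" along a dimension is the average signed difference between the score of each generated continuation and the score of its human-written reference. The sketch below does exactly that; it is an assumption about how such a shift could be summarized, not the metric reported in the paper, and the dimension keys and numbers are placeholders.

```python
from statistics import mean


def mean_shift(generated, reference):
    """Average signed shift per dimension over paired texts.

    generated, reference : lists of dicts mapping dimension name -> score,
                           where generated[i] continues the same prompt that
                           reference[i] originally followed.
    """
    dimensions = reference[0].keys()
    return {
        d: mean(g[d] - r[d] for g, r in zip(generated, reference))
        for d in dimensions
    }


# Placeholder scores for two text pairs on two hypothetical dimensions.
generated = [{"dim1": -2.0, "dim3": 3.0}, {"dim1": 0.0, "dim3": 1.0}]
reference = [{"dim1": 1.0, "dim3": 1.0}, {"dim1": 1.0, "dim3": 0.0}]

shift = mean_shift(generated, reference)
```

A value near zero means the model reproduced that dimension well on average, while a consistently signed value across models is the kind of pattern the article describes as a shared direction of shift.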

Interestingly, the study found that sampling temperature (which controls randomness in text generation) did not have a significant systematic effect on stylistic variation. Models with different temperature settings often clustered together stylistically. More notably, older base models like davinci-002 and LLaMA 3.1 Completion often performed better stylistically than their instruction-tuned counterparts, suggesting that additional tuning might sometimes flatten stylistic diversity. The type of prompt also played a significant role; using the leaked ChatGPT default system prompt led to a distinct stylistic profile compared to a minimalist prompt.
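For readers unfamiliar with sampling temperature, the sketch below shows the standard way it reshapes a model's next-token distribution: logits are divided by the temperature before the softmax. The logits are made-up values, and this illustrates the general mechanism only, not any specific model or API used in the study.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, scaled by a sampling temperature.

    Temperatures below 1 sharpen the distribution (more deterministic output);
    temperatures above 1 flatten it (more random output).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

Here `sharp` concentrates more probability on the top token than `flat` does. The study's finding is that, despite this change in randomness, the resulting texts did not differ systematically in style.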

Czech Language Findings: The Challenge of Underrepresentation

The results for Czech painted a different picture, highlighting the challenges LLMs face with languages less represented in their training data. Compared to English, there was a much higher overall stylistic shift in Czech texts generated by LLMs. This indicates that models struggled more to mimic the original human styles in Czech. Some models even failed to stylistically imitate the corpus across all eight dimensions, a phenomenon not observed in English. Furthermore, several models were unable to generate coherent Czech text at all.

Similar to English, the shifts in Czech tended to move in consistent directions across most models, with dimensions like “higher level of cohesion,” “general/intension,” and “prospective” generally shifting towards their positive poles, while others like “polythematic,” “higher amount of addressee coding,” and “attitudinal” shifted negatively. The study also noted that no truly functional base model exists for Czech, further emphasizing the English-centric nature of current LLMs.

Implications for the Future of Language

The research concludes by establishing a valuable stylistic benchmark for evaluating LLMs. It confirms the existence of “AI-language stylistic attractors” – consistent patterns towards which most models gravitate. The findings underscore that while LLMs can produce stylistically diverse texts, there are substantial differences between models and genres. Users often need to experiment with different models and prompts to achieve desired stylistic outcomes.

A significant broader implication is the potential for a “stylistic divide” between English and underrepresented languages. English, already dominant in LLM training data, stands to benefit further from AI tools that perform best in it, potentially leading to even greater stylistic alignment with human writing. Conversely, smaller languages risk being sidelined, drifting towards more “AI-specific” styles due to weaker LLM support and a feedback loop of less use and less training data. This study, available at arXiv:2509.10179, calls for continued monitoring of new models and replication of the study with other underrepresented languages to ensure a more balanced future for AI-assisted language generation.

Rhea Bhattacharya
https://blogs.edgentiq.com
Rhea Bhattacharya is an AI correspondent with a keen eye for cultural, social, and ethical trends in Generative AI. With a background in sociology and digital ethics, she delivers high-context stories that explore the intersection of AI with everyday lives, governance, and global equity. Her news coverage is analytical, human-centric, and always ahead of the curve. You can reach her at: [email protected]
