TLDR: Researchers Jiří Milička, Anna Marklová, and Václav Cvrček have developed two new large language model (LLM)-generated text corpora, AI-Brown (English) and AI Koditex (Czech). These corpora are designed to be directly comparable to traditional human-written corpora, enabling detailed linguistic analysis of AI-generated text versus human text, and comparisons across various LLMs. Generated using models from major AI developers and extensively annotated, these freely available resources address the need for systematic data in corpus linguistics to study AI language characteristics like formulaicity, stylistic variability, and lexical diversity.
As artificial intelligence develops, understanding the language generated by large language models (LLMs) matters more and more. A new research paper introduces two corpora, AI-Brown and AI Koditex, built for comparing human-written texts with those produced by LLMs in both English and Czech. The initiative aims to fill a gap in linguistic research by providing systematically collected and annotated data for in-depth analysis.
The motivation behind these new corpora stems from the difficulty researchers face in finding suitable LLM-generated data that is directly comparable to human-authored texts. Traditional linguistic studies often adapt psycholinguistic experiments, but corpus linguistics, with its quantitative and methodology-driven approach, has been underexplored in this domain. The AI-Brown and AI Koditex corpora address this by replicating the design principles of established human-created corpora: the BE21 corpus for English and the Koditex corpus for Czech, both rooted in the Brown Corpus tradition.
How the Corpora Were Created
To ensure comparability, the researchers employed a clever generation procedure. They took existing human-written texts from the BE21 and Koditex corpora and split each text chunk. The first 500 words served as a prompt for the LLMs, while the remaining text was reserved as a human-authored reference for evaluation. This method ensures that the models receive sufficient context for generation while providing a direct human counterpart for comparison.
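The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name `split_prompt_and_reference` is hypothetical, and counting words by splitting on whitespace is an assumption, since the summary does not say how the 500 words were counted.

```python
def split_prompt_and_reference(text: str, prompt_words: int = 500) -> tuple[str, str]:
    """Return (prompt, reference): the first `prompt_words` words and the rest.

    Assumption: a "word" is a whitespace-separated token.
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])
    reference = " ".join(words[prompt_words:])
    return prompt, reference


# Synthetic 1200-word "text" standing in for a corpus chunk.
sample = " ".join(f"w{i}" for i in range(1, 1201))
prompt, reference = split_prompt_and_reference(sample)
print(len(prompt.split()), len(reference.split()))
```

The prompt goes to the model, and the model's continuation can then be compared with `reference`, the human-written remainder of the same chunk.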
A wide array of frontier and widely used LLMs was employed for text generation, including models from OpenAI (GPT-3, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.5), Anthropic (Claude 3 Opus, Claude 3 Haiku, Claude 3.5 Sonnet), Alphabet (Gemini 1.5 Pro, Gemini 2.0 Pro, Gemini 2.0 Flash), Meta (Llama 3.1 405B), and DeepSeek (DeepSeek-V3). The selection prioritized closed models accessible via APIs, as these are often modified or discontinued, making their contemporaneous capture vital for historical archiving. Open-access models like Llama 3.1 and DeepSeek-V3 were also included because of the significant public interest in them.
The generation process involved two main sampling temperatures: T=0 for deterministic output and T=1 for more variable, probabilistic generation. For instruction-tuned models, minimal system prompts were used to guide them towards continuing the text rather than summarizing or answering questions. A notable challenge arose with Czech generation, where some models required English system prompts and occasionally produced mixed-language outputs despite explicit instructions.
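To make the two temperature settings concrete, here is a toy sketch of temperature sampling over a handful of logits. It assumes the common convention that T=0 means greedy (argmax) decoding and that T>0 divides the logits by T before the softmax. The hosted models in the study apply temperature on the provider's side, so this shows the general idea only, not any vendor's implementation. The logits and the function name are made up.

```python
import math
import random


def sample_with_temperature(logits: list[float], temperature: float,
                            rng: random.Random) -> int:
    """Pick a token index from `logits` at the given sampling temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    # random.choices normalizes the weights, so this is a softmax draw.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
rng = random.Random(0)
greedy = [sample_with_temperature(logits, 0.0, rng) for _ in range(5)]
sampled = [sample_with_temperature(logits, 1.0, rng) for _ in range(1000)]
print(greedy, sorted(set(sampled)))
```

At T=0 the same token wins every time, which is consistent with deterministic runs being more prone to repetitive loops. At T=1 lower-probability tokens are also drawn.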
Processing and Annotation
After generation, the raw JSON outputs were meticulously processed. This included cleaning the texts to remove meta-preambles (like “Certainly! Here is a continuation in a similar style:”) and refusal rationales from instruction-tuned models. The texts were then linguistically annotated according to the Universal Dependencies standard, providing tokenization, lemmatization, morphological, and syntactic information. These annotated texts are available in CoNLL-U and verticalized formats, making them compatible with widely used corpus engines and interfaces.
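For readers who have not worked with CoNLL-U, the sketch below reads the format with the standard library only. It relies on the documented layout of ten tab-separated columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and a blank line between sentences. The sample sentence is invented, and the `Token` type and `read_conllu` function are hypothetical helpers, not part of the released corpora.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str


def read_conllu(text: str) -> Iterator[list[Token]]:
    """Yield one list of Tokens per sentence in a CoNLL-U string."""
    sentence: list[Token] = []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        # Skip multiword ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        head = int(cols[6]) if cols[6].isdigit() else -1
        sentence.append(Token(cols[1], cols[2], cols[3], head, cols[7]))
    if sentence:
        yield sentence


# Invented example sentence, built with explicit tab separators.
rows = [
    ["1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "cat", "cat", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "sleeps", "sleep", "VERB", "_", "_", "0", "root", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
sample = "# text = The cat sleeps.\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

sentences = list(read_conllu(sample))
print([tok.lemma for tok in sentences[0]])
```

Lemma and part-of-speech columns like these are what make lemma-based and syntactic queries possible in corpus tools.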
Key Findings and Insights
The research revealed several interesting patterns in LLM-generated language:
- Text Coherence: While most English generations were coherent, weaker models and Czech generations often struggled, exhibiting issues like thematic drift or repetitive loops, especially at deterministic temperatures (T=0). Older models like davinci-002 and Llama 3.1 in Czech were particularly prone to these issues.
- Formulaicity: LLM-generated texts showed a higher degree of formulaicity compared to human texts. For instance, the phrase “a testament to the” was significantly more frequent in AI-Brown than in the human BE21 corpus, primarily emerging from instruction-tuned systems across different companies. In Czech, certain 4-grams were highly repetitive, especially in outputs from Gemini models.
- Stylistic Variability: The study explored whether LLMs could produce stylistically diverse texts. It found that base models generally performed better than instruction-tuned ones in mimicking stylistic variations, and performance was superior in English, likely due to its dominance in training data. While models showed distinct stylistic shifts, some dimensions (e.g., narrativity) were easier to approximate than others.
- Lexical Diversity: As hypothesized, higher sampling temperatures (T=1) generally led to greater lexical diversity and richer vocabularies compared to deterministic generation (T=0), a pattern consistent across most models.
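Two of the measurements above, 4-gram frequency and lexical diversity, can be approximated with simple counts. The sketch below is an assumption-laden illustration: the two token lists are invented toy strings, not corpus data, and the type-token ratio is only one of several diversity measures, so it may differ from what the paper actually used. All function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int = 4) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def per_million(count: int, total_tokens: int) -> float:
    """Normalize a raw count so corpora of different sizes can be compared."""
    return count / total_tokens * 1_000_000


def type_token_ratio(tokens: list[str]) -> float:
    """Distinct word forms divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented toy "corpora" of equal length (13 tokens each).
ai_tokens = "it is a testament to the power and a testament to the craft".split()
human_tokens = "it is a sign of the power and a mark of the craft".split()

phrase = ("a", "testament", "to", "the")
ai_hits = ngram_counts(ai_tokens)[phrase]
human_hits = ngram_counts(human_tokens)[phrase]
print(ai_hits, human_hits)
print(type_token_ratio(ai_tokens), type_token_ratio(human_tokens))
```

Type-token ratio falls as texts get longer, so it should be computed on samples of equal length. The fixed-size chunks in a Brown-style design make that straightforward.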
Accessibility and Future Development
The AI-Brown and AI Koditex corpora are freely available for download under open licenses from the LINDAT/CLARIAH-CZ linguistic repository. They are also searchable through the Czech National Corpus’s KonText interface, allowing researchers to query by lemmas, morphological features, and syntactic relations. The project aims to continuously expand these corpora, serving as a dynamic archive and a “museum” of large language models as new ones emerge and older ones disappear or are modified.
This initiative significantly lowers the barrier for researchers from various linguistic fields to study AI-generated language, enabling them to test hypotheses without the extensive effort of assembling ad hoc datasets. For more details, you can refer to the full research paper: AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts.


