Comparing Human and AI Language: Introducing AI Brown and AI Koditex

TLDR: Researchers Jiří Milička, Anna Marklová, and Václav Cvrček have developed two new corpora of text generated by large language models (LLMs): AI Brown (English) and AI Koditex (Czech). These corpora are designed to be directly comparable to traditional human-written corpora, enabling detailed linguistic analysis of AI-generated text versus human text, as well as comparisons across LLMs. Generated using models from major AI developers and extensively annotated, these freely available resources address the need for systematic data in corpus linguistics to study AI language characteristics like formulaicity, stylistic variability, and lexical diversity.

As large language models (LLMs) spread, linguists need reliable ways to study the language these systems produce. A new research paper introduces two corpora, AI Brown and AI Koditex, built for comparing human-written texts with those produced by LLMs in both English and Czech. The project aims to close a gap in linguistic research by providing systematically collected and annotated data for in-depth analysis.

The motivation behind these new corpora stems from the difficulty researchers face in finding suitable LLM-generated data that is directly comparable to human-authored texts. Linguistic studies of LLMs often adapt psycholinguistic experiments, while corpus linguistics, with its quantitative and methodology-driven approach, has been underexplored in this domain. The AI Brown and AI Koditex corpora address this by replicating the design principles of established human-created corpora: the BE21 corpus for English and the Koditex corpus for Czech, both rooted in the Brown Corpus tradition.

How the Corpora Were Created

To ensure comparability, the researchers took human-written text chunks from the BE21 and Koditex corpora and split each one in two. The first 500 words served as a prompt for the LLMs, while the remaining text was reserved as a human-authored reference for evaluation. This method gives the models sufficient context for generation and gives every generated text a direct human counterpart for comparison.
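The split described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function name `split_prompt_and_reference` is hypothetical, and counting words by whitespace is an assumption, since the article does not say how the 500 words were delimited.

```python
def split_prompt_and_reference(text: str, prompt_words: int = 500) -> tuple[str, str]:
    """Split a text chunk into an LLM prompt and a held-out human reference.

    Assumption: "words" are whitespace-separated tokens.
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])
    reference = " ".join(words[prompt_words:])
    return prompt, reference


# Toy chunk of 2000 placeholder words standing in for a corpus text.
sample_chunk = " ".join(f"w{i}" for i in range(2000))
prompt, reference = split_prompt_and_reference(sample_chunk)
print(len(prompt.split()), len(reference.split()))  # 500 1500
```

The prompt goes to the model, and the model's continuation can then be compared against the reference half written by a human.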

The texts were generated with a wide array of frontier and widely used LLMs, including models from OpenAI (GPT-3, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.5), Anthropic (Claude 3 Opus, Claude 3 Haiku, Claude 3.5 Sonnet), Alphabet (Gemini 1.5 Pro, Gemini 2.0 Pro, Gemini 2.0 Flash), Meta (Llama 3.1 405B), and DeepSeek (DeepSeek-V3). The selection prioritized closed models accessible via APIs, as these are often modified or discontinued, making their contemporaneous capture vital for historical archiving. Open-access models like Llama 3.1 and DeepSeek-V3 were also included due to their significant public interest.

The generation process involved two main sampling temperatures: T=0 for deterministic output and T=1 for more variable, probabilistic generation. For instruction-tuned models, minimal system prompts were used to guide them towards continuing the text rather than summarizing or answering questions. A notable challenge arose with Czech generation, where some models required English system prompts and occasionally produced mixed-language outputs despite explicit instructions.
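The difference between the two temperatures can be illustrated with a toy sampler. The sketch below is conceptual and makes several assumptions: the vocabulary and scores in `toy_logits` are invented, T=0 is treated as greedy (argmax) decoding, which is the usual convention, and the function `sample_token` is hypothetical. It does not reproduce how any particular provider's API implements sampling.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick the next token from raw scores at a given sampling temperature."""
    if temperature == 0:
        # Assumption: T=0 means greedy decoding, always the top-scoring token.
        return max(logits, key=logits.get)
    # Divide scores by T, then apply a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = [math.exp(value - peak) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Invented scores for three candidate tokens.
toy_logits = {"the": 2.0, "a": 1.5, "testament": 0.5}
rng = random.Random(0)

greedy = [sample_token(toy_logits, 0, rng) for _ in range(5)]
varied = [sample_token(toy_logits, 1.0, rng) for _ in range(200)]
print(set(greedy))       # a single token type
print(len(set(varied)))  # more than one token type
```

At T=0 the same prompt always yields the same continuation in this toy setting, while at T=1 lower-scoring tokens are drawn in proportion to their probability.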

Processing and Annotation

After generation, the raw JSON outputs were cleaned to remove meta-preambles (like “Certainly! Here is a continuation in a similar style:”) and refusal rationales from instruction-tuned models. The texts were then linguistically annotated according to the Universal Dependencies standard, which provides tokenization, lemmatization, and morphological and syntactic information. The annotated texts are available in CoNLL-U and verticalized formats, making them compatible with widely used corpus engines and interfaces.
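For readers unfamiliar with CoNLL-U, the following minimal Python reader shows what the format holds: one token per line, ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. The sample sentence and the function name `read_conllu` are made up for illustration, and the sketch skips multiword-token ranges and empty nodes rather than handling them. Dedicated libraries exist for real work.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc")


def read_conllu(text: str) -> list[list[dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                  # skip ranges like 1-2 and empty nodes like 1.1
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Invented example sentence, built with explicit tabs.
rows = [
    ["1", "Dogs", "dog", "NOUN", "NNS", "Number=Plur", "2", "nsubj", "_", "_"],
    ["2", "bark", "bark", "VERB", "VBP", "Tense=Pres|VerbForm=Fin", "0", "root", "_", "SpaceAfter=No"],
    ["3", ".", ".", "PUNCT", ".", "_", "2", "punct", "_", "_"],
]
sample = "# text = Dogs bark.\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

sentences = read_conllu(sample)
lemmas = [tok["lemma"] for tok in sentences[0]]
print(lemmas)  # ['dog', 'bark', '.']
```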

Key Findings and Insights

The research revealed several interesting patterns in LLM-generated language:

Text Coherence: While most English generations were coherent, weaker models and Czech generations often struggled, exhibiting issues like thematic drift or repetitive loops, especially at the deterministic temperature (T=0). Older models such as davinci-002, as well as Llama 3.1 in Czech, were particularly prone to these issues.
Formulaicity: LLM-generated texts showed a higher degree of formulaicity than human texts. For instance, the phrase “a testament to the” was significantly more frequent in AI Brown than in the human BE21 corpus, primarily emerging from instruction-tuned systems across different companies. In Czech, certain 4-grams were highly repetitive, especially in outputs from Gemini models.
Stylistic Variability: The study explored whether LLMs could produce stylistically diverse texts. It found that base models generally performed better than instruction-tuned ones in mimicking stylistic variations, and performance was superior in English, likely due to its dominance in training data. While models showed distinct stylistic shifts, some dimensions (e.g., narrativity) were easier to approximate than others.
Lexical Diversity: As hypothesized, higher sampling temperatures (T=1) generally led to greater lexical diversity and richer vocabularies compared to deterministic generation (T=0), a pattern consistent across most models.
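Two of the measures behind these findings are easy to sketch in Python: how often a 4-gram such as “a testament to the” occurs, and how varied a text's vocabulary is. The article does not state which exact statistics the paper uses, so the plain type-token ratio below is a stand-in chosen for simplicity, the function names are hypothetical, and the toy sentence is invented. Note that a raw type-token ratio depends on text length, so it is only meaningful when the compared texts are of similar size.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int = 4) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def per_million(count: int, total_tokens: int) -> float:
    """Normalize a raw count to instances per million tokens."""
    return count / total_tokens * 1_000_000


def type_token_ratio(tokens: list[str]) -> float:
    """Distinct word forms divided by total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented toy text, not taken from either corpus.
tokens = "it is a testament to the power and a testament to the craft".split()
target = ("a", "testament", "to", "the")

counts = ngram_counts(tokens)
print(counts[target])                                  # 2
print(round(per_million(counts[target], len(tokens))))
print(round(type_token_ratio(tokens), 3))              # 0.692
```

Running the same functions over a generated text and its human reference gives a quick, if rough, side-by-side comparison of formulaicity and lexical diversity.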

Accessibility and Future Development

The AI Brown and AI Koditex corpora are freely available for download under open licenses from the LINDAT/CLARIAH-CZ linguistic repository. They are also searchable through the Czech National Corpus’s KonText interface, allowing researchers to query by lemmas, morphological features, and syntactic relations. The project aims to continuously expand these corpora, serving as a dynamic archive and a “museum” of large language models as new ones emerge and older ones disappear or are modified.

This initiative significantly lowers the barrier for researchers from various linguistic fields to study AI-generated language, enabling them to test hypotheses without the extensive effort of assembling ad hoc datasets. For more details, you can refer to the full research paper: AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts.
