Comparing Human and AI Language: Introducing AI Brown and AI Koditex

TLDR: Researchers Jiří Milička, Anna Marklová, and Václav Cvrček have developed two new large language model (LLM)-generated text corpora, AI-Brown (English) and AI Koditex (Czech). These corpora are designed to be directly comparable to traditional human-written corpora, enabling detailed linguistic analysis of AI-generated text versus human text, and comparisons across various LLMs. Generated using models from major AI developers and extensively annotated, these freely available resources address the need for systematic data in corpus linguistics to study AI language characteristics like formulaicity, stylistic variability, and lexical diversity.

In the rapidly evolving landscape of artificial intelligence, understanding the nuances of language generated by large language models (LLMs) is crucial. A new research paper introduces two corpora, AI-Brown and AI Koditex, which let researchers compare human-written texts with those produced by LLMs in both English and Czech. The initiative aims to fill a gap in linguistic research by providing systematically collected and annotated data for in-depth analysis.

The motivation behind these new corpora stems from the difficulty researchers face in finding suitable LLM-generated data that is directly comparable to human-authored texts. Traditional linguistic studies often adapt psycholinguistic experiments, but corpus linguistics, with its quantitative and methodology-driven approach, has been underexplored in this domain. The AI-Brown and AI Koditex corpora address this by replicating the design principles of established human-created corpora: the BE21 corpus for English and the Koditex corpus for Czech, both rooted in the Brown Corpus tradition.

How the Corpora Were Created

To ensure comparability, the researchers built the generation procedure around the source corpora themselves. They took existing human-written text chunks from the BE21 and Koditex corpora and split each chunk in two. The first 500 words served as a prompt for the LLMs, while the remaining text was reserved as a human-authored reference for evaluation. This method gives the models sufficient context for generation while providing a direct human counterpart for comparison.
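The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `split_prompt_and_reference` is hypothetical, and counting "words" by splitting on whitespace is an assumption, since the article does not say how the 500-word boundary was determined.

```python
def split_prompt_and_reference(text: str, prompt_words: int = 500) -> tuple[str, str]:
    """Split one human-written chunk into an LLM prompt and a held-out reference.

    Assumption: words are whitespace-separated tokens. The original
    procedure may have counted words differently.
    """
    words = text.split()
    prompt = " ".join(words[:prompt_words])      # given to the model as context
    reference = " ".join(words[prompt_words:])   # kept as the human counterpart
    return prompt, reference


if __name__ == "__main__":
    # Synthetic stand-in for a corpus chunk; real input would be a BE21 or Koditex text.
    chunk = " ".join(f"word{i}" for i in range(1200))
    prompt, reference = split_prompt_and_reference(chunk)
    print(len(prompt.split()), len(reference.split()))
```

Note that joining on single spaces discards the original line breaks and spacing; a version meant for real corpus work would probably need to preserve them.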

A wide array of frontier and widely used LLMs was employed for text generation, including models from OpenAI (GPT-3, GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, GPT-4.5), Anthropic (Claude 3 Opus, Claude 3 Haiku, Claude 3.5 Sonnet), Alphabet (Gemini 1.5 Pro, Gemini 2.0 Pro, Gemini 2.0 Flash), Meta (Llama 3.1 405B), and DeepSeek (DeepSeek-V3). The selection prioritized closed models accessible via APIs, as these are often modified or discontinued, making their contemporaneous capture vital for historical archiving. Open-access models like Llama 3.1 and DeepSeek-V3 were also included due to their significant public interest.

The generation process involved two main sampling temperatures: T=0 for deterministic output and T=1 for more variable, probabilistic generation. For instruction-tuned models, minimal system prompts were used to guide them towards continuing the text rather than summarizing or answering questions. A notable challenge arose with Czech generation, where some models required English system prompts and occasionally produced mixed-language outputs despite explicit instructions.
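To make the two temperature settings concrete, the sketch below shows how temperature is conventionally applied when picking the next token from a model's scores. It is a generic illustration and an assumption about the mechanism: the study called hosted models through their APIs, and the function `sample_token` and the example logits are hypothetical.

```python
import math
import random


def sample_token(logits: list[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index from raw scores (logits) at a given temperature.

    T=0 is treated as greedy decoding (always the highest-scoring token).
    T>0 divides the logits by T before the softmax, so T=1 samples from
    the model's unmodified distribution.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [1.0, 3.0, 2.0]  # made-up scores for a three-token vocabulary
    print([sample_token(logits, 0, rng) for _ in range(5)])    # same index every time
    print([sample_token(logits, 1.0, rng) for _ in range(5)])  # varies between runs
```

In practice, hosted APIs may not be perfectly repeatable even at T=0, so "deterministic" is best read as "greedy" rather than as a guarantee of identical output.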

Processing and Annotation

After generation, the raw JSON outputs were meticulously processed. This included cleaning the texts to remove meta-preambles (like “Certainly! Here is a continuation in a similar style:”) and refusal rationales from instruction-tuned models. The texts were then linguistically annotated according to the Universal Dependencies standard, providing tokenization, lemmatization, morphological, and syntactic information. These annotated texts are available in CoNLL-U and verticalized formats, making them compatible with widely used corpus engines and interfaces.
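The two steps above, cleaning and reading the annotated output, might look roughly like the following. This is a sketch under stated assumptions: the preamble pattern is a hypothetical heuristic (the article does not describe the actual cleaning rules), and `strip_preamble` and `read_conllu` are illustrative names. The ten tab-separated columns are those defined by the CoNLL-U format.

```python
import re

# Hypothetical heuristic: a first line that opens with a chatty acknowledgement
# and ends in a colon is treated as a meta-preamble.
PREAMBLE_RE = re.compile(r"^\s*(certainly|sure|of course)[!,.]?[^\n]*:\s*\n+", re.IGNORECASE)

# Column names follow the CoNLL-U specification.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def strip_preamble(text: str) -> str:
    """Remove one leading meta-preamble line, if the heuristic matches."""
    return PREAMBLE_RE.sub("", text, count=1)


def read_conllu(conllu: str) -> list[list[dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in conllu.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comments and metadata
            continue
        cols = line.split("\t")
        if len(cols) == len(CONLLU_FIELDS):
            current.append(dict(zip(CONLLU_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    raw = "Certainly! Here is a continuation in a similar style:\n\nThe rain fell."
    print(strip_preamble(raw))
    sample = "# text = Rain fell.\n1\tRain\train\tNOUN\t_\t_\t2\tnsubj\t_\t_\n2\tfell\tfall\tVERB\t_\t_\t0\troot\t_\t_\n"
    print([tok["lemma"] for tok in read_conllu(sample)[0]])
```

A regex like this will miss preambles phrased differently and cannot detect refusals, so a real cleaning pass would likely need more rules or manual checks.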

Key Findings and Insights

The research revealed several interesting patterns in LLM-generated language:

  • Text Coherence: While most English generations were coherent, weaker models and Czech generations often struggled, exhibiting issues like thematic drift or repetitive loops, especially at the deterministic temperature setting (T=0). Older models like davinci-002 and Llama 3.1 in Czech were particularly prone to these issues.

  • Formulaicity: LLM-generated texts showed a higher degree of formulaicity compared to human texts. For instance, the phrase “a testament to the” was significantly more frequent in AI-Brown than in the human BE21 corpus, primarily emerging from instruction-tuned systems across different companies. In Czech, certain 4-grams were highly repetitive, especially in outputs from Gemini models.

  • Stylistic Variability: The study explored whether LLMs could produce stylistically diverse texts. It found that base models generally performed better than instruction-tuned ones in mimicking stylistic variations, and performance was superior in English, likely due to its dominance in training data. While models showed distinct stylistic shifts, some dimensions (e.g., narrativity) were easier to approximate than others.

  • Lexical Diversity: As hypothesized, higher sampling temperatures (T=1) generally led to greater lexical diversity and richer vocabularies compared to deterministic generation (T=0), a pattern consistent across most models.
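Two of the measures in the list above, formulaicity and lexical diversity, can be approximated with simple counts. The sketch below is illustrative only: it uses n-gram frequency as a stand-in for formulaicity and the type-token ratio as a stand-in for lexical diversity, which are assumptions, since the article does not name the exact metrics the paper used. The function names and the sample sentence are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int = 4) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to instances per million tokens."""
    return count / corpus_size * 1_000_000 if corpus_size else 0.0


def type_token_ratio(tokens: list[str]) -> float:
    """Share of distinct word forms among all tokens (higher = more diverse)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    # Made-up sentence, not corpus data.
    tokens = "it is a testament to the city and a testament to the people".split()
    phrase = ("a", "testament", "to", "the")
    count = ngram_counts(tokens, n=4)[phrase]
    print(count, per_million(count, len(tokens)))
    print(type_token_ratio(tokens))
```

Normalizing to instances per million tokens makes counts comparable between corpora of different sizes. The type-token ratio, by contrast, falls as texts get longer, so it is only meaningful when the compared samples have similar lengths.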

Accessibility and Future Development

The AI-Brown and AI Koditex corpora are freely available for download under open licenses from the LINDAT/CLARIAH-CZ linguistic repository. They are also searchable through the Czech National Corpus’s KonText interface, allowing researchers to query by lemmas, morphological features, and syntactic relations. The project aims to continuously expand these corpora, serving as a dynamic archive and a “museum” of large language models as new ones emerge and older ones disappear or are modified.

This initiative significantly lowers the barrier for researchers from various linguistic fields to study AI-generated language, enabling them to test hypotheses without the extensive effort of assembling ad hoc datasets. For more details, you can refer to the full research paper: AI Brown and AI Koditex: LLM-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
