SeedPrints: Tracing Large Language Models Back to Their Initialization Seed

TLDR: SeedPrints is a novel LLM fingerprinting method that identifies models based on unique, seed-dependent token selection biases present from their random initialization, even before training. Unlike existing post-hoc techniques, SeedPrints provides a robust, “Galtonian” fingerprint that persists throughout the entire training lifecycle, remains effective under diverse training data shifts, and is resilient to common parameter modifications like fine-tuning and quantization. This allows for reliable provenance verification and attribution of LLMs from their “birth” to deployment.

In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) have become ubiquitous. However, with their widespread use comes a critical challenge: verifying their origin and attributing them to their creators. This is where LLM fingerprinting comes into play, a method designed to identify, trace, and attribute these complex AI models. Traditional fingerprinting techniques for LLMs often rely on characteristics that emerge only after a model has undergone extensive training, such as specific patterns in their training dynamics, data exposure, or hyperparameters. These methods are akin to identifying a person by their learned behaviors rather than an innate trait.

A new research paper introduces a more fundamental and intrinsic approach called SeedPrints. This method proposes a “Galtonian” fingerprint for LLMs, drawing an analogy to biological fingerprints, which are unique and permanent from birth. SeedPrints leverages the subtle, random biases present in an LLM from the very moment of its initialization, before any training begins. This means that, much like a human fingerprint, an LLM’s unique identity is imprinted at its “birth” and persists throughout its entire lifecycle.

The Core Idea: Initialization Biases as a Unique Signature

The central observation behind SeedPrints is that even untrained LLMs exhibit reproducible token selection biases. When given uniform random inputs, a newly initialized model doesn’t predict the next token uniformly. Instead, it shows a distinct preference for certain tokens over others. These preferences are not random noise; they are stable and measurable, and crucially, they depend entirely on the specific random seed used for the model’s initialization. Think of it as a subtle “accent” or “tendency” that a model has from the very start, determined by its initial setup.
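
To make this observation concrete, here is a minimal toy sketch in plain Python. It is not the paper's setup: the "model" is only a randomly initialized embedding table and output projection, and the names `init_model`, `mean_logits`, and `top_tokens`, along with the sizes `VOCAB` and `DIM`, are assumptions made for illustration. It shows that the average next-token scores under uniform random inputs are non-uniform and are fixed entirely by the initialization seed.

```python
import random

VOCAB, DIM = 50, 16  # toy sizes, assumed for illustration only

def init_model(seed):
    """Randomly initialize a toy 'model': an embedding table and an output projection."""
    rng = random.Random(seed)
    embed = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
    proj = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
    return embed, proj

def mean_logits(model, n_inputs=200, input_seed=0):
    """Average next-token scores over uniformly random input tokens."""
    embed, proj = model
    rng = random.Random(input_seed)
    totals = [0.0] * VOCAB
    for _ in range(n_inputs):
        h = embed[rng.randrange(VOCAB)]  # hidden state of one random input token
        for t in range(VOCAB):
            totals[t] += sum(w * x for w, x in zip(proj[t], h))
    return [s / n_inputs for s in totals]

def top_tokens(logits, k=5):
    """Indices of the k tokens the untrained model prefers most."""
    return sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]

a = mean_logits(init_model(seed=1))
b = mean_logits(init_model(seed=1))  # same seed: same preferences by construction
c = mean_logits(init_model(seed=2))  # different seed: a different preference profile

print("same seed, same profile:", a == b)
print("preferred tokens, seed 1:", top_tokens(a))
print("preferred tokens, seed 2:", top_tokens(c))
```

The same random inputs are used for all three models, so any difference between the profiles comes from the initialization seed alone.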

Remarkably, these seed-dependent biases are not erased by training. While training certainly reshapes a model’s overall prediction behavior, the underlying relative preferences among these “identity tokens” remain correlated with their initial state. This persistence allows SeedPrints to statistically detect a model’s lineage with high confidence, even as it undergoes various training stages and modifications.

How SeedPrints Works

The SeedPrints algorithm involves two main steps. First, it identifies a set of “identity indices”: the output dimensions (tokens or hidden states) that consistently receive the lowest scores across a series of random inputs. These are the tokens the model is “most unwilling” to predict, and their patterns are highly stable. Second, it performs a distribution correlation test, which compares the preference distributions over these identity indices between a base model and a suspicious model. By measuring a rank-based correlation (such as Kendall's tau) and comparing it against an uncorrelated baseline, SeedPrints can determine with statistical significance whether the two models share a common lineage.
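
The two steps can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes each model has already been reduced to one average score per output dimension, it uses a permutation test as a stand-in for the "uncorrelated baseline" (the exact test is not described here), and the helper names, the value of `k`, and the synthetic score vectors are all hypothetical.

```python
import random

def kendall_tau(x, y):
    """Kendall's tau (tau-a): concordant minus discordant pairs, over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def identity_indices(mean_scores, k):
    """Step 1: the k output dimensions with the lowest average score."""
    return sorted(range(len(mean_scores)), key=lambda i: mean_scores[i])[:k]

def lineage_test(base_scores, suspect_scores, k=20, n_perm=500, seed=0):
    """Step 2: rank correlation on the identity indices, with a permutation baseline."""
    idx = identity_indices(base_scores, k)
    base = [base_scores[i] for i in idx]
    susp = [suspect_scores[i] for i in idx]
    observed = kendall_tau(base, susp)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = susp[:]
        rng.shuffle(shuffled)  # destroys any real correlation
        if kendall_tau(base, shuffled) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value

# Synthetic average scores (assumed data, not real model outputs).
rng = random.Random(42)
base = [rng.gauss(0, 1) for _ in range(100)]
related = [s + rng.gauss(0, 0.05) for s in base]   # a slightly drifted copy of the base
unrelated = [rng.gauss(0, 1) for _ in range(100)]  # an independent model

tau_related, p_related = lineage_test(base, related)
tau_unrelated, p_unrelated = lineage_test(base, unrelated)
print("related:   tau=%.3f p=%.4f" % (tau_related, p_related))
print("unrelated: tau=%.3f p=%.4f" % (tau_unrelated, p_unrelated))
```

A rank-based statistic fits the idea described above because it ignores how much the scores shift in absolute terms and only asks whether their relative order is preserved.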

Unprecedented Robustness and Verifiability

The researchers conducted extensive experiments on LLaMA-style and Qwen-style models, demonstrating SeedPrints’ superior capabilities:

Birth Verification: SeedPrints can reliably distinguish between models that differ only in their initialization seed, effectively separating them “at birth.” Existing methods often fail at this early stage.
Persistence Through Training: The fingerprint remains detectable and verifiable across all training stages, from initialization through full pretraining and beyond.
Immunity to Data Shifts: Unlike many prior techniques that can be misled by changes in training data distribution, SeedPrints remains effective even when models are continually trained on vastly different datasets (e.g., synthetic children’s stories versus programming code). This indicates that it tracks intrinsic identity, not just data similarity.
Robustness to Parameter Modifications: Evaluated against a comprehensive benchmark (LeafBench) covering realistic deployment scenarios, SeedPrints showed near-perfect performance. It maintained its effectiveness under various parameter-altering techniques such as instruction tuning, general fine-tuning, parameter-efficient fine-tuning (PEFT), quantization, model merging, and distillation. These are common practices that often distort or weaken signals for other fingerprinting methods.
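
As a rough intuition for why a rank-based fingerprint can survive a modification such as quantization, the toy sketch below quantizes the output projection of a randomly initialized layer to 8 bits and checks how well the ordering of its average scores is preserved. Everything here is an assumption made for illustration (the toy layer, the symmetric uniform quantizer, the sizes, and the function names), and it says nothing about how real LLMs or the LeafBench evaluation behave.

```python
import random

VOCAB, DIM = 50, 16  # toy sizes, assumed for illustration only

def kendall_tau(x, y):
    """Kendall's tau (tau-a) between two equally long score lists."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

def quantize(weights, bits=8):
    """Symmetric uniform quantization of a weight matrix, followed by de-quantization."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for row in weights for w in row) / levels
    return [[round(w / scale) * scale for w in row] for row in weights]

def mean_logits(embed, proj, n_inputs=200, input_seed=0):
    """Average output scores over uniformly random input tokens."""
    rng = random.Random(input_seed)
    totals = [0.0] * VOCAB
    for _ in range(n_inputs):
        h = embed[rng.randrange(VOCAB)]
        for t in range(VOCAB):
            totals[t] += sum(w * x for w, x in zip(proj[t], h))
    return [s / n_inputs for s in totals]

rng = random.Random(1)
embed = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
proj = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]

original = mean_logits(embed, proj)
quantized = mean_logits(embed, quantize(proj))
tau = kendall_tau(original, quantized)
print("rank correlation after 8-bit quantization: %.3f" % tau)
```

In this toy, quantization nudges each score slightly, but most pairwise orderings stay the same, which is the property a rank correlation test relies on.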

The ability of SeedPrints to provide “birth-to-lifecycle” identity verification, akin to a biometric fingerprint, offers a powerful tool for model owners. It enables the detection of model theft and unauthorized reuse, and it provides a verifiable link between a suspicious model and its original source. This intrinsic and persistent identity tracking is crucial for provenance verification and copyright auditing in the age of advanced AI. More details are available in the full SeedPrints research paper.
