Preprint Platforms Expose Sensitive Data, Study Finds

TLDR: A new study, “You Have Been LaTeXpOsEd,” analyzed 100,000 arXiv submissions (1.2 TB of data) to uncover sensitive information leaks in preprint archives. Using a four-stage framework combining pattern matching, logical filtering, and large language models (LLMs), researchers found thousands of PII leaks, GPS-tagged EXIF files, publicly accessible cloud storage links, private SharePoint links, GitHub/Google credentials, cloud API keys, and confidential author communications. The study highlights significant security risks in scientific publishing and urges immediate action from researchers and repository operators to implement better sanitization and security measures.

The rapid pace of scientific discovery has been greatly accelerated by preprint repositories like arXiv, allowing researchers to share their findings almost immediately. However, a recent study titled “You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models” reveals a significant, often overlooked, security risk associated with these platforms. Beyond just the final PDF, these archives often provide unrestricted access to original source materials, including LaTeX files, auxiliary code, figures, and embedded comments, which can inadvertently disclose sensitive information.

Authored by Richard A. Dubniczky, Bertalan Borsos, and Norbert Tihanyi from Eötvös Loránd University, ZEISS Digital Innovation, and the Technology Innovation Institute, the research presents what the authors describe as the first large-scale security audit of preprint archives. The team analyzed over 1.2 TB of source data from 100,000 arXiv submissions and uncovered a wide array of hidden disclosures.

The LaTeXpOsEd Framework

To conduct this extensive audit, the researchers introduced LaTeXpOsEd, a sophisticated four-stage framework. This methodology integrates traditional pattern matching, logical filtering, and advanced large language models (LLMs) to systematically identify sensitive data within non-referenced files and LaTeX comments. The stages include: Scraping (collecting data from arXiv’s Amazon S3 storage), Parsing (structuring and cleaning source files, extracting LaTeX comments, and checking EXIF metadata in images), Data Mining (applying pattern matching and LLM-based extraction), and Analyzing (categorizing and interpreting findings).
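The comment-extraction step of the Parsing stage can be pictured with a short Python sketch. This is not the authors' implementation: the function name `extract_comments` is hypothetical, and the scanner makes the simplifying assumption that `%` always starts a comment unless it is escaped, so it ignores verbatim-style environments where `%` is literal.

```python
def extract_comments(tex: str) -> list[str]:
    """Return the text of every LaTeX comment in `tex` (illustrative helper)."""
    comments = []
    for line in tex.splitlines():
        i = 0
        while i < len(line):
            ch = line[i]
            if ch == "\\":
                # A backslash escapes the next character (\% or \\), so skip both.
                i += 2
                continue
            if ch == "%":
                text = line[i + 1:].strip()
                if text:  # drop empty comments used only for spacing
                    comments.append(text)
                break
            i += 1
    return comments


sample = r"""Accuracy is 95\% overall. % TODO: ask Bob for the server password
First row \\% after a line break
%
No comment on this line."""

for comment in extract_comments(sample):
    print(comment)
```

Walking the line character by character, rather than using a single regular expression, keeps the handling of `\%` and `\\%` explicit: the first is a literal percent sign, the second is a line break followed by a real comment.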

Uncovering Secrets with Traditional Methods

Initial analysis using conventional tools and pattern-matching techniques revealed a substantial amount of exposed information. Researchers extracted approximately 42,500 unique URLs and IP addresses from LaTeX comments. More critically, they identified over 700 links with token-like paths, suggesting direct access without further authentication. This included more than 30 CASA tokens, which grant free access to scientific publications, and nearly 650 links providing view or edit access to private files and folders on popular cloud services like Google Drive, Dropbox, and SharePoint. These leaked documents ranged from peer-review materials and internal correspondence to unreleased datasets, private letters, and spreadsheets containing PII or experimental measurements that sometimes differed from published results. The study also found AWS secret access keys, over 90 IBANs, P.O. box addresses, and phone numbers not intended for public release.
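The kind of pattern matching described above can be sketched in Python. The regular expressions, the host list, the 20-character threshold for a "token-like" path segment, and the helper names `find_share_links` and `iban_checksum_ok` are all assumptions made for illustration; the study's actual rules are not given in this article. The IBAN check uses the standard mod-97 checksum, which helps separate real account numbers from strings that merely look like one.

```python
import re
from urllib.parse import urlparse

# Assumed patterns for illustration only.
URL_RE = re.compile(r"https?://[^\s{}<>]+")
SHARE_HOSTS = ("drive.google.com", "docs.google.com", "dropbox.com", "sharepoint.com")
TOKEN_SEGMENT_RE = re.compile(r"[A-Za-z0-9_-]{20,}")
IBAN_SHAPE_RE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def find_share_links(comment: str) -> list[str]:
    """Return URLs on known file-sharing hosts whose path has a token-like segment."""
    hits = []
    for raw in URL_RE.findall(comment):
        url = raw.rstrip(".,;)")  # trim trailing sentence punctuation
        parsed = urlparse(url)
        host = parsed.hostname or ""
        on_share_host = any(host == h or host.endswith("." + h) for h in SHARE_HOSTS)
        has_token = any(TOKEN_SEGMENT_RE.fullmatch(seg) for seg in parsed.path.split("/"))
        if on_share_host and has_token:
            hits.append(url)
    return hits


def iban_checksum_ok(candidate: str) -> bool:
    """Validate an IBAN candidate with the mod-97 check."""
    compact = candidate.replace(" ", "").upper()
    if not IBAN_SHAPE_RE.fullmatch(compact):
        return False
    rearranged = compact[4:] + compact[:4]  # move country code and check digits to the end
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10, ..., Z=35
    return int(digits) % 97 == 1


comment = ("raw data: https://drive.google.com/file/d/1AbCdEfGhIjKlMnOpQrStUvWxYz012345/view, "
           "paper: https://example.org/paper.pdf.")
print(find_share_links(comment))
print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))
```

A match from a sketch like this only marks a candidate. Whether a link actually grants access without authentication still has to be confirmed separately.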

The Power of Large Language Models

The most significant and sophisticated findings emerged from the LLM-based analysis. Traditional tools like TruffleHog proved largely ineffective at detecting genuine secrets in comments and files, often flagging false positives due to their inability to interpret context. In contrast, LLMs, particularly Qwen-2.5 72B, excelled at understanding contextual nuances, leading to the discovery of real credential leaks, internal sensitive author discussions, and other highly sensitive data that would be impossible to detect with simpler methods.

The Qwen-2.5 72B model flagged 9,926 papers, for a total of 13,806 detections. The main categories were Personally Identifiable Information (6,984 instances), peer-review-related content (3,386), author conflicts (3,283), network identifiers (106), and, most critically, credentials such as passwords and API keys (47). The credentials were often accompanied by the URL of the corresponding login page, which significantly raises the risk. The researchers note that an LLM-based analysis of this scale could be performed for as little as $50, putting it within reach of malicious actors.
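The reported figures can be cross-checked with a few lines of Python. The numbers below are taken directly from the article; the variable names are illustrative, and the average per flagged paper is a derived value, not one reported by the study.

```python
# Detection counts per category, as reported for the Qwen-2.5 72B run.
detections = {
    "PII": 6984,
    "Peer review": 3386,
    "Author conflicts": 3283,
    "Network identifiers": 106,
    "Credentials": 47,
}
papers_flagged = 9926
papers_analyzed = 100_000

total = sum(detections.values())
flagged_share = papers_flagged / papers_analyzed
per_flagged_paper = total / papers_flagged

print(f"total detections: {total}")
print(f"share of papers flagged: {flagged_share:.1%}")
print(f"detections per flagged paper: {per_flagged_paper:.2f}")
```

The category counts add up to the stated 13,806 detections, and 9,926 flagged papers out of 100,000 is just under 10% of the corpus.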

Ethical Considerations and Recommendations

The researchers adhered to strict ethical guidelines, anonymizing all sensitive data and examples, and directly notifying affected authors when critical exposures were confirmed. Their goal was to highlight systemic risks, not to expose individuals.

Based on their findings, the study provides urgent recommendations for both authors and preprint platforms:

For Authors: Use tools like arXiv LaTeX Cleaner to remove comments and extraneous files, avoid uploading credentials to collaborative platforms, refrain from sharing cloud storage materials with edit permissions via signed URLs, use separate credentials for projects, and regularly audit project directories for sensitive data.
For Platforms: Provide clear warnings about public file availability, implement automated sanitization procedures (e.g., removing comments and unused files), consider withholding original source packages, and integrate credential and sensitive-data scanners at upload time.
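One of the sanitization steps mentioned above, spotting files that the paper never references, might look roughly like the following Python sketch. It is not how arXiv LaTeX Cleaner or any platform works: the helper `unreferenced_files`, the regular expression, and the choice to look only at `\input`, `\include`, and `\includegraphics` are assumptions. It also does not strip comments first, so a commented-out `\includegraphics` would still count as a reference.

```python
import re
from pathlib import PurePosixPath

# Matches \input{...}, \include{...} and \includegraphics[...]{...} (assumed subset).
REF_RE = re.compile(r"\\(?:includegraphics|input|include)\s*(?:\[[^\]]*\])?\s*\{([^}]+)\}")


def unreferenced_files(main_tex: str, files: list[str], main_name: str = "main.tex") -> list[str]:
    """Return the files in a submission that the main document never references."""
    # LaTeX allows extensions to be omitted, so compare paths without their suffix.
    referenced = {
        PurePosixPath(ref.strip()).with_suffix("").as_posix()
        for ref in REF_RE.findall(main_tex)
    }
    unused = []
    for name in files:
        if name == main_name:
            continue
        if PurePosixPath(name).with_suffix("").as_posix() not in referenced:
            unused.append(name)
    return unused


main_tex = r"""\documentclass{article}
\begin{document}
\input{sections/intro}
\includegraphics[width=\linewidth]{figs/plot.png}
\end{document}
"""
files = ["main.tex", "sections/intro.tex", "figs/plot.png",
         "notes/reviewer_reply.docx", "figs/old_plot.png"]

print(unreferenced_files(main_tex, files))
```

Files reported by such a check are candidates for removal before upload. Bibliography files, style files, and nested `\input` chains would need extra handling in a real tool.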

The study concludes that approximately 10% of the analyzed papers contained sensitive disclosures, with about 0.2% including critical information like login credentials and API tokens. This research underscores the urgent need for stronger community awareness, improved author practices, and systematic interventions at the platform level to safeguard privacy and security in open science.
