Preprint Platforms Expose Sensitive Data, Study Finds

TLDR: A new study, “You Have Been LaTeXpOsEd,” analyzed 100,000 arXiv submissions (1.2 TB of data) to uncover sensitive information leaks in preprint archives. Using a four-stage framework combining pattern matching, logical filtering, and large language models (LLMs), researchers found thousands of PII leaks, GPS-tagged EXIF files, publicly accessible cloud storage links, private SharePoint links, GitHub/Google credentials, cloud API keys, and confidential author communications. The study highlights significant security risks in scientific publishing and urges immediate action from researchers and repository operators to implement better sanitization and security measures.

Preprint repositories like arXiv have greatly accelerated scientific discovery by letting researchers share their findings almost immediately. However, a recent study titled “You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models” reveals a significant, often overlooked security risk associated with these platforms. In addition to the final PDF, these archives often provide unrestricted access to the original source materials, including LaTeX files, auxiliary code, figures, and embedded comments, which can inadvertently disclose sensitive information.

Authored by Richard A. Dubniczky, Bertalan Borsos, and Norbert Tihanyi from Eötvös Loránd University, ZEISS Digital Innovation, and the Technology Innovation Institute, the research presents the first large-scale security audit of preprint archives. The team analyzed over 1.2 TB of source data from 100,000 arXiv submissions and uncovered a wide array of hidden disclosures.

The LaTeXpOsEd Framework

To conduct this audit, the researchers introduced LaTeXpOsEd, a four-stage framework. The methodology combines traditional pattern matching, logical filtering, and large language models (LLMs) to systematically identify sensitive data in non-referenced files and LaTeX comments. The four stages are: Scraping (collecting data from arXiv’s Amazon S3 storage), Parsing (structuring and cleaning source files, extracting LaTeX comments, and checking EXIF metadata in images), Data Mining (applying pattern matching and LLM-based extraction), and Analyzing (categorizing and interpreting findings).
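The comment-extraction part of the Parsing stage can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `extract_comments`, the regular expression, and the sample text are assumptions made for illustration, and the heuristic ignores edge cases such as verbatim environments.

```python
import re

# An unescaped '%' starts a LaTeX comment; '\%' is a literal percent sign.
# The negative lookbehind skips any '%' preceded by a backslash.
# Simplification (assumption): verbatim environments and '\\%' are not
# handled specially.
COMMENT_RE = re.compile(r"(?<!\\)%(.*)$")


def extract_comments(latex_source: str) -> list[tuple[int, str]]:
    """Return (line_number, comment_text) pairs for non-empty comments."""
    comments = []
    for lineno, line in enumerate(latex_source.splitlines(), start=1):
        match = COMMENT_RE.search(line)
        if match:
            text = match.group(1).strip()
            if text:
                comments.append((lineno, text))
    return comments


# Hypothetical sample source, invented for this example.
sample = "\n".join([
    r"\section{Results}",
    r"Accuracy improved by 5\% over the baseline. % TODO: recheck with reviewer 2",
    r"% password for the shared drive: hunter2",
    r"\end{document}",
])

for lineno, text in extract_comments(sample):
    print(lineno, text)
```

The extracted comments would then be passed on to the pattern-matching and LLM-based steps of the Data Mining stage.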

Uncovering Secrets with Traditional Methods

Initial analysis using conventional tools and pattern-matching techniques revealed a substantial amount of exposed information. Researchers extracted approximately 42,500 unique URLs and IP addresses from LaTeX comments. More critically, they identified over 700 links with token-like paths, suggesting direct access without further authentication. This included more than 30 CASA tokens, which grant free access to scientific publications, and nearly 650 links providing view or edit access to private files and folders on popular cloud services like Google Drive, Dropbox, and SharePoint. These leaked documents ranged from peer-review materials and internal correspondence to unreleased datasets, private letters, and spreadsheets containing PII or experimental measurements that sometimes differed from published results. The study also found AWS secret access keys, over 90 IBANs, P.O. box addresses, and phone numbers not intended for public release.
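As a rough illustration of this kind of pattern matching, the Python sketch below flags URLs whose path contains a long token-like segment and validates IBAN candidates with the standard mod-97 checksum. The regular expressions, the 20-character threshold, and all function names are assumptions chosen for this example; the study's actual rules are not described here.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s{}<>\"']+")
# Hypothetical heuristic: a run of 20+ URL-safe characters that mixes
# letters and digits is treated as a possible access token.
TOKEN_RE = re.compile(r"[A-Za-z0-9_-]{20,}")
# Country code, two check digits, then 11 to 30 alphanumeric characters.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def looks_token_like(url: str) -> bool:
    """True if the URL path or query contains a long mixed letter/digit run."""
    parsed = urlparse(url)
    candidates = TOKEN_RE.findall(parsed.path + "?" + parsed.query)
    return any(
        re.search(r"[A-Za-z]", c) and re.search(r"\d", c) for c in candidates
    )


def iban_checksum_ok(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end,
    map letters to 10..35, and require the remainder to be 1."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def scan_comment(text: str) -> dict[str, list[str]]:
    """Collect token-like URLs and checksum-valid IBANs from one comment."""
    urls = URL_RE.findall(text)
    return {
        "token_like_urls": [u for u in urls if looks_token_like(u)],
        "ibans": [m for m in IBAN_RE.findall(text) if iban_checksum_ok(m)],
    }


# Hypothetical comment text; the IBAN is the widely published GB test example.
comment = (
    "data at https://drive.example.com/file/d/1aB3dE5fG7hI9jK1lM3nO5pQ7rS9tU/view "
    "and docs at https://example.org/papers/2024/index.html ; "
    "refund to GB82WEST12345698765432"
)
print(scan_comment(comment))
```

The checksum step matters because a bare regular expression matches many strings that merely look like IBANs; validating them cuts down false positives before any manual review.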

The Power of Large Language Models

The most significant findings emerged from the LLM-based analysis. Traditional tools like TruffleHog proved largely ineffective at detecting genuine secrets in comments and files, and often produced false positives because they cannot interpret context. In contrast, LLMs, particularly Qwen-2.5 72B, excelled at understanding contextual nuances, leading to the discovery of real credential leaks, sensitive internal author discussions, and other highly sensitive data that simpler methods would likely miss.

The Qwen-2.5 72B model flagged 9,926 papers, for a total of 13,806 detections. The LLM-identified categories were Personally Identifiable Information (PII, 6,984 instances), peer-review-related content (3,386 instances), author conflicts (3,283 instances), network identifiers (106 instances), and, most critically, credentials such as passwords and API keys (47 instances). These credentials were often accompanied by the URL of the corresponding login page, which significantly escalates the risk. The researchers noted that an LLM-based analysis of this scale could be performed for as little as $50, putting it within reach of malicious actors.
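The reported figures can be cross-checked with a few lines of Python. The numbers are taken from the article; the variable names are illustrative only.

```python
# Detection counts per category, as reported in the article.
detections = {
    "PII": 6984,
    "Peer review": 3386,
    "Author conflicts": 3283,
    "Network identifiers": 106,
    "Credentials": 47,
}

total = sum(detections.values())  # matches the reported 13,806 detections
flagged_papers = 9926
analyzed_papers = 100_000

flagged_share = flagged_papers / analyzed_papers  # fraction of papers flagged
per_flagged_paper = total / flagged_papers        # average detections per flagged paper

print(f"total detections: {total}")
print(f"flagged share: {flagged_share:.1%}")
print(f"detections per flagged paper: {per_flagged_paper:.2f}")
for name, count in detections.items():
    print(f"{name}: {count / total:.1%} of detections")
```

The category counts sum exactly to the reported total, and the flagged share of roughly 9.9% is consistent with the study's "approximately 10%" figure.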

Ethical Considerations and Recommendations

The researchers adhered to strict ethical guidelines, anonymizing all sensitive data and examples, and directly notifying affected authors when critical exposures were confirmed. Their goal was to highlight systemic risks, not to expose individuals.

Based on their findings, the study provides urgent recommendations for both authors and preprint platforms:

  • For Authors: Use tools like arXiv LaTeX Cleaner to remove comments and extraneous files, avoid uploading credentials to collaborative platforms, refrain from sharing cloud storage materials with edit permissions via signed URLs, use separate credentials for projects, and regularly audit project directories for sensitive data.

  • For Platforms: Provide clear warnings about public file availability, implement automated sanitization procedures (e.g., removing comments and unused files), consider withholding original source packages, and integrate credential and sensitive-data scanners at upload time.
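One of the sanitization steps above, finding unused files in a source package, can be sketched as follows. This is a heuristic illustration only, not arXiv's or the authors' procedure: the function `unreferenced_files`, the set of LaTeX commands it inspects, and the file names are all assumptions.

```python
import re
from pathlib import PurePosixPath

# Commands assumed to reference other files; real projects may use more.
REF_RE = re.compile(
    r"\\(?:input|include|includegraphics|bibliography)(?:\[[^\]]*\])?\{([^}]+)\}"
)
# Unescaped '%' to end of line, so commented-out references are ignored.
COMMENT_RE = re.compile(r"(?<!\\)%.*")


def _stem(path: str) -> str:
    """Path without its extension, since LaTeX often omits extensions."""
    return PurePosixPath(path).with_suffix("").as_posix()


def unreferenced_files(tex_source: str, package_files: list[str], main: str) -> list[str]:
    """List files in the package that the (comment-free) source never references."""
    active = COMMENT_RE.sub("", tex_source)
    referenced = {_stem(main)}
    for arg in REF_RE.findall(active):
        for name in arg.split(","):
            name = name.strip()
            if name:
                referenced.add(_stem(name))
    return sorted(f for f in package_files if _stem(f) not in referenced)


# Hypothetical submission contents, invented for this example.
tex = r"""
\input{sections/intro}
\includegraphics[width=\linewidth]{figures/plot.pdf}
% \includegraphics{figures/old_draft.png}
\bibliography{refs}
"""
files = [
    "main.tex",
    "sections/intro.tex",
    "figures/plot.pdf",
    "figures/old_draft.png",
    "refs.bib",
    "reviews/response_to_reviewers.docx",
]
print(unreferenced_files(tex, files, main="main.tex"))
```

A check like this could flag leftover files for the author to review before upload. Because it only inspects a fixed set of commands in a single source string, its output should be treated as a list of candidates rather than a definitive verdict.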

The study concludes that approximately 10% of the analyzed papers contained sensitive disclosures, with about 0.2% including critical information like login credentials and API tokens. This research underscores the urgent need for stronger community awareness, improved author practices, and systematic interventions at the platform level to safeguard privacy and security in open science.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]