AI Bridging the Knowledge Gap in Particle Accelerator Documentation

TLDR: This research explores applying Large Language Models (LLMs) to extract, summarize, and organize information from particle accelerator technical documentation. Faced with legacy systems and retiring experts, the study uses a Retrieval-Augmented Generation (RAG) pipeline to create a chatbot that can answer technical questions. Key findings include the effectiveness of smaller text chunks and translation for multilingual documents, while also identifying limitations in processing non-textual data. The work demonstrates LLMs’ potential in preserving institutional knowledge and ensuring continuity in specialized scientific fields.

In highly specialized fields like particle accelerator operations, a wealth of critical knowledge is often locked away in decades-old technical documents, further complicated by the retirement of experienced personnel. This creates a significant challenge for knowledge transfer and preservation. A recent research paper explores how Large Language Models (LLMs) can be applied to tackle this very issue, automating and enhancing the extraction of vital information from these complex technical documents.

The paper, titled “Application of Large Language Models for the Extraction of Information from Particle Accelerator Technical Documentation,” highlights the urgent need for efficient methods to retain specialized knowledge. The High Intensity Proton Accelerator (HIPA) at the Paul Scherrer Institut (PSI), designed 50 years ago, suffers from sparse and inconsistent documentation. Newer facilities like Proscan have better records, but much of the expertise still resides with individuals, and new specialists need years to acquire it.

Leveraging LLMs for Knowledge Retention

The core idea is to develop an LLM-based chatbot that simulates a conversation with an expert, answering technical questions based on existing documentation. This approach aims to significantly speed up knowledge transfer to new personnel. The researchers focused on the beam instrumentation of HIPA and Proscan as a case study and built a system that runs locally so that internal documentation stays secure.

The Documents and Methodology

The initial dataset comprised 58 PDF files, a mix of English and German, spanning 30 years of technical specifics for HIPA and Proscan. It also included master theses, conference proceedings, and publicly available books. A key challenge identified was that some information exists only in non-textual formats like Excel tables, schematics, and databases, which the current system cannot yet process.

The methodology employed is Retrieval-Augmented Generation (RAG), a technique that combines the broad language capabilities of LLMs with an external knowledge base to improve factual accuracy. All data processing and model runs were performed locally using Ollama on a Mac Studio M2 Ultra. The RAG pipeline involves two main stages:

Pre-processing: PDF documents are parsed, extracting text, equations, and tables. These are then split into smaller chunks, and their embeddings (numerical representations) are stored in a vector database along with a file ID.
Runtime: When a user submits a query, it’s also embedded. The system then retrieves the top-k most similar chunks from the vector database. These retrieved chunks, along with the original query, are fed into an instruction-tuned LLM (specifically, gemma3:27b-it-fp16) to generate the final answer.
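The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the file names, prompt wording, and all function names are hypothetical. In the setup described here, the assembled prompt would be sent to the locally served LLM.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=800):
    """Split text into consecutive character windows of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text):
    """Toy stand-in for an embedding model (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, size=800):
    """Pre-processing: chunk each document and store (file_id, chunk, embedding)."""
    index = []
    for file_id, text in documents.items():
        for chunk in chunk_text(text, size):
            index.append((file_id, chunk, embed(chunk)))
    return index


def retrieve(index, query, k=5):
    """Runtime: embed the query and return the top-k (score, file_id, chunk) hits."""
    q = embed(query)
    scored = [(cosine(q, emb), file_id, chunk) for file_id, chunk, emb in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(query, hits):
    """Combine the retrieved chunks and the original query into one LLM prompt."""
    context = "\n\n".join(f"[{file_id}] {chunk}" for _, file_id, chunk in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical two-file corpus, for illustration only.
docs = {
    "monitor_notes.pdf": "The beam profile monitor uses a thin wire to scan the proton beam.",
    "cooling_manual.pdf": "The cooling water circuit must be checked before every startup.",
}
index = build_index(docs)
question = "How does the beam profile monitor work?"
hits = retrieve(index, question, k=1)
prompt = build_prompt(question, hits)
```

Returning the score and file ID with each hit is what lets an interface show the most relevant files next to the generated answer.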

A chatbot interface was developed that not only provides answers but also lists the five most relevant files with similarity scores and snippets, so users can click through to the exact location of each snippet in its source document.

Key Findings and Recommendations

The researchers evaluated the system using 100 expert-created question-answer pairs. They tested various chunking strategies (character windows, paragraph windows) and found that smaller chunks (800 characters) generally outperformed larger ones on two retrieval metrics, recall and mean reciprocal rank (MRR). Interestingly, translating German chunks to English significantly boosted performance for both German and English queries, likely by reducing multilingual noise.
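For readers unfamiliar with these two retrieval metrics, the sketch below shows how they are commonly computed. It assumes the simplest setting, where each question has exactly one relevant item (a chunk or file ID); the paper's exact evaluation protocol may differ, and the function names and sample data are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant item appears among the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant item (rank starts at 1), or 0.0 if it is missing."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(results, k=5):
    """Average both metrics over a list of (ranked_ids, relevant_id) pairs."""
    n = len(results)
    recall = sum(recall_at_k(ranked, rel, k) for ranked, rel in results) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in results) / n
    return recall, mrr


# Three made-up queries: relevant item ranked first, ranked second, and not retrieved.
results = [
    (["a", "b", "c"], "a"),
    (["b", "a", "c"], "a"),
    (["b", "c", "d"], "a"),
]
recall, mrr = evaluate(results, k=2)
```

Recall only asks whether the right item was retrieved at all within the top k, while MRR also rewards placing it near the top of the ranking.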

For generation, translation also showed benefits, especially with fewer retrieved chunks. The models generally exhibited high confidence in their answers. A critical observation was that larger chunk sizes (1600 characters with Top-5 retrieval) could lead to hallucinations, such as summarizing text out of context or ignoring the query, because the retrieved text strained the LLM’s context window. Increasing the context window helped mitigate this.
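A rough back-of-the-envelope calculation illustrates why chunk size and Top-k interact with the context window. Everything numeric in this sketch other than the chunk sizes and Top-5 is an assumption: the four-characters-per-token ratio is a coarse rule of thumb rather than the model's real tokenizer, and the query length, answer reserve, and window sizes are hypothetical values chosen only to show the effect.

```python
def estimate_prompt_tokens(chunk_chars, top_k, query_chars=200, chars_per_token=4.0):
    """Rough token estimate for top_k chunks plus the query (heuristic ratio)."""
    total_chars = chunk_chars * top_k + query_chars
    return total_chars / chars_per_token


def fits(chunk_chars, top_k, context_window_tokens, reserved_for_answer=512):
    """True if the estimated prompt plus an answer reserve fits in the window."""
    needed = estimate_prompt_tokens(chunk_chars, top_k) + reserved_for_answer
    return needed <= context_window_tokens


small = estimate_prompt_tokens(800, 5)   # 800-character chunks, Top-5
large = estimate_prompt_tokens(1600, 5)  # 1600-character chunks, Top-5

# Hypothetical window sizes, for illustration only.
small_fits_2048 = fits(800, 5, context_window_tokens=2048)
large_fits_2048 = fits(1600, 5, context_window_tokens=2048)
large_fits_4096 = fits(1600, 5, context_window_tokens=4096)
```

Under these assumptions, doubling the chunk size roughly doubles the prompt, so a configuration that fits comfortably at 800 characters can overflow a small window at 1600 characters, and enlarging the window restores the headroom.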

The study recommends using 800-character chunks with Top-5 retrieval for the highest answer accuracy and confidence in this specific domain, underscoring the importance of chunk size, retrieval depth (Top-k), and prompt structure.

Future Directions

While the current RAG pipeline shows strong performance with textual content, a significant limitation remains in extracting information from non-textual elements like tables, figures, schematics, and photos. Future work will explore pre-processing techniques such as automatic figure captioning to generate descriptive text for visual content, enabling its retrieval through the existing RAG system. The authors also plan to evaluate multi-modal embedding models that can encode figure content directly into the vector database.

This work highlights the immense potential of LLMs in preserving institutional knowledge and ensuring continuity in highly specialized fields, with anticipated applications for other PSI facilities like SLS and SwissFEL. You can read the full research paper here: Application of Large Language Models for the Extraction of Information from Particle Accelerator Technical Documentation.
