AI Bridging the Knowledge Gap in Particle Accelerator Documentation

TLDR: This research explores applying Large Language Models (LLMs) to extract, summarize, and organize information from particle accelerator technical documentation. Faced with legacy systems and retiring experts, the study uses a Retrieval-Augmented Generation (RAG) pipeline to create a chatbot that can answer technical questions. Key findings include the effectiveness of smaller text chunks and translation for multilingual documents, while also identifying limitations in processing non-textual data. The work demonstrates LLMs’ potential in preserving institutional knowledge and ensuring continuity in specialized scientific fields.

In highly specialized fields like particle accelerator operations, a wealth of critical knowledge is often locked away in decades-old technical documents, and the retirement of experienced personnel compounds the problem. Together, these make knowledge transfer and preservation a significant challenge. A recent research paper explores how Large Language Models (LLMs) can be applied to this issue by automating and enhancing the extraction of vital information from these complex technical documents.

The paper, titled “Application of Large Language Models for the Extraction of Information from Particle Accelerator Technical Documentation,” highlights the urgent need for efficient methods to retain specialized knowledge. Facilities like the High Intensity Proton Accelerator (HIPA) at the Paul Scherrer Institut (PSI), designed 50 years ago, suffer from sparse and inconsistent documentation. Newer facilities like Proscan have better records, but much of the expertise still resides with individuals, and new specialists need years to acquire it.

Leveraging LLMs for Knowledge Retention

The core idea is to develop an LLM-based chatbot that can simulate the ability to chat with an expert, providing answers to technical questions based on existing documentation. This approach aims to significantly speed up the knowledge transfer process for new personnel. The researchers focused on beam instrumentation of HIPA and Proscan as a case study, developing a locally running system to ensure the security of internal documentation.

The Documents and Methodology

The initial dataset comprised 58 PDF files, a mix of English and German, spanning 30 years of technical specifics for HIPA and Proscan. It also included master theses, conference proceedings, and publicly available books. A key challenge identified was that information held in non-textual formats such as Excel tables, schematics, and databases is missing from the dataset, because the current system cannot yet process it.

The methodology employed is Retrieval-Augmented Generation (RAG), a technique that combines the broad language capabilities of LLMs with an external knowledge base to improve factual accuracy. All data processing and model runs were performed locally using Ollama on a Mac Studio M2 Ultra. The RAG pipeline involves two main stages:

  • Pre-processing: PDF documents are parsed, extracting text, equations, and tables. These are then split into smaller chunks, and their embeddings (numerical representations) are stored in a vector database along with a file ID.
  • Runtime: When a user submits a query, it’s also embedded. The system then retrieves the top-k most similar chunks from the vector database. These retrieved chunks, along with the original query, are fed into an instruction-tuned LLM (specifically, gemma3:27b-it-fp16) to generate the final answer.
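The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names and the prompt wording are hypothetical. In the study, embeddings and generation run through Ollama; that call is left out here so the sketch runs on its own.

```python
import math
import re
from collections import Counter

CHUNK_SIZE = 800  # characters per chunk, the size the study recommends
TOP_K = 5         # number of chunks passed to the LLM


def chunk_text(text, size=CHUNK_SIZE):
    """Split text into consecutive character windows of at most `size`."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, size=CHUNK_SIZE):
    """Pre-processing: store (file_id, chunk, embedding) for every chunk."""
    index = []
    for file_id, text in documents.items():
        for chunk in chunk_text(text, size):
            index.append((file_id, chunk, embed(chunk)))
    return index


def retrieve(index, query, k=TOP_K):
    """Runtime: embed the query and return the k most similar chunks."""
    q = embed(query)
    scored = [(cosine(q, emb), file_id, chunk) for file_id, chunk, emb in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(query, hits):
    """Combine retrieved chunks and the query into one LLM prompt."""
    context = "\n\n".join(f"[{file_id}] {chunk}" for _, file_id, chunk in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical two-document corpus, for illustration only.
docs = {
    "bpm_manual.pdf": "The beam position monitor measures the transverse "
                      "position of the proton beam.",
    "loss_monitor.pdf": "Ionisation chambers detect beam losses along the beamline.",
}
index = build_index(docs)
hits = retrieve(index, "How is the beam position measured?", k=1)
prompt = build_prompt("How is the beam position measured?", hits)
```

The resulting `prompt` string is what would be sent to the instruction-tuned model. Swapping `embed` for a real embedding model changes the quality of retrieval but not the shape of the pipeline.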

The researchers also developed a chatbot interface that, alongside each answer, lists the five most relevant files with similarity scores and snippets, so users can click through to the exact location of the retrieved passage in its source document.
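One way such a source list could be assembled from chunk-level results is sketched below. The article does not say how chunk scores are rolled up per file, so keeping each file's best-scoring chunk is an assumption, as are the function name `top_files`, the field names, and the snippet length.

```python
def top_files(hits, n=5, snippet_len=80):
    """Collapse chunk hits into the n best files, keeping the best chunk per file.

    `hits` is a list of (similarity_score, file_id, chunk_text) tuples.
    """
    best = {}
    for score, file_id, chunk in hits:
        # Keep only the highest-scoring chunk seen so far for each file.
        if file_id not in best or score > best[file_id][0]:
            best[file_id] = (score, chunk)
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [
        {
            "file": file_id,
            "score": round(score, 3),
            "snippet": chunk[:snippet_len].strip(),
        }
        for file_id, (score, chunk) in ranked[:n]
    ]


# Hypothetical retrieval results, for illustration only.
example_hits = [
    (0.9, "a.pdf", "alpha text"),
    (0.8, "a.pdf", "other text from the same file"),
    (0.7, "b.pdf", "beta text"),
]
sources = top_files(example_hits)
```

Here `sources` holds one entry per file, ordered by the score of its best chunk, which is the information a results panel would need to render.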

Key Findings and Recommendations

The researchers evaluated the system using 100 expert-created question-answer pairs. They tested several chunking strategies (character windows and paragraph windows) and found that smaller chunks (800 characters) generally outperformed larger ones on two retrieval metrics: recall and Mean Reciprocal Rank (MRR). Interestingly, translating German chunks to English significantly boosted performance for both German and English queries, likely by reducing multilingual noise.
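For readers unfamiliar with the two retrieval metrics, here is a short sketch of how recall at k and MRR are commonly computed. These are the standard textbook definitions; the article does not spell out the paper's exact evaluation code, and the chunk IDs below are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries: the first finds its chunk at rank 2,
# the second at rank 1, so MRR = (1/2 + 1/1) / 2 = 0.75.
runs = [(["c7", "c2", "c5"], ["c2"]), (["c1", "c4"], ["c1"])]
mrr = mean_reciprocal_rank(runs)
```

Recall rewards finding the right chunks anywhere in the top k, while MRR rewards ranking the first correct chunk as high as possible.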

For generation, translation also showed benefits, especially with fewer retrieved chunks. The models generally exhibited high confidence in their answers. A critical observation was that larger chunk sizes (1600 characters with Top-5 retrieval) could lead to hallucinations, such as summarizing text out of context or ignoring the query, because the combined prompt approached the limits of the LLM’s context window. Increasing the context window helped mitigate this.
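A back-of-the-envelope check shows why chunk size and Top-k interact with the context window. Everything numeric in this sketch is an assumption made for illustration: the four-characters-per-token ratio is a rough rule of thumb for English text, and the prompt overhead, answer reserve, and the 2048-token and 8192-token windows are hypothetical values, not figures from the paper.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb, an assumption


def estimated_prompt_tokens(chunk_chars, top_k, overhead_chars=500):
    """Estimate prompt size in tokens for top_k chunks plus instructions."""
    total_chars = chunk_chars * top_k + overhead_chars
    return -(-total_chars // CHARS_PER_TOKEN)  # ceiling division


def fits(chunk_chars, top_k, context_tokens, reserved_for_answer=512):
    """True if the prompt plus an answer budget fits in the context window."""
    needed = estimated_prompt_tokens(chunk_chars, top_k) + reserved_for_answer
    return needed <= context_tokens


small = estimated_prompt_tokens(800, 5)    # (800 * 5 + 500) / 4 = 1125
large = estimated_prompt_tokens(1600, 5)   # (1600 * 5 + 500) / 4 = 2125
```

Under these assumptions, `fits(800, 5, 2048)` is true while `fits(1600, 5, 2048)` is false, and both fit once the window is raised to 8192 tokens. When a prompt exceeds the window, part of it is typically cut off, which is consistent with the reported symptom of the model ignoring the query.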

The study recommends using 800-character chunks with Top-5 retrieval for the highest answer accuracy and confidence in this specific domain, underscoring the importance of chunk size, retrieval depth (Top-k), and prompt structure.

Future Directions

While the current RAG pipeline shows strong performance with textual content, a significant limitation remains in extracting information from non-textual elements like tables, figures, schematics, and photos. Future work will explore pre-processing techniques such as automatic figure captioning to generate descriptive text for visual content, enabling its retrieval through the existing RAG system. Additionally, the evaluation of multi-modal embedding models that can directly encode figure content into the vector database is planned.

This work highlights the immense potential of LLMs in preserving institutional knowledge and ensuring continuity in highly specialized fields, with anticipated applications for other PSI facilities like SLS and SwissFEL. You can read the full research paper here: Application of Large Language Models for the Extraction of Information from Particle Accelerator Technical Documentation.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes.
