TLDR: ConnectomeBench evaluates LLMs on three connectomics proofreading tasks: segment type identification, split error correction, and merge error detection. Current models perform well above chance on segment identification (52-82% balanced accuracy vs. a 20-25% chance level) and split error correction (75-85% vs. 50% chance) but struggle with merge error detection. While not yet matching human experts, the results suggest LLMs could eventually augment, and potentially replace, human proofreading in mapping brain connections.
Mapping the intricate network of neural connections in an organism's brain, a field known as connectomics, is a monumental task. Currently, much of the effort goes into "proofreading" the vast datasets generated from brain imaging and machine-learning-assisted segmentation. This manual correction process is a major bottleneck, with some projects, like the complete fruit fly connectome, requiring an estimated 33 human years of proofreading. Recent advances in AI, particularly large language models (LLMs), have opened up the possibility of automating such complex scientific tasks.
A new study introduces ConnectomeBench, a benchmark designed to evaluate how well current AI systems can perform the critical proofreading tasks necessary for connectomics. This benchmark assesses multimodal LLM capabilities across three key areas: identifying segment types, correcting split errors, and detecting merge errors. The researchers used expertly annotated data from two extensive open-source datasets: a cubic millimeter of mouse visual cortex and the entire Drosophila brain.
Understanding the Proofreading Challenges
The process of creating a connectome involves several steps. First, high-resolution imaging techniques like electron microscopy are used to capture many “slices” of brain tissue. These slices are then aligned and stacked to create a 3D imaging volume. Next, a segmentation algorithm is applied to this volume to identify individual components like neurons, non-neuronal cells, and blood vessels. However, both the imaging data and the segmentation algorithms are imperfect, leading to errors.
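To make the pipeline concrete, here is a minimal sketch of the segmentation step, assuming a boundary-probability map has already been predicted for the aligned volume. The array shapes, thresholds, and the watershed approach are illustrative assumptions, not the specific method used in these datasets:

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

# Toy stand-in for an aligned EM volume's boundary-probability map
# (values near 1.0 = likely membrane, near 0.0 = cell interior).
rng = np.random.default_rng(0)
boundary_prob = rng.random((64, 64, 64)).astype(np.float32)

# Seed one marker per connected low-boundary region, then grow
# segments with a 3D watershed on the boundary map.
interior = boundary_prob < 0.2
markers, n_seeds = ndimage.label(interior)
segmentation = watershed(boundary_prob, markers, mask=boundary_prob < 0.9)

print(f"{n_seeds} seeds -> {segmentation.max()} candidate segments")
```

Imperfections at exactly this stage, under- or over-aggressive region growing, are what produce the split and merge errors described next.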
These errors fall into two main categories: split errors and merge errors. Split errors occur when parts of a single neuron are incorrectly separated. Merge errors happen when segments from multiple neurons are mistakenly combined. Human experts then meticulously review and correct these errors using specialized graphical user interfaces.
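In most proofreading tools, each reconstructed object is effectively a set of supervoxels, so both error types reduce to edits on that partition. Here is a minimal sketch of that idea; the IDs and groupings below are made up for illustration:

```python
# Each proofread object is a set of supervoxel IDs (illustrative values).
segments = {
    101: {1, 2, 3},       # a neuron missing its axon: a split error
    102: {4, 5},          # the detached axon fragment
    103: {6, 7, 8, 9},    # two neurons fused together: a merge error
}

def fix_split(segments, keep_id, drop_id):
    """Correct a split error by merging two segments into one object."""
    segments[keep_id] |= segments.pop(drop_id)

def fix_merge(segments, seg_id, supervoxels_to_split_off, new_id):
    """Correct a merge error by splitting supervoxels off into a new object."""
    segments[seg_id] -= supervoxels_to_split_off
    segments[new_id] = set(supervoxels_to_split_off)

fix_split(segments, keep_id=101, drop_id=102)   # 101 now owns {1, ..., 5}
fix_merge(segments, 103, {8, 9}, new_id=104)    # 103 -> {6, 7}, 104 -> {8, 9}
print(segments)
```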
ConnectomeBench: Three Key Tasks
ConnectomeBench evaluates LLMs on three fundamental proofreading tasks:
- Segment type identification: This involves classifying segmented structures into categories such as single neurons, merged neurons, neuronal processes without a cell body, nuclei, or non-neuronal cells.
- Split error correction: Here, the LLM must determine if two separated segments should actually be merged because they belong to the same neuron.
- Merge error identification: This task requires the LLM to detect instances where segments from multiple neurons have been incorrectly joined together.
The benchmark leverages the multimodal capabilities of LLMs by presenting them with images of 3D segmentation data. Their performance is then assessed through both binary classification (yes/no) and multiple-choice evaluations.
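As a rough sketch of how a binary split-error query might be posed to a multimodal model through the OpenAI API: the image files, prompt wording, and model choice here are assumptions for illustration, not the paper's exact protocol:

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Hypothetical renderings of the two candidate segments.
images = [encode_image(p) for p in ("segment_a.png", "segment_b.png")]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "These are two 3D-rendered segments from an EM "
                     "reconstruction. Do they belong to the same neuron "
                     "and should they be merged? Answer yes or no."},
            *[{"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{img}"}}
              for img in images],
        ],
    }],
)
print(response.choices[0].message.content)
```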
Key Findings: Promising but Room for Improvement
The study evaluated several proprietary multimodal LLMs, including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, and GPT-4o, as well as open-source models like InternVL-3 and NVLM. The results showed that current models achieved surprisingly high performance in segment identification, with balanced accuracies ranging from 52% to 82% (compared to a 20-25% chance level). They also performed well on binary and multiple-choice split error correction, achieving 75-85% accuracy (compared to a 50% chance level).
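For reference, balanced accuracy is the mean of per-class recall, which is why chance sits near 20-25% for the four-to-five-way segment-type task and at 50% for binary decisions. A quick sketch with scikit-learn, using fabricated labels:

```python
from sklearn.metrics import balanced_accuracy_score

# Fabricated 5-class segment-type labels vs. model predictions.
y_true = ["neuron", "merge", "process", "nucleus", "non-neuronal"] * 4
y_pred = ["neuron", "neuron", "process", "nucleus", "non-neuronal"] * 4

# Mean of per-class recall; a uniform random guesser scores ~1/5 here.
print(balanced_accuracy_score(y_true, y_pred))  # 0.8
```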
However, the models generally struggled with merge error identification tasks. While the best models still lag behind expert human performance, their demonstrated capabilities are promising. The researchers suggest that these AI systems could eventually augment and potentially replace human proofreading in connectomics.
One interesting finding was that providing additional descriptive context in the prompts did not always significantly improve the performance of proprietary models for segment identification, suggesting these models already possess strong internal visual reasoning capabilities. For split error correction, however, adding descriptive information significantly improved performance for most models in the multiple-choice format.
Furthermore, the study explored the use of “heuristics” derived from analyzing LLM reasoning patterns. By incorporating these heuristics into the prompts, performance on both binary and multiple-choice split error correction tasks improved across almost all models. This highlights the potential of using LLMs’ natural language reasoning to understand and address their own limitations.
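The heuristic-prompting idea amounts to distilling recurring cues from the models' own reasoning traces and prepending them to future prompts. A minimal sketch follows; these example heuristics are paraphrased illustrations, not the paper's actual list:

```python
# Illustrative heuristics of the kind that might be distilled from
# model reasoning traces on split-error examples (not the paper's list).
HEURISTICS = [
    "Check whether the cut faces of the two segments align in 3D space.",
    "Compare process calibers: a thick trunk rarely continues as a thin twig.",
    "Follow the overall trajectory: true continuations preserve direction.",
]

def build_prompt(question: str, heuristics: list[str]) -> str:
    """Prepend distilled heuristics to the base proofreading question."""
    tips = "\n".join(f"- {h}" for h in heuristics)
    return f"Useful heuristics from prior analyses:\n{tips}\n\n{question}"

print(build_prompt("Should segments A and B be merged?", HEURISTICS))
```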
The Future of Connectomics Proofreading
ConnectomeBench provides a standardized method for evaluating LLM capabilities in connectome proofreading, establishing a baseline for current models and identifying areas for future development. While there are still challenges, particularly with merge error identification, the progress shown by LLMs in visual reasoning suggests a future where AI agents could significantly reduce the human effort required for connectome creation. For more details, you can read the full research paper here.