TLDR: The research paper introduces CLS-DM, a latent diffusion model for reconstructing 3D CT images from sparse 2D X-ray views. It addresses the challenge of aligning 2D X-ray features with 3D CT latent representations through a three-stage training process: perceptual compression, contrastive learning for alignment (with autoregressive guidance for the X-ray encoder), and a conditional diffusion process. The authors report that CLS-DM improves CT reconstruction quality and detail compared to existing methods, offering a more efficient and clinically viable approach to medical imaging.
Computed Tomography (CT) scans are a cornerstone of modern clinical diagnosis, providing detailed 3D insights into the body. However, traditional CT imaging, which relies on a dense array of X-ray exposures, comes with significant drawbacks: it’s time-consuming and exposes patients to high levels of radiation. This has driven researchers to explore methods for reconstructing CT images from fewer X-ray views, known as sparse-view CT reconstruction, aiming to reduce costs and health risks.
Recent advancements in artificial intelligence, particularly with diffusion models like the Latent Diffusion Model (LDM), have shown great promise in 3D CT reconstruction. Yet, a key challenge persists: the fundamental difference between the 2D nature of X-ray images and the 3D nature of CT scans makes it difficult for standard LDMs to effectively align these different data types within their ‘latent space’ – a compressed, abstract representation of the data. This misalignment can hinder the learning process and lead to less accurate reconstructions.
To overcome this, a new approach called the Consistent Latent Space Diffusion Model (CLS-DM) has been proposed. This innovative model integrates a technique called cross-modal feature contrastive learning. In simple terms, this helps the model efficiently extract 3D information from 2D X-ray images and ensures that the latent representations of X-rays and CT scans are properly aligned. This alignment is crucial for the diffusion model to learn and reconstruct high-quality 3D CT images.
How CLS-DM Works: A Three-Stage Process
CLS-DM is trained in three stages:
The first stage focuses on ‘perceptual feature compression’. Here, the original 3D CT scan data is compressed from its raw ‘voxel space’ (think of it as a 3D grid of pixels) into a more compact ‘latent space’. This process aims to capture the essential high-dimensional features of the CT images while reducing redundant information, making subsequent computations more efficient.
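To make the size reduction concrete, here is a minimal sketch that shrinks a toy 3D volume with plain average pooling. This is only a stand-in: the model described here learns its compression, whereas average pooling is a fixed operation, and the function name `average_pool_3d` and the 4x4x4 toy volume are assumptions made for illustration.

```python
def average_pool_3d(volume, factor):
    """Downsample a 3D grid by averaging non-overlapping factor^3 blocks.

    Assumes every dimension is divisible by `factor`. A learned encoder
    would replace this fixed averaging in a real first stage.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for d in range(0, depth, factor):
        plane = []
        for h in range(0, height, factor):
            row = []
            for w in range(0, width, factor):
                block = [
                    volume[d + i][h + j][w + k]
                    for i in range(factor)
                    for j in range(factor)
                    for k in range(factor)
                ]
                row.append(sum(block) / len(block))
            plane.append(row)
        pooled.append(plane)
    return pooled


# Hypothetical 4x4x4 "CT volume" whose voxel value encodes its position.
volume = [[[d * 16 + h * 4 + w for w in range(4)] for h in range(4)] for d in range(4)]
latent = average_pool_3d(volume, factor=2)

voxels_before = 4 * 4 * 4
voxels_after = len(latent) * len(latent[0]) * len(latent[0][0])
```

Halving each axis cuts the number of values by a factor of eight, which is why working in a compact latent space makes the later stages cheaper.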
The second stage is where the magic of ‘contrastive learning’ happens. This module is designed to align the features extracted from X-ray images with the latent space created in the first stage. Imagine teaching the model to recognize that a specific pattern in a 2D X-ray corresponds to a particular 3D structure in the CT latent space. This is achieved by minimizing the ‘distance’ between features of the same entity (e.g., an X-ray and a CT scan of the same patient) while maximizing the distance between features of different entities. To ensure that this alignment process doesn’t degrade the X-ray feature extraction capabilities, an ‘autoregressive’ mechanism guides the training of the conditional encoder, which is responsible for processing the X-ray images.
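The pull-together, push-apart idea can be sketched with an InfoNCE-style loss, one common form of contrastive objective. This is an assumption for illustration: the exact loss used by CLS-DM may differ, and the feature vectors, the `temperature` value, and the function names below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(xray_feats, ct_latents, temperature=0.1):
    """Average cross-entropy where X-ray i should match CT latent i.

    Matching pairs (same patient) are the positives; every other CT
    latent in the batch acts as a negative.
    """
    n = len(xray_feats)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(xray_feats[i], ct_latents[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_sum_exp - logits[i]
    return total / n


# Toy batch of two patients with 2-dimensional features.
ct = [[1.0, 0.0], [0.0, 1.0]]
aligned_xray = [[1.0, 0.0], [0.0, 1.0]]      # each X-ray matches its own CT
misaligned_xray = [[0.0, 1.0], [1.0, 0.0]]   # each X-ray matches the wrong CT

loss_aligned = info_nce_loss(aligned_xray, ct)
loss_misaligned = info_nce_loss(misaligned_xray, ct)
```

The loss is close to zero when each X-ray feature points at its own CT latent and large when the pairs are swapped, so minimizing it pushes the two latent spaces into agreement.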
Finally, the third stage is the ‘conditional diffusion process’. With the latent spaces now aligned, the diffusion model uses the aligned X-ray features as a guiding condition to iteratively refine and generate the 3D CT image within the latent space. This process essentially reverses a controlled ‘noise’ addition, gradually revealing the detailed CT structure.
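The noise-and-reverse idea can be sketched with the standard DDPM-style forward step. This is an assumption for illustration: the linear noise schedule, the step count, and the function names are not taken from the paper. In a trained model a network conditioned on the X-ray features would predict the noise; here the true noise is plugged in as an oracle to show that the step is invertible.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, noise, alpha_bar):
    """Forward process: blend the clean latent with Gaussian noise."""
    return [
        math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
        for z, e in zip(z0, noise)
    ]


def estimate_clean_latent(zt, predicted_noise, alpha_bar):
    """Invert the forward step given a prediction of the added noise."""
    return [
        (z - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
        for z, e in zip(zt, predicted_noise)
    ]


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)

z0 = [0.5, -1.0, 2.0]                      # hypothetical clean CT latent
noise = [rng.gauss(0.0, 1.0) for _ in z0]
t = 500

zt = add_noise(z0, noise, alpha_bars[t])
# Oracle stand-in for the conditional denoising network.
z0_hat = estimate_clean_latent(zt, noise, alpha_bars[t])
```

With a perfect noise prediction the clean latent is recovered exactly, so reconstruction quality depends on how well the conditioned network predicts the noise, which is where well-aligned X-ray features help.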
Enhanced Performance and Practicality
Experimental results demonstrate that CLS-DM significantly outperforms both classical and state-of-the-art generative models in terms of standard image quality metrics like PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index Measure) on widely used medical datasets such as LIDC-IDRI and CTSpine1K. Visually, the CT images reconstructed by CLS-DM show substantially more pronounced and accurate details compared to other methods, which often produce overly smooth or less precise results.
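For readers unfamiliar with PSNR, here is a minimal sketch of how it is computed from the mean squared error between a reference and a reconstruction. The intensity values are made up for illustration and assume images scaled to the range 0 to 1; they are not results from the paper.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak Signal-to-Noise Ratio in decibels; higher means closer."""
    mse = sum(
        (r - x) ** 2 for r, x in zip(reference, reconstruction)
    ) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened voxel intensities in [0, 1].
reference = [0.0, 0.5, 1.0, 0.25]
reconstruction = [0.1, 0.4, 0.9, 0.35]

score = psnr(reference, reconstruction)  # every voxel is off by 0.1, giving about 20 dB
```

PSNR only measures average pixel-level error, which is why it is usually reported together with SSIM, a metric that compares local structure.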
A key advantage of CLS-DM is its efficiency. Although training includes a contrastive learning phase, inference (generating a CT scan from new X-rays) does not become significantly more expensive. Furthermore, the method restricts the X-ray views to the common sagittal and coronal planes, which not only leads to higher-quality reconstructions but is also more feasible in clinical practice, since capturing X-rays from unconventional angles can be costly.
This methodology not only makes CT reconstruction from sparse X-rays more effective and economical but may also generalize to other cross-modal transformation tasks, such as text-to-image synthesis. The code for CLS-DM has been made publicly available to encourage further research and applications. You can find more details in the research paper.


