TLDR: The Scalable Image-to-3D Facade Parser (SI3FP) is a new pipeline that uses deep learning and computer vision to create detailed 3D thermal models of buildings from images. It offers two paths: one for scalable analysis using sparse data like Street View, and another for targeted, high-resolution modeling using dense camera images and Neural Radiance Fields (NeRF). By correcting perspective distortions and directly modeling geometric primitives in orthographic images, SI3FP accurately detects windows and estimates the Window-to-Wall Ratio (WWR) with about 5% error, making it a practical tool for early-stage energy renovation planning and urban development.
Understanding and improving the energy efficiency of existing buildings is a crucial step in addressing climate change. A significant challenge in this effort is the lack of detailed 3D models of older buildings, especially those that include specific features like windows, which are vital for accurate energy simulations. Traditional methods for creating these models are often expensive, time-consuming, and not easily scalable for large numbers of buildings.
A new research paper introduces the Scalable Image-to-3D Facade Parser (SI3FP), a novel pipeline designed to generate detailed 3D thermal models of buildings. These models reach Level of Detail 3 (LoD3), meaning they include facade features such as windows, which are essential for precise energy renovation planning. SI3FP uses computer vision and deep learning to extract geometric information directly from images.
Unlike previous approaches that segment images and then project those segments into 3D, SI3FP directly models geometric primitives, such as rectangles for windows, within an orthographic image. An orthographic image removes perspective distortion, so objects keep their true scale and shape regardless of their distance from the camera. Every facade therefore reaches the detector in the same consistent, measurable form.
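To make the scale argument concrete, here is a minimal sketch that compares how wide the same window appears under a simple pinhole (perspective) camera and under an orthographic projection. The function names, the focal length, and the meters-per-pixel value are illustrative assumptions, not values from the paper.

```python
def perspective_width_px(width_m: float, depth_m: float, focal_px: float) -> float:
    """Projected width in pixels under a pinhole model: shrinks with depth."""
    return focal_px * width_m / depth_m


def orthographic_width_px(width_m: float, meters_per_px: float) -> float:
    """Projected width in pixels under an orthographic model: depth plays no role."""
    return width_m / meters_per_px


window_width_m = 1.2  # hypothetical window width

# The same window seen from 5 m and from 20 m with a hypothetical 1000 px focal length.
near_px = perspective_width_px(window_width_m, depth_m=5.0, focal_px=1000.0)
far_px = perspective_width_px(window_width_m, depth_m=20.0, focal_px=1000.0)

# In an orthographic image at a hypothetical 0.01 m per pixel, the width is fixed.
ortho_px = orthographic_width_px(window_width_m, meters_per_px=0.01)

print(f"perspective: {near_px:.0f} px near, {far_px:.0f} px far")
print(f"orthographic: {ortho_px:.0f} px at any distance")
```

In the perspective case the pixel width depends on how far the window is from the camera, so a detected rectangle cannot be turned into meters without depth information. In the orthographic case a single scale factor is enough.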
The SI3FP pipeline offers two main pathways to accommodate different data availability scenarios. The “StreetView” path is designed for scalable inspection, utilizing readily available, sparse data like Google Street View images. This path includes steps for collecting and filtering panoramic images, clustering associated 3D planes (representing building surfaces), aligning these images to improve robustness, and finally detecting and cropping facades. A key innovation here is an ensemble method that combines information from multiple overlapping views to overcome issues like occlusions and varying viewpoints.
The second pathway, “Camera2D,” is tailored for targeted, high-resolution inspection. This involves collecting a dense set of photographs of a specific building. It uses advanced techniques like Structure-from-Motion (SfM) to reconstruct the 3D structure and estimate camera positions, and Neural Radiance Fields (NeRF) to create highly realistic 3D renderings of the building. From these detailed 3D models, true orthographic images are generated, providing a precise representation of the facade.
Once the orthographic facade images are generated by either path, the two paths converge on a shared step: semantic facade parsing. Here, a pre-trained deep learning detector (a RetinaNet with a ResNet-50 backbone) locates each window on the facade and estimates its size. If multiple images of the same facade are available, the system fuses the detections to improve reliability and consistency. The detected window dimensions are then converted into real-world measurements using the scale information derived from the initial data collection.
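The article does not spell out how the fusion works, so the following is only a sketch of one common approach: group boxes from different views by intersection-over-union (IoU), keep groups supported by at least two views, and average their coordinates. The threshold, the vote count, and all function names are assumptions for illustration, and the boxes are assumed to be already aligned in the same orthographic facade frame.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(views: List[List[Box]], iou_thresh: float = 0.5,
                    min_votes: int = 2) -> List[Box]:
    """Greedily cluster boxes across views and average well-supported clusters."""
    clusters: List[List[Box]] = []
    for boxes in views:
        for box in boxes:
            for cluster in clusters:
                # Compare against the first box of the cluster (its seed).
                if iou(box, cluster[0]) >= iou_thresh:
                    cluster.append(box)
                    break
            else:
                clusters.append([box])
    fused: List[Box] = []
    for cluster in clusters:
        if len(cluster) >= min_votes:
            n = len(cluster)
            fused.append(tuple(sum(b[i] for b in cluster) / n for i in range(4)))
    return fused


def box_size_m(box: Box, meters_per_px: float) -> Tuple[float, float]:
    """Convert a pixel box to (width, height) in meters using the image scale."""
    return ((box[2] - box[0]) * meters_per_px, (box[3] - box[1]) * meters_per_px)


# Two views agree on one window; a third view contributes a lone, unsupported box.
views = [
    [(100.0, 100.0, 200.0, 250.0)],
    [(104.0, 98.0, 204.0, 252.0)],
    [(500.0, 500.0, 520.0, 520.0)],
]
fused = fuse_detections(views)
width_m, height_m = box_size_m(fused[0], meters_per_px=0.02)
print(fused, width_m, height_m)
```

Requiring agreement between views discards detections that appear in only one image, which is one plausible way to suppress false positives caused by occlusions or reflections.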
The final step involves 3D thermal modeling. The detected windows, along with the facade geometry and available building footprint information, are used to reconstruct a complete 3D model of the building in a standardized format called HoneybeeJSON. This model can then be used for energy simulations to evaluate potential renovation alternatives and support decision-making for building owners.
Experiments on typical Swedish residential buildings from the 1960s and 70s demonstrated the effectiveness of SI3FP. The system estimated the Window-to-Wall Ratio (WWR) with approximately 5% error, which is considered sufficient for early-stage renovation analysis. The Camera2D path generally performed better thanks to more controlled data acquisition, while the StreetView path proved highly scalable and cost-effective. The research highlights the trade-offs between data density, equipment complexity, time efficiency, and cost, making SI3FP a versatile tool for large-scale energy renovation planning and urban development.
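For readers unfamiliar with the metric, WWR is the total window area divided by the gross wall area of a facade. The sketch below computes it from window sizes in meters and compares it with a reference value. All numbers are made up for illustration, and the error is reported both as an absolute difference and as a relative one, since the summary above does not say which definition the 5% figure uses.

```python
from typing import Iterable, Tuple


def window_to_wall_ratio(windows_m: Iterable[Tuple[float, float]],
                         facade_width_m: float,
                         facade_height_m: float) -> float:
    """Total window area divided by gross facade area (both in square meters)."""
    wall_area = facade_width_m * facade_height_m
    if wall_area <= 0:
        raise ValueError("facade dimensions must be positive")
    window_area = sum(w * h for w, h in windows_m)
    return window_area / wall_area


# Hypothetical facade: 20 m wide, 6 m high, with eight 1.5 m x 1.2 m windows.
detected_windows = [(1.5, 1.2)] * 8
estimated_wwr = window_to_wall_ratio(detected_windows, 20.0, 6.0)

# Hypothetical ground-truth WWR, e.g. from drawings or manual measurement.
reference_wwr = 0.15

absolute_error = abs(estimated_wwr - reference_wwr)
relative_error = absolute_error / reference_wwr

print(f"estimated WWR: {estimated_wwr:.3f}")
print(f"absolute error: {absolute_error:.3f}, relative error: {relative_error:.1%}")
```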
For more in-depth information, you can refer to the full research paper: Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models.


