Mapping facade materials utilizing zero-shot segmentation for applications in urban microclimate research

Zero-shot models
To address the challenges of urban facade material segmentation, this study employs a series of state-of-the-art zero-shot learning and foundation models known for their robust capabilities in computer vision tasks. These models, originally trained on vast and diverse datasets, are adapted here to encode and extract patterns specific to urban material classification without requiring extensive labeled training data.
- OpenAI CLIP: utilized for its zero-shot classification capabilities, leveraging text and image embeddings to match facade materials with textual descriptions25 (see the sketch following this list). Despite its proficiency in material recognition through text prompts, CLIP does not inherently localize material regions within images, necessitating supplementary mechanisms for detailed segmentation.
- CLIPSeg: deployed to extend CLIP’s capabilities, this model uses attention mechanisms to produce low-resolution segmentations of materials in images, highlighting the weighted image regions that most strongly influence classification decisions26.
- SAM (Segment Anything Model): acts as a foundational tool for clustering surface types within images27. Although it does not assign material classes to the segments it produces, it effectively separates different surface areas and bounding regions, providing a reliable segmentation base.
- Grounding DINO: this zero-shot framework complements the above models with text-prompted object detection28. It iteratively finds the optimal bounding box containing the object of interest specified through a text prompt. It is capable of recognizing discrete objects in an image but cannot handle extensive surfaces that are not concentrated in an object-like region. Once an object is detected, panoptic segmentation may be used to delineate its exact boundaries rather than a simple bounding box29.
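As a minimal illustration of CLIP’s zero-shot matching applied to a facade patch, the following Python sketch scores an image crop against textual material descriptions. The model variant, prompts, and file path are illustrative assumptions rather than the exact configuration used in this study (the study’s tuned prompts appear in the supplementary data).

```python
import torch
import clip  # OpenAI's CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # illustrative backbone

# Illustrative material prompts; the study's tuned prompts differ
# (see Table 1, supplementary data).
materials = ["brick", "exposed concrete", "glazing", "roof tiles",
             "metal panels", "wood siding", "plaster"]
text = clip.tokenize([f"a photo of a {m} facade" for m in materials]).to(device)

# Classify a cropped facade patch (hypothetical path).
image = preprocess(Image.open("facade_patch.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print(dict(zip(materials, probs[0])))  # similarity-based class probabilities
```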
Recent research efforts have focused on merging these models for unsupervised segmentation30,31. These approaches, while promising, currently suit simple scenes with minimal textures and struggle with complex building facades; they tend to perform less effectively in delineating textures that are not confined within specific image areas. Panoptic segmentation offers advances in delineating image region contours by carrying out pixel-wise segmentation while unifying the typically distinct tasks of semantic segmentation (pixel-level class labels) and instance segmentation (object instance detection and segmentation). Yet it primarily focuses on distinct, well-clustered objects rather than dispersed textures, indicating a gap in addressing the nuanced requirements of building facade material segmentation. Few-shot learning (for classification, detection, and segmentation) presents an extended approach, providing the ability to learn from a small dataset (2–30 samples)32. This offers an alternative avenue, particularly when combined with fine-tuning large foundation models. However, this area of research is in active development and may exhibit catastrophic forgetting as well as domain bias33.
While zero-shot foundation models excel at their respective tasks, they require further adaptation and tuning to match particular problem domains. Recent advancements in zero-shot learning have significantly expanded its capabilities across various domains. Notable among these are developments in attention mechanisms integrated into models like CLIPSeg, which improve segmentation by focusing on the most relevant regions of an image, particularly useful in complex urban environments with diverse materials. These mechanisms help refine localization, addressing the challenge of delineating textures that are dispersed across a facade. While zero-shot models reduce the dependency on labeled data, they may still struggle with fine-grained textures or rare materials; supervised approaches might excel in these areas, but at the cost of requiring extensive labeled datasets and manual intervention.
Another key advancement is the use of semantic embeddings, which allow models to link visual information with high-level semantic descriptions. This technique has been widely employed in zero-shot object detection, where models detect unseen object classes by leveraging relationships between known and unknown categories34. In remote sensing, such semantic embeddings have enabled land cover classification with minimal supervision35,36, and in industrial material recognition they have facilitated the accurate detection of novel materials in complex manufacturing settings as well as in textile applications37. The use of multimodal zero-shot learning in robotics is also noteworthy, where tactile texture recognition is performed without prior tactile training samples by combining visual and semantic information. This method has demonstrated the ability to classify previously unseen materials through tactile sensing, highlighting the versatility of zero-shot learning in recognizing complex material properties beyond vision-based tasks37. These technical developments are particularly relevant to urban material segmentation, where the heterogeneity of facade materials and the limited availability of annotated datasets pose significant challenges. By incorporating attention-based mechanisms and semantic embeddings, we improve detection and segmentation accuracy in diverse urban environments, enabling more scalable and flexible urban material mapping.
Hence, the application of foundation zero-shot frameworks to urban surface mapping remains nascent. The method outlined in this paper aims to combine and extend existing zero-shot approaches into a joint workflow that specifically addresses the needs of the urban scientist. Specifically, we hope to leverage the strengths of the different models and assess their applicability to the different segmentation sub-tasks. Given the ever-evolving nature of computer vision frameworks, our goal is not to offer a definitive way to segment materials but rather to demonstrate the power of zero-shot models for urban material applications and offer one potential open-source strategy.
Workflow overview: image sourcing, detection and segmentation
The first step in the workflow is image sourcing and processing. The target images are those that show direct views of the facade with minor perspectival distortion, as these views typically yield the highest clarity. StreetView 36038 was used to download high-resolution urban panoramas and subsequently warp each panorama to obtain corresponding frontal views of building facades. Panoramic images were selected for their capability to capture multiple buildings at once, increasing the efficiency of large-scale urban material analysis. It is also possible to implement multi-view integration to improve segmentation accuracy across different perspectives. We then use a semantic segmentation network trained on the ADE20K dataset to filter out all elements not related to urban facades, such as streets and vegetation, before carrying out facade segmentation. This is detailed in Fig. 1.

Workflow overview for image sourcing, transformation, and joint detection-segmentation of facade regions (materials) and objects (elements). The workflow has two parts. Image sourcing: a schematic detailing the preparation of street view panoramas for subsequent segmentation, used in preparing the validation dataset and the subsequent neighborhood-scale material mapping. Detection and segmentation: a layout of the segmentation workflow. At the image fragment classification stage, we explore two approaches: (1) using CLIP to classify an image patch; (2) using CLIPSeg to determine the class triggering the most attention within the identified object mask.
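As an illustration of the panorama-to-frontal-view step, the sketch below reprojects an equirectangular street-view panorama into a rectilinear facade view. It uses the open-source py360convert library as a stand-in for the warping performed with StreetView 360; the viewing angles, field of view, and output size are illustrative assumptions.

```python
import numpy as np
from PIL import Image
import py360convert  # equirectangular-to-perspective reprojection (illustrative choice)

# Load an equirectangular street-view panorama (hypothetical path).
pano = np.asarray(Image.open("panorama.jpg"))

# Extract a rectilinear view roughly perpendicular to the street axis.
# u_deg rotates horizontally (90 deg looks sideways); the field of view and
# output size are illustrative, not the study's exact parameters.
frontal = py360convert.e2p(pano, fov_deg=(90, 90), u_deg=90, v_deg=0,
                           out_hw=(1000, 1000))
Image.fromarray(frontal.astype(np.uint8)).save("facade_view.jpg")
```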
In the datasets we collected, we identified the following types of materials present on facades, based on predominant construction classes typically defined in thermal simulation studies: (1) Brick (2) Concrete (3) Glazing (4) Roof tiles (5) Metal panels (6) Wood siding (7) Plaster/stucco. It is equally important to detect facade objects, such as balconies and air conditioners (ACs), for further analysis, as they may also contribute to local microclimate conditions: for instance, balconies provide shading and ACs release heat into urban canyons, contributing to anthropogenic heat. Therefore, these objects are detected initially and isolated from the regional segmentation task. The object classes are as follows: (1) AC (2) Storefront (3) Balcony (4) Door. We detect these object-based surfaces with Grounding DINO and feed them into the final panoptic segmentation prediction before executing material segmentation. To account for facade occlusions, we rejected images with significant obstructions from our dataset and incorporated a process that automatically treats non-material classes, such as cars or lamps, as occlusions, ensuring only relevant materials are segmented. Additional automated methods, such as inpainting and object removal, can also be integrated39.
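A minimal sketch of this text-prompted object detection step, using the open-source Grounding DINO inference utilities. The config and checkpoint paths are hypothetical, and the thresholds are illustrative rather than the study’s tuned values.

```python
from groundingdino.util.inference import load_model, load_image, predict

# Hypothetical paths to the Grounding DINO config and pretrained weights.
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth",
                   device="cpu")  # the study ran Grounding DINO on CPU
image_source, image = load_image("facade_view.jpg")

# One text prompt listing the four facade object classes.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="air conditioner . storefront . balcony . door",
    box_threshold=0.35,   # illustrative thresholds
    text_threshold=0.25,
    device="cpu",
)
# boxes (normalized cx, cy, w, h) are passed on to the panoptic
# segmentation stage, which refines exact object boundaries.
print(list(zip(phrases, logits.tolist())))
```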
To tackle the challenge of high texture complexity in the material segmentation task, we propose applying zero-shot classification to separate smaller patches within the facade. To do so, we first parse the facade into separate segments using SAM. Next, we crop each segment out of the image and classify it as belonging to one of the material labels using OpenAI CLIP, which distinguishes between material textures using contextual information from an image patch padded by twice the width and height of its bounding box. To improve detection inference, a low-resolution segmentation provided by CLIPSeg is used to leverage regions of attention associated with specific materials. As noted previously, CLIPSeg is adapted to finding regions of attention from a base 225 × 225 transformer mask, which makes it unsuitable for performing segmentation on large and complex scenes. However, given that our image fragment constitutes most of the data needed, it is sufficient for distinguishing which classes draw attention within the classification mask. To allocate a material class to the target patch, we compute the normalized sum of CLIPSeg attention that falls within the boundaries of the mask and return the label with the highest result, as sketched below. Figure 1 details the full pipeline. To select between the two classifiers, we compared the detection capabilities of OpenAI CLIP and CLIPSeg across the seven material classes analyzed; CLIPSeg showed higher accuracy and was therefore selected for our workflow. For glazing, two distinct approaches were adopted, detecting it both as objects (distinct windows) and as surfaces (fully glazed facades). This dual strategy allows for a more thorough representation.
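A minimal sketch of this attention-pooling step, using the Hugging Face CLIPSeg implementation. The normalization here (each class heatmap divided by its total attention) is one plausible reading of the normalized sum; the exact normalization, prompts, and resizing choices of the study are assumptions.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def classify_patch(patch: Image.Image, sam_mask: np.ndarray, prompts: list) -> str:
    """Assign a material label to a SAM segment by summing the CLIPSeg
    attention heatmap that falls inside the segment's mask."""
    inputs = processor(text=prompts, images=[patch] * len(prompts),
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        heatmaps = torch.sigmoid(model(**inputs).logits)  # (n_prompts, H', W')
    # Resize the binary SAM mask to the heatmap resolution.
    h, w = heatmaps.shape[-2:]
    mask = np.array(Image.fromarray(sam_mask.astype(np.uint8))
                    .resize((w, h), Image.NEAREST)) > 0
    mask_t = torch.from_numpy(mask)
    # Normalized attention sum restricted to the mask interior
    # (one plausible normalization; see lead-in above).
    scores = [(hm[mask_t].sum() / hm.sum().clamp(min=1e-8)).item()
              for hm in heatmaps]
    return prompts[int(np.argmax(scores))]
```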
For the segmentation process, we utilized image resolutions of either 1000 × 1000 or 912 × 912 pixels, depending on the complexity of the scene, and set the classification patch size to 1.3 times the bounding box around each detected element to improve material classification accuracy. For SAM, we fine-tuned several hyperparameters: specifically, we increased the number of points per side to 64 to achieve finer segmentation, set the prediction Intersection over Union (IoU) threshold to 0.75 for boosted accuracy, and set the stability score threshold to 0.75 for more reliable segmentation results. CLIPSeg tuning involved crafting precise text prompts to accurately capture material classes, as some materials required multiple or more specific descriptions to improve segmentation accuracy. For example, ‘plaster’ and ‘stucco’ were both used to describe stucco walls, while ‘concrete’ was labeled as ‘exposed concrete’ for clarity. These prompt refinements, referred to as prompt engineering, are detailed in Table 1 in the supplementary data and explored further in the paper for their contribution to prediction accuracy. While the pretrained segmentation models are efficient during inference, computational complexity increases with image resolution and the number of segmentation tasks. Our segmentation runtime depends on the performance of the integrated models and includes: (1) semantic segmentation with ADE20K (0.4 ± 0.6 s/image); (2) patch detection with SAM (13.4 ± 1.5 s); (3) iterative classification of all patches across the image with either OpenAI CLIP (3.9 ± 2.1 s) or CLIPSeg (23.8 ± 9.8 s). The total segmentation runtime is either 17.6 s (OpenAI CLIP) or 37.6 s (CLIPSeg) per image, as run on a single NVIDIA GeForce RTX 2080 Ti. The panoptic segmentation of windows and doors required us to run Grounding DINO on CPU, which added 20.4 ± 0.2 s per image.
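The SAM configuration described above maps directly onto the segment-anything API, as in this sketch; the checkpoint path and backbone choice are assumptions, while the threshold values are those reported in the text.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Hypothetical checkpoint path; ViT-H is the largest released SAM backbone.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# Hyperparameters as reported: denser point sampling for finer facade
# segments, with IoU and stability thresholds of 0.75.
mask_generator = SamAutomaticMaskGenerator(
    model=sam,
    points_per_side=64,
    pred_iou_thresh=0.75,
    stability_score_thresh=0.75,
)

image = np.asarray(Image.open("facade_view.jpg"))
masks = mask_generator.generate(image)  # dicts with 'segmentation', 'bbox', 'area', ...
```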
Model assessment
The evaluation of our urban facade material segmentation model was systematically conducted in three distinct phases to ensure robustness and applicability across different urban settings and scales. To explore the performance trade-off between zero-shot and pretrained models, we have also included a comparison with SegFormer40, a state-of-the-art segmentation network that we trained on our dataset.
Close-range image testing
The initial phase of testing involved a dataset comprising 393 close-range material segmentation images in light industrial environments (LIB-HSI)22. This dataset was specifically chosen to validate the model’s ability to accurately recognize and classify a wide variety of building materials under controlled conditions where texture delineations are more prominent. Besides the RGB images, it provides corresponding infrared images and is primarily targeted at exploring accurate material segmentation via joint RGB and infrared inputs. Here we compare the performance of our algorithm to the segmentation network trained on the RGB portion of the dataset.
Cross-city architectural representation testing
Subsequently, the model was tested on a more diverse set of images to evaluate its performance across varied architectural styles and urban environments. To explore applications in diverse urban contexts, we collected 144 facade images (~ 50 from each city) from three cities: Boston (North America), Amsterdam (Europe), and Dubai (Asia). The intention was to obtain representative buildings in each material class. For instance, Amsterdam is characterized by a larger proportion of brick construction in its building stock, while some areas of Boston have higher proportions of single-family wood construction. These images were manually labeled using Segments.ai41 and served as ground truth when computing the assessment scores. This selection aimed to evaluate the model’s adaptability and accuracy in different global contexts. Alongside our model’s evaluation, the trained SegFormer model was also tested on this dataset to benchmark the accuracy of our approach across different urban contexts. Key parameters used during training included seven material classes, a learning rate of 0.00006, and 10 epochs with a batch size of 10. The full set of 144 images was split into 5 folds: 4 folds formed the training set and 1 fold the validation set.
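A minimal fine-tuning sketch mirroring the reported SegFormer settings (7 classes, learning rate 6e-5, 10 epochs, batch size 10), using the Hugging Face implementation. The backbone variant and the placeholder tensors standing in for the labeled facade images are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import SegformerForSemanticSegmentation

# Backbone choice (MiT-B0) is an assumption; the decode head is newly
# initialized for the seven material classes.
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b0", num_labels=7, ignore_mismatched_sizes=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)

# Placeholder data standing in for the labeled facade images and masks.
images = torch.randn(20, 3, 512, 512)
labels = torch.randint(0, 7, (20, 512, 512))
loader = DataLoader(TensorDataset(images, labels), batch_size=10, shuffle=True)

model.train()
for epoch in range(10):
    for pixel_values, masks in loader:
        loss = model(pixel_values=pixel_values, labels=masks).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```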
Large-scale urban deployment
In the final phase, the model was deployed in a sample area of roughly 1 km² in each of our assessment cities. This deployment aimed to assess the model’s effectiveness in mapping material distributions at a neighborhood scale. The focus was to showcase the spatial distribution of materials and their prevalence within typical urban blocks, providing insights into urban material distributions and their potential impact on urban heat island effects. Each assessment point comprises two panorama-derived views (0° and 180°). Within the spatially defined areas, about 1200 images were captured per city; the number of points per city varied slightly due to differing urban built-up densities.
Evaluation metrics
The first metric computed is a domain-specific computer vision metric that assesses segmentation accuracy. The weighted Intersection over Union (IoU) is an adaptation of the standard IoU metric, used extensively in image segmentation tasks to evaluate the overlap between predicted segmentation masks and ground truth masks. This modified metric is particularly relevant to material segmentation, where materials can vary significantly in surface coverage across an image and in their impact on a building’s thermal properties; weighting ensures that the evaluation aligns with the practical implications of correctly or incorrectly segmenting each material. The weighted IoU is reported per material class and also as an aggregate score using Eq. (1). We also document precision and recall scores to measure positive predictions42.
$$\text{Weighted IoU} = \frac{\sum_{i=1}^{n} w_i \times IoU_i}{\sum_{i=1}^{n} w_i}$$
(1)
where $IoU_i$ is the IoU for material class $i$ and $w_i$ is the weight based on the computed pixel ratio of the class.
Additionally, we implement a material presence threshold to filter out predicted material classes that occupy only a few pixels of the image, ensuring that minor false positives do not skew the evaluation. A sketch of this computation follows.
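A minimal implementation of Eq. (1) together with the presence threshold, assuming integer-labeled prediction and ground-truth masks; the 0.5% threshold value is an illustrative assumption.

```python
import numpy as np

def weighted_iou(pred: np.ndarray, truth: np.ndarray, n_classes: int,
                 presence_thresh: float = 0.005) -> float:
    """Pixel-ratio-weighted IoU (Eq. 1). Predicted classes covering less
    than `presence_thresh` of the image and absent from the ground truth
    are skipped, so minor false positives do not skew the score."""
    total = truth.size
    num = den = 0.0
    for c in range(n_classes):
        p, t = pred == c, truth == c
        if not t.any() and p.sum() / total < presence_thresh:
            continue  # material presence threshold
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both masks
        iou = np.logical_and(p, t).sum() / union
        w = t.sum() / total  # weight: ground-truth pixel ratio of class c
        num += w * iou
        den += w
    return num / den if den > 0 else 0.0
```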
To further advance the interpretability of the detected facade materials, we propose two material-specific metrics that report at different granularities. Each metric provides a different lens through which the effectiveness and comprehensiveness of material segmentation can be assessed for domain-specific applications. This approach is particularly relevant where both the detection (presence) of objects or materials and the accuracy of their spatial localization are critical; these metrics are applied only to the cross-city facade validation. They allow researchers and urban planners to focus on precision where it matters most, ensuring that critical material knowledge is derived. They are as follows:
Predominant material class
This metric identifies the material that covers the largest area of the building’s facade, indicating the primary material. This could potentially be used to infer construction and surface coverage for higher level studies.
$$\text{Predominant Material} = \arg\max_{m \in M} A_{m}$$
(2)
where $A_m$ is the area covered by material $m$ and $M$ is the set of all detected materials on the facade.
Material presence
This metric measures the completeness of material detection by calculating the percentage of correctly identified materials present on the facade. We use it to check for the presence of the three material classes with the largest area coverage on the facade.
$$\text{Material Presence} = \frac{N_{\text{detected}}}{N_{\text{total}}} \times 100\%$$
(3)
where $N_{\text{detected}}$ is the number of materials correctly detected on the facade and $N_{\text{total}}$ is the total number of different materials actually present on the facade.
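Both metrics reduce to a few lines over integer-labeled masks, as in this sketch of Eqs. (2) and (3); the top-3 restriction follows the text above.

```python
import numpy as np

def predominant_material(pred: np.ndarray, labels: list) -> str:
    """Eq. (2): the material class covering the largest facade area."""
    areas = [(pred == c).sum() for c in range(len(labels))]
    return labels[int(np.argmax(areas))]

def material_presence(pred: np.ndarray, truth: np.ndarray, top_k: int = 3) -> float:
    """Eq. (3): percentage of the top-k ground-truth materials (by area
    coverage) detected anywhere in the prediction."""
    classes, counts = np.unique(truth, return_counts=True)
    top = classes[np.argsort(counts)[::-1][:top_k]]
    detected = sum(1 for c in top if (pred == c).any())
    return 100.0 * detected / len(top)
```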
UTCI impact measurement
To demonstrate the impact of material coverage in our studied cities, the Universal Thermal Climate Index (UTCI) was calculated to compute the annual thermal comfort, heat stress, and cold stress hours in a test urban canyon. The UTCI is a measure that integrates the effects of air temperature, wind speed, humidity, and mean radiant temperature (MRT) to evaluate human thermal exposure in outdoor environments. MRT was sourced from the building surfaces, representing the average temperature of all surfaces surrounding the midpoint of the canyon area, weighted by the angle of exposure and the emissivity of the assigned material. MRT and UTCI were calculated using ClimateStudio43, a plugin for environmental studies that provides hourly surface temperatures based on physical models of solar radiation, surface properties, and environmental conditions, utilizing the EnergyPlus solver44. The canyon geometry measured 5 m in width with a 9 m average building height, and the adjacent enclosing walls were north- and south-facing. Nine scenarios were simulated to study the effect of three material classes (brick, glass, and wood) in the three selected cities (Dubai, Amsterdam, and Boston). We assume an 80% facade glazing coverage in the glass setup and model the wood scenario as light-colored siding. The material properties utilized are provided in Fig. 5 along with the results. All other geometric and simulation-based settings were kept constant across the cases. The outcomes helped in understanding the impact of material properties on surface temperatures and subsequent human thermal comfort levels.
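For a single hour, the UTCI aggregation step can be illustrated with the open-source pythermalcomfort package (an assumption; the study used ClimateStudio/EnergyPlus). The input values below are made-up examples, with MRT standing in for the surface-weighted canyon value described above.

```python
from pythermalcomfort.models import utci

# Made-up example inputs for one hour at the canyon midpoint.
tdb = 32.0  # air temperature, deg C
tr = 48.0   # mean radiant temperature (MRT) from surrounding surfaces, deg C
v = 1.5     # wind speed, m/s
rh = 40.0   # relative humidity, %

# UTCI equivalent temperature; tallying hourly values against the UTCI
# stress bands yields annual heat/cold stress hours.
print(utci(tdb=tdb, tr=tr, v=v, rh=rh))
```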