Adapting Vision Foundation Models for Robust
Cloud Segmentation in Remote Sensing Images

1Beijing Jiaotong University, 2Qinghai University,
3Tsinghua University, 4Beijing University of Posts and Telecommunications


arXiv 2024

*Equal Contribution, Corresponding Author

Unlike previous methods, in which the entire network is trained end to end, our Cloud-Adapter approach pairs a frozen vision foundation model (VFM) with a lightweight, trainable adapter. This design preserves the VFM's generalization ability while enabling efficient, adaptable learning for cloud segmentation.

Abstract

Cloud segmentation is a critical challenge in remote sensing image interpretation, as its accuracy directly impacts the effectiveness of subsequent data processing and analysis. Recently, vision foundation models (VFMs) have demonstrated powerful generalization capabilities across various visual tasks. In this paper, we present a parameter-efficient adaptive approach, termed Cloud-Adapter, designed to enhance the accuracy and robustness of cloud segmentation. Our method leverages a VFM pretrained on general-domain data, which remains frozen, so no fine-tuning of the backbone is required. Cloud-Adapter incorporates a lightweight spatial perception module that first uses a convolutional neural network (ConvNet) to extract dense, multi-scale spatial representations. These features are then aggregated and serve as contextual input to an adapting module, which modulates the frozen transformer layers within the VFM. Experimental results demonstrate that Cloud-Adapter achieves substantial performance gains with trainable parameters amounting to only 0.6% of the frozen backbone. Cloud-Adapter consistently attains state-of-the-art (SOTA) performance across a wide variety of cloud segmentation datasets spanning multiple satellite sources, sensor series, data processing levels, land cover scenarios, and annotation granularities. We have released the source code and pretrained models on GitHub to support further research.
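As a rough illustration of the parameter-efficient setup described above (a minimal sketch, not the authors' released code), the snippet below freezes a stand-in ViT backbone from torchvision and trains only a small, hypothetical adapter, then reports the trainable-to-frozen parameter ratio. The backbone choice and adapter layout are assumptions for illustration; the 0.6% figure in the abstract refers to Cloud-Adapter's actual modules, not this toy adapter.

# Minimal sketch (not the authors' code): freeze a stand-in VFM backbone and
# train only a small adapter; report how many parameters actually receive gradients.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights  # stand-in for the VFM

backbone = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)

# Freeze every backbone parameter; only the adapter will be trained.
for p in backbone.parameters():
    p.requires_grad = False

# Hypothetical lightweight adapter, a toy stand-in for Cloud-Adapter's modules.
adapter = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv2d(64, 768, kernel_size=1),
)

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters())
print(f"trainable / frozen backbone = {trainable / frozen:.1%}")

# Only the adapter's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)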

Method

Framework: Detailed network architecture of the proposed Cloud-Adapter method, consisting of the spatial perception and adapting modules. The spatial perception module uses ConvNet blocks to extract dense spatial features, which are aggregated into a multi-scale context and fed to the adapting module. The adapting module modulates the frozen transformer layers in the VFM.
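The sketch below is a hypothetical rendering of this data flow, not the released implementation: the module names, feature widths, and the cross-attention form of the modulation signal are assumptions made purely to illustrate how frozen ViT blocks could be modulated by trainable adapter components.

# Hypothetical sketch of the described data flow (not the released code).
import torch
import torch.nn as nn

class SpatialPerception(nn.Module):
    """ConvNet stem that extracts dense multi-scale features and aggregates them."""
    def __init__(self, dim=768):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(3, 64, 3, stride=4, padding=1),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
            nn.Conv2d(128, 256, 3, stride=2, padding=1),
        ])
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (64, 128, 256)])

    def forward(self, x):
        feats = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            # Pool each scale to a compact token and project to the ViT width.
            feats.append(proj(x).mean(dim=(2, 3)))
        return torch.stack(feats, dim=1)          # (B, num_scales, dim) context

class AdaptingModule(nn.Module):
    """Turns the aggregated spatial context into a per-block modulation signal."""
    def __init__(self, dim=768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens, context):
        # Cross-attend ViT tokens to the spatial context and add the result.
        mod, _ = self.attn(tokens, context, context)
        return tokens + mod

def forward_with_adapter(image, vit_blocks, patch_embed, perception, adapters):
    """Frozen ViT blocks are interleaved with trainable adapter modulation."""
    context = perception(image)                   # multi-scale spatial context
    tokens = patch_embed(image)                   # (B, N, dim) frozen embedding
    for block, adapter in zip(vit_blocks, adapters):
        tokens = adapter(tokens, context)         # trainable modulation
        tokens = block(tokens)                    # frozen transformer layer
    return tokens

In this sketch only the perception and adapting modules receive gradients during training, mirroring the frozen-backbone design described above.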

Results

Visualization on the Six Datasets: HRC, GF1, GF2, L1C, L2A, and L8B.

Class legend: Clear Sky, Thick Cloud, Thin Cloud, Cloud Shadow.

Quantitative Comparison (mIoU, %, ↑) with Existing Methods.



Method          Binary-Class Cloud Segmentation    Multi-Class Cloud Segmentation
                HRC      GF1      GF2              L1C      L2A      L8B
SCNN            57.22    81.68    76.99            22.75    28.76    32.38
CDNetv1         77.79    81.82    78.20            60.35    62.39    34.58
CDNetv2         76.75    84.93    78.84            65.60    66.05    43.63
MCDNet          53.50    85.16    78.36            44.80    46.52    33.85
UNetMobv2       79.91    91.71    80.44            71.65    70.36    47.76
DBNet           77.78    91.36    78.68            65.52    65.65    51.41
HRCloudNet      83.44    91.86    75.57            68.26    68.35    43.51
KappaMask       67.48    92.42    72.00            41.27    45.28    42.12
Cloud-Adapter   89.05    92.55    83.02            74.18    73.38    57.53
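For reference, the snippet below shows one standard way to compute the per-dataset mIoU reported above from a pixel-level confusion matrix; the four-class setting and the handling of absent classes are assumptions for illustration, not the authors' evaluation code.

# Minimal mIoU sketch: confusion matrix over all pixels, IoU per class, then mean.
import numpy as np

def mean_iou(pred, target, num_classes):
    """pred, target: integer label maps of identical shape."""
    conf = np.bincount(
        target.ravel() * num_classes + pred.ravel(),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)  # rows: ground truth, cols: prediction
    ious = []
    for c in range(num_classes):
        inter = conf[c, c]
        union = conf[c, :].sum() + conf[:, c].sum() - inter
        if union > 0:                    # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) * 100.0  # mIoU in percent

# Toy usage with a 4-class (clear sky / thick cloud / thin cloud / shadow) map.
pred = np.random.randint(0, 4, size=(64, 64))
target = np.random.randint(0, 4, size=(64, 64))
print(f"mIoU = {mean_iou(pred, target, num_classes=4):.2f}%")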

BibTeX

@misc{cloud-adapter,
      title={Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images}, 
      author={Xuechao Zou and Shun Zhang and Kai Li and Shiying Wang and Junliang Xing and Lei Jin and Congyan Lang and Pin Tao},
      year={2024},
      eprint={2411.13127},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.13127}, 
}

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62072027.