Dynamic Dictionary Learning for Remote Sensing Image Segmentation

1Beijing Jiaotong University, 2Qinghai University, 3Tsinghua University


ICCV 2025

*Equal Contribution, Corresponding Author
[Teaser figure: overview of the dynamic dictionary learning framework]

Unlike traditional segmentation methods that rely solely on implicit feature learning, our dynamic dictionary learning framework explicitly constructs and refines class-aware semantic embeddings through iterative cross-attention. By progressively updating a semantic dictionary via multi-stage interactions between image features and dictionary entries, our approach effectively tackles intra-class heterogeneity and inter-class similarity. This adaptive mechanism preserves robust generalization while capturing subtle morphological variations, leading to improved performance across diverse remote sensing scenarios.

Abstract

Remote sensing image segmentation faces persistent challenges in distinguishing morphologically similar categories and adapting to diverse scene variations. While existing methods rely on implicit representation learning paradigms, they often fail to dynamically adjust semantic embeddings according to contextual cues, leading to suboptimal performance in fine-grained scenarios such as cloud thickness differentiation. This paper introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. The core innovation lies in a novel dictionary construction mechanism, where class-aware semantic embeddings are progressively updated via multi-stage alternating cross-attention querying between image features and dictionary embeddings. This process enables adaptive representation learning tailored to input-specific characteristics, effectively resolving ambiguities in intra-class heterogeneity and inter-class homogeneity. To further enhance discriminability, a contrastive constraint is applied to the dictionary space, ensuring compact intra-class distributions while maximizing inter-class separability. Extensive experiments across both coarse- and fine-grained datasets demonstrate consistent improvements over state-of-the-art methods, particularly in two online test benchmarks (LoveDA and UAVid). Code is available on GitHub.
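To make the contrastive constraint on the dictionary space concrete, here is a minimal sketch of one way such a constraint could be written. The InfoNCE-style form, the temperature tau, and the function name dictionary_contrastive_loss are our assumptions, not the paper's released loss:

```python
import torch
import torch.nn.functional as F

def dictionary_contrastive_loss(feats, labels, dictionary, tau=0.1):
    """Hypothetical InfoNCE-style constraint between pixel features and a
    class dictionary: each feature is pulled toward its own class entry
    (compact intra-class distributions) and pushed away from all other
    entries (inter-class separability). A sketch, not the paper's loss.

    feats:      (N, C) pixel or region features
    labels:     (N,)   ground-truth class indices in [0, K)
    dictionary: (K, C) class-aware dictionary embeddings
    """
    f = F.normalize(feats, dim=-1)            # unit-norm features
    d = F.normalize(dictionary, dim=-1)       # unit-norm dictionary entries
    logits = f @ d.t() / tau                  # (N, K) cosine similarities
    return F.cross_entropy(logits, labels)    # positive pair: own class entry
```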

Method

Framework: Our network architecture is built upon three key modules: an encoder, a dictionary generator, and a decoder. The encoder extracts rich, multi-scale features from the input imagery, which are then channeled into the dictionary generator. Here, a static set of class-specific embeddings is transformed into a dynamic dictionary through a dedicated modulator that leverages attention mechanisms to capture input-dependent contextual cues. Finally, the decoder employs iterative alternating cross-attention between the dynamic dictionary and image features, progressively refining both to produce precise segmentation maps tailored for complex remote sensing tasks.
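The three-module layout above can be sketched compactly in PyTorch. This is a minimal illustration under our own assumptions: the class name DynamicDictionarySegmenter, the residual attention updates, the feature dimension, and the use of three refinement stages are ours, not the released implementation:

```python
import torch
import torch.nn as nn

class DynamicDictionarySegmenter(nn.Module):
    """Minimal sketch of the encoder / dictionary generator / decoder layout.
    Module names, dimensions, and the number of refinement stages are
    illustrative assumptions, not the released code."""

    def __init__(self, encoder, num_classes, dim=256, num_stages=3, heads=8):
        super().__init__()
        self.encoder = encoder                             # backbone -> (B, dim, H, W)
        self.static_dict = nn.Embedding(num_classes, dim)  # class-specific embeddings
        # Modulator: conditions the static dictionary on input-dependent context.
        self.modulator = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One cross-attention pair per refinement stage.
        self.dict_from_feats = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(num_stages))
        self.feats_from_dict = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(num_stages))

    def forward(self, img):
        feats = self.encoder(img)                          # (B, dim, H, W)
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)          # (B, HW, dim)
        d = self.static_dict.weight.unsqueeze(0).expand(B, -1, -1)  # D0: (B, K, dim)
        d, _ = self.modulator(d, tokens, tokens)           # dynamic, input-aware dictionary
        for dq, fq in zip(self.dict_from_feats, self.feats_from_dict):
            d = d + dq(d, tokens, tokens)[0]               # dictionary queries features
            tokens = tokens + fq(tokens, d, d)[0]          # features query dictionary
        # Segmentation logits as similarity between pixel tokens and class entries.
        logits = torch.einsum('bnc,bkc->bnk', tokens, d)   # (B, HW, K)
        return logits.transpose(1, 2).reshape(B, -1, H, W) # (B, K, H, W)
```

Given a backbone returning (B, 256, H/4, W/4) features, model = DynamicDictionarySegmenter(encoder, num_classes=7) would produce per-class logits at 1/4 resolution, to be upsampled to the input size.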

[t-SNE visualization of the dynamic dictionary's evolution on the Grass dataset]

The figure illustrates the evolution of the dynamic dictionary on the Grass dataset via t-SNE visualization. As iterative optimization progresses from the initial state (D0) to the final state (D3), intra-class distances shrink markedly (from 33.61 to 28.51) while inter-class distances gradually expand, enhancing class discriminability. This validates the effectiveness of dynamic dictionary learning in improving semantic recognition for fine-grained segmentation tasks.
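For concreteness, the intra- and inter-class distances quoted above could be measured as follows. The exact metric behind the 33.61 and 28.51 figures is not specified on this page, so this averaged-Euclidean form and the function name dictionary_distances are assumptions:

```python
import torch

def dictionary_distances(points, labels):
    """Sketch: mean intra-class distance (point to its own class centroid) and
    mean inter-class distance (between class centroids), computed on 2-D t-SNE
    coordinates of dictionary states D0..D3. A hypothetical metric.

    points: (N, 2) t-SNE coordinates; labels: (N,) class indices.
    """
    classes = labels.unique()
    centroids = torch.stack([points[labels == c].mean(0) for c in classes])
    # Average distance from each point to its own class centroid.
    intra = torch.cat([
        (points[labels == c] - centroids[i]).norm(dim=1)
        for i, c in enumerate(classes)
    ]).mean()
    # Average pairwise distance between distinct class centroids.
    pairwise = torch.cdist(centroids, centroids)
    off_diag = pairwise[~torch.eye(len(classes), dtype=torch.bool)]
    return intra.item(), off_diag.mean().item()
```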

Results

Visualization on the Six Datasets: LoveDA, UAVid, Potsdam, Vaihingen, Cloud, and Grass.

Comparison of semantic segmentation performance on the LoveDA dataset.



| Method | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU ↑ |
|---|---|---|---|---|---|---|---|---|
| TransUNet | 43.0 | 56.1 | 53.7 | 78.0 | 9.3 | 44.9 | 56.9 | 48.9 |
| DC-Swin | 41.3 | 54.5 | 56.2 | 78.1 | 14.5 | 47.2 | 62.4 | 50.6 |
| UNetFormer | 44.7 | 58.8 | 54.9 | 79.6 | 20.1 | 46.0 | 62.5 | 52.4 |
| Hi-ResNet | 46.7 | 58.3 | 55.9 | 80.1 | 17.0 | 46.7 | 62.7 | 52.5 |
| AerialFormer | 47.8 | 60.7 | 59.3 | 81.5 | 17.9 | 47.9 | 64.0 | 54.1 |
| SFA-Net | 48.4 | 60.3 | 59.1 | 81.9 | 24.1 | 46.2 | 64.0 | 54.9 |
| Ours | 47.6 | 61.2 | 59.1 | 81.6 | 23.8 | 48.8 | 64.8 | 55.3 |

Comparison of semantic segmentation performance on the UAVid dataset.



| Method | Clutter | Building | Road | Tree | Vegetation | Moving Car | Static Car | Human | mIoU ↑ |
|---|---|---|---|---|---|---|---|---|---|
| DANet | 64.9 | 85.9 | 77.9 | 78.3 | 61.5 | 59.6 | 47.4 | 9.1 | 60.6 |
| ABCNet | 67.4 | 86.4 | 81.2 | 79.9 | 63.1 | 69.8 | 48.4 | 13.9 | 63.8 |
| BANet | 66.7 | 85.4 | 80.7 | 78.9 | 62.1 | 69.3 | 52.8 | 21.0 | 64.6 |
| SegFormer | 66.6 | 86.3 | 80.1 | 79.6 | 62.3 | 72.5 | 52.5 | 28.5 | 66.0 |
| UNetFormer | 68.4 | 87.4 | 81.5 | 80.2 | 63.5 | 73.6 | 56.4 | 31.0 | 67.8 |
| SFA-Net | 70.2 | 89.0 | 82.7 | 80.8 | 64.6 | 77.5 | 67.5 | 30.7 | 70.4 |
| Ours | 71.0 | 89.7 | 83.2 | 82.1 | 66.1 | 75.0 | 59.0 | 41.4 | 70.9 |

Comparison of semantic segmentation performance on the Potsdam dataset.



| Method | Impervious Surface | Building | Low Vegetation | Tree | Car | mF1 ↑ |
|---|---|---|---|---|---|---|
| DANet | 91.0 | 95.6 | 86.1 | 87.6 | 84.3 | 88.9 |
| ABCNet | 93.5 | 96.9 | 87.9 | 89.1 | 95.8 | 92.7 |
| Segmenter | 91.5 | 95.3 | 85.4 | 85.0 | 88.5 | 89.2 |
| BANet | 93.3 | 96.7 | 87.4 | 89.1 | 96.0 | 92.5 |
| SwinUperNet | 93.2 | 96.4 | 87.6 | 88.6 | 95.4 | 92.2 |
| DC-Swin | 94.2 | 97.6 | 88.6 | 89.6 | 96.3 | 93.3 |
| UNetFormer | 93.6 | 97.2 | 87.7 | 90.6 | 96.5 | 93.5 |
| AerialFormer | 95.5 | 98.1 | 89.8 | 90.8 | 97.5 | 94.1 |
| SFA-Net | 95.0 | 97.5 | 88.3 | 90.2 | 97.1 | 93.5 |
| Ours | 96.1 | 97.9 | 90.6 | 91.2 | 97.5 | 94.7 |

Comparison of semantic segmentation performance on the Vaihingen dataset.



| Method | Impervious Surface | Building | Low Vegetation | Tree | Car | mF1 ↑ |
|---|---|---|---|---|---|---|
| DANet | 90.0 | 93.9 | 82.2 | 87.3 | 44.5 | 79.6 |
| ABCNet | 92.7 | 95.2 | 84.5 | 89.7 | 85.3 | 89.5 |
| BANet | 92.2 | 95.2 | 83.8 | 89.9 | 86.8 | 89.6 |
| Segmenter | 89.8 | 93.0 | 81.2 | 88.9 | 67.6 | 84.1 |
| SwinUperNet | 92.8 | 95.6 | 85.1 | 90.6 | 85.1 | 89.8 |
| DC-Swin | 93.6 | 96.2 | 85.8 | 90.4 | 87.6 | 90.7 |
| UNetFormer | 92.7 | 95.3 | 84.9 | 90.6 | 88.5 | 90.4 |
| SFA-Net | 93.5 | 96.3 | 85.4 | 90.2 | 90.7 | 91.2 |
| Ours | 97.1 | 96.0 | 85.4 | 90.5 | 90.5 | 91.9 |

Comparison of semantic segmentation performance on the Cloud dataset.



| Method | mIoU ↑ | OA ↑ | mF1 ↑ |
|---|---|---|---|
| MCDNet | 33.85 | 69.75 | 42.76 |
| SCNN | 32.38 | 71.22 | 52.41 |
| CDNetv1 | 34.58 | 68.16 | 45.80 |
| KappaMask | 42.12 | 76.63 | 68.47 |
| UNetMobv2 | 47.76 | 82.00 | 56.91 |
| CDNetv2 | 43.63 | 78.56 | 70.33 |
| HRCloudNet | 43.51 | 77.04 | 71.36 |
| KTDA | 51.49 | 83.55 | 60.08 |
| SFA-Net | 74.88 | 91.81 | 84.64 |
| Ours | 82.16 | 94.90 | 89.65 |

Comparison of semantic segmentation performance on the Grass dataset.



| Method | mIoU ↑ | OA ↑ | mF1 ↑ |
|---|---|---|---|
| FCN | 47.47 | 67.85 | 61.99 |
| PSPNet | 47.95 | 69.12 | 62.55 |
| DeepLabV3+ | 47.95 | 68.97 | 62.50 |
| UNet | 48.17 | 69.77 | 62.34 |
| SegFormer | 48.29 | 68.93 | 62.82 |
| Mask2Former | 44.93 | 65.90 | 58.91 |
| DINOv2 | 47.57 | 71.54 | 61.70 |
| KTDA | 50.86 | 74.26 | 65.01 |
| SFA-Net | 51.21 | 71.41 | 65.76 |
| Ours | 51.96 | 72.26 | 66.27 |

BibTeX

@misc{zou2025dynamicdictionarylearningremote,
  title={Dynamic Dictionary Learning for Remote Sensing Image Segmentation},
  author={Xuechao Zou and Yue Li and Shun Zhang and Kai Li and Shiying Wang and Pin Tao and Junliang Xing and Congyan Lang},
  year={2025},
  eprint={2503.06683},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.06683},
}