Dynamic Dictionary Learning for Remote Sensing Image Segmentation

1Beijing Jiaotong University, 2Qinghai University, 3Tsinghua University


ICCV 2025

*Equal Contribution, Corresponding Author
[Teaser figure: overview of the dynamic dictionary learning framework]

Unlike traditional segmentation methods that rely solely on implicit feature learning, our dynamic dictionary learning framework explicitly constructs and refines class-aware semantic embeddings through iterative cross-attention. By progressively updating a semantic dictionary via multi-stage interactions between image features and dictionary entries, our approach effectively tackles intra-class heterogeneity and inter-class similarity. This adaptive mechanism preserves robust generalization while capturing subtle morphological variations, leading to improved performance across diverse remote sensing scenarios.

Abstract

Remote sensing image segmentation faces persistent challenges in distinguishing morphologically similar categories and adapting to diverse scene variations. While existing methods rely on implicit representation learning paradigms, they often fail to dynamically adjust semantic embeddings according to contextual cues, leading to suboptimal performance in fine-grained scenarios such as cloud thickness differentiation. This paper introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. The core innovation lies in a novel dictionary construction mechanism, where class-aware semantic embeddings are progressively updated via multi-stage alternating cross-attention querying between image features and dictionary embeddings. This process enables adaptive representation learning tailored to input-specific characteristics, effectively resolving ambiguities in intra-class heterogeneity and inter-class homogeneity. To further enhance discriminability, a contrastive constraint is applied to the dictionary space, ensuring compact intra-class distributions while maximizing inter-class separability. Extensive experiments across both coarse- and fine-grained datasets demonstrate consistent improvements over state-of-the-art methods, particularly in two online test benchmarks (LoveDA and UAVid). Code is available on GitHub.
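To make the contrastive constraint on the dictionary space concrete, here is a minimal sketch of one way such a constraint could be written. The InfoNCE-style form, the temperature tau, and the function name dictionary_contrastive_loss are our assumptions, not the paper's released loss:

```python
import torch
import torch.nn.functional as F

def dictionary_contrastive_loss(feats, labels, dictionary, tau=0.1):
    """Hypothetical InfoNCE-style constraint between pixel features and a
    class dictionary: each feature is pulled toward its own class entry
    (compact intra-class distributions) and pushed away from all other
    entries (inter-class separability). A sketch, not the paper's loss.

    feats:      (N, C) pixel or region features
    labels:     (N,)   ground-truth class indices in [0, K)
    dictionary: (K, C) class-aware dictionary embeddings
    """
    f = F.normalize(feats, dim=-1)            # unit-norm features
    d = F.normalize(dictionary, dim=-1)       # unit-norm dictionary entries
    logits = f @ d.t() / tau                  # (N, K) cosine similarities
    return F.cross_entropy(logits, labels)    # positive pair: own class entry
```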

Method

Framework: Our network architecture is built upon three key modules: an encoder, a dictionary generator, and a decoder. The encoder extracts rich, multi-scale features from the input imagery, which are then channeled into the dictionary generator. Here, a static set of class-specific embeddings is transformed into a dynamic dictionary through a dedicated modulator that leverages attention mechanisms to capture input-dependent contextual cues. Finally, the decoder employs iterative alternating cross-attention between the dynamic dictionary and image features, progressively refining both to produce precise segmentation maps tailored for complex remote sensing tasks.
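The three-module layout above can be sketched compactly in PyTorch. This is a minimal illustration under our own assumptions: the class name DynamicDictionarySegmenter, the residual attention updates, the feature dimension, and the use of three refinement stages are ours, not the released implementation:

```python
import torch
import torch.nn as nn

class DynamicDictionarySegmenter(nn.Module):
    """Minimal sketch of the encoder / dictionary generator / decoder layout.
    Module names, dimensions, and the number of refinement stages are
    illustrative assumptions, not the released code."""

    def __init__(self, encoder, num_classes, dim=256, num_stages=3, heads=8):
        super().__init__()
        self.encoder = encoder                             # backbone -> (B, dim, H, W)
        self.static_dict = nn.Embedding(num_classes, dim)  # class-specific embeddings
        # Modulator: conditions the static dictionary on input-dependent context.
        self.modulator = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One cross-attention pair per refinement stage.
        self.dict_from_feats = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(num_stages))
        self.feats_from_dict = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(num_stages))

    def forward(self, img):
        feats = self.encoder(img)                          # (B, dim, H, W)
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)          # (B, HW, dim)
        d = self.static_dict.weight.unsqueeze(0).expand(B, -1, -1)  # D0: (B, K, dim)
        d, _ = self.modulator(d, tokens, tokens)           # dynamic, input-aware dictionary
        for dq, fq in zip(self.dict_from_feats, self.feats_from_dict):
            d = d + dq(d, tokens, tokens)[0]               # dictionary queries features
            tokens = tokens + fq(tokens, d, d)[0]          # features query dictionary
        # Segmentation logits as similarity between pixel tokens and class entries.
        logits = torch.einsum('bnc,bkc->bnk', tokens, d)   # (B, HW, K)
        return logits.transpose(1, 2).reshape(B, -1, H, W) # (B, K, H, W)
```

Given a backbone returning (B, 256, H/4, W/4) features, model = DynamicDictionarySegmenter(encoder, num_classes=7) would produce per-class logits at 1/4 resolution, to be upsampled to the input size.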

[t-SNE visualization of the dynamic dictionary's evolution on the Grass dataset]

The figure illustrates the evolution of the dynamic dictionary on the Grass dataset via t-SNE visualization. As iterative optimization progresses from the initial state (D0) to the final state (D3), intra-class distances shrink markedly (from 33.61 to 28.51) while inter-class distances gradually expand, enhancing class discriminability. This validates the effectiveness of dynamic dictionary learning in improving semantic recognition for fine-grained segmentation tasks.
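For concreteness, the intra- and inter-class distances quoted above could be measured as follows. The exact metric behind the 33.61 and 28.51 figures is not specified on this page, so this averaged-Euclidean form and the function name dictionary_distances are assumptions:

```python
import torch

def dictionary_distances(points, labels):
    """Sketch: mean intra-class distance (point to its own class centroid) and
    mean inter-class distance (between class centroids), computed on 2-D t-SNE
    coordinates of dictionary states D0..D3. A hypothetical metric.

    points: (N, 2) t-SNE coordinates; labels: (N,) class indices.
    """
    classes = labels.unique()
    centroids = torch.stack([points[labels == c].mean(0) for c in classes])
    # Average distance from each point to its own class centroid.
    intra = torch.cat([
        (points[labels == c] - centroids[i]).norm(dim=1)
        for i, c in enumerate(classes)
    ]).mean()
    # Average pairwise distance between distinct class centroids.
    pairwise = torch.cdist(centroids, centroids)
    off_diag = pairwise[~torch.eye(len(classes), dtype=torch.bool)]
    return intra.item(), off_diag.mean().item()
```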

Results

Visualization on the Six Datasets: LoveDA, UAVid, Potsdam, Vaihingen, Cloud, and Grass.

Comparison of semantic segmentation performance on the LoveDA dataset.



| Method | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU ↑ |
|---|---|---|---|---|---|---|---|---|
| TransUNet | 43.0 | 56.1 | 53.7 | 78.0 | 9.3 | 44.9 | 56.9 | 48.9 |
| DC-Swin | 41.3 | 54.5 | 56.2 | 78.1 | 14.5 | 47.2 | 62.4 | 50.6 |
| UNetFormer | 44.7 | 58.8 | 54.9 | 79.6 | 20.1 | 46.0 | 62.5 | 52.4 |
| Hi-ResNet | 46.7 | 58.3 | 55.9 | 80.1 | 17.0 | 46.7 | 62.7 | 52.5 |
| AerialFormer | 47.8 | 60.7 | 59.3 | 81.5 | 17.9 | 47.9 | 64.0 | 54.1 |
| SFA-Net | 48.4 | 60.3 | 59.1 | 81.9 | 24.1 | 46.2 | 64.0 | 54.9 |
| Ours | 47.6 | 61.2 | 59.1 | 81.6 | 23.8 | 48.8 | 64.8 | 55.3 |

Comparison of semantic segmentation performance on the UAVid dataset.



| Method | Clutter | Building | Road | Tree | Vegetation | Moving Car | Static Car | Human | mIoU ↑ |
|---|---|---|---|---|---|---|---|---|---|
| DANet | 64.9 | 85.9 | 77.9 | 78.3 | 61.5 | 59.6 | 47.4 | 9.1 | 60.6 |
| ABCNet | 67.4 | 86.4 | 81.2 | 79.9 | 63.1 | 69.8 | 48.4 | 13.9 | 63.8 |
| BANet | 66.7 | 85.4 | 80.7 | 78.9 | 62.1 | 69.3 | 52.8 | 21.0 | 64.6 |
| SegFormer | 66.6 | 86.3 | 80.1 | 79.6 | 62.3 | 72.5 | 52.5 | 28.5 | 66.0 |
| UNetFormer | 68.4 | 87.4 | 81.5 | 80.2 | 63.5 | 73.6 | 56.4 | 31.0 | 67.8 |
| SFA-Net | 70.2 | 89.0 | 82.7 | 80.8 | 64.6 | 77.5 | 67.5 | 30.7 | 70.4 |
| Ours | 71.0 | 89.7 | 83.2 | 82.1 | 66.1 | 75.0 | 59.0 | 41.4 | 70.9 |

Comparison of semantic segmentation performance on the Potsdam dataset.



| Method | Impervious Surface | Building | Low Vegetation | Tree | Car | mF1 ↑ |
|---|---|---|---|---|---|---|
| DANet | 91.0 | 95.6 | 86.1 | 87.6 | 84.3 | 88.9 |
| ABCNet | 93.5 | 96.9 | 87.9 | 89.1 | 95.8 | 92.7 |
| Segmenter | 91.5 | 95.3 | 85.4 | 85.0 | 88.5 | 89.2 |
| BANet | 93.3 | 96.7 | 87.4 | 89.1 | 96.0 | 92.5 |
| SwinUperNet | 93.2 | 96.4 | 87.6 | 88.6 | 95.4 | 92.2 |
| DC-Swin | 94.2 | 97.6 | 88.6 | 89.6 | 96.3 | 93.3 |
| UNetFormer | 93.6 | 97.2 | 87.7 | 90.6 | 96.5 | 93.5 |
| AerialFormer | 95.5 | 98.1 | 89.8 | 90.8 | 97.5 | 94.1 |
| SFA-Net | 95.0 | 97.5 | 88.3 | 90.2 | 97.1 | 93.5 |
| Ours | 96.1 | 97.9 | 90.6 | 91.2 | 97.5 | 94.7 |

Comparison of semantic segmentation performance on the Vaihingen dataset.



| Method | Impervious Surface | Building | Low Vegetation | Tree | Car | mF1 ↑ |
|---|---|---|---|---|---|---|
| DANet | 90.0 | 93.9 | 82.2 | 87.3 | 44.5 | 79.6 |
| ABCNet | 92.7 | 95.2 | 84.5 | 89.7 | 85.3 | 89.5 |
| BANet | 92.2 | 95.2 | 83.8 | 89.9 | 86.8 | 89.6 |
| Segmenter | 89.8 | 93.0 | 81.2 | 88.9 | 67.6 | 84.1 |
| SwinUperNet | 92.8 | 95.6 | 85.1 | 90.6 | 85.1 | 89.8 |
| DC-Swin | 93.6 | 96.2 | 85.8 | 90.4 | 87.6 | 90.7 |
| UNetFormer | 92.7 | 95.3 | 84.9 | 90.6 | 88.5 | 90.4 |
| SFA-Net | 93.5 | 96.3 | 85.4 | 90.2 | 90.7 | 91.2 |
| Ours | 97.1 | 96.0 | 85.4 | 90.5 | 90.5 | 91.9 |

Comparison of semantic segmentation performance on the Cloud dataset.



| Method | mIoU ↑ | OA ↑ | mF1 ↑ |
|---|---|---|---|
| MCDNet | 33.85 | 69.75 | 42.76 |
| SCNN | 32.38 | 71.22 | 52.41 |
| CDNetv1 | 34.58 | 68.16 | 45.80 |
| KappaMask | 42.12 | 76.63 | 68.47 |
| UNetMobv2 | 47.76 | 82.00 | 56.91 |
| CDNetv2 | 43.63 | 78.56 | 70.33 |
| HRCloudNet | 43.51 | 77.04 | 71.36 |
| KTDA | 51.49 | 83.55 | 60.08 |
| SFA-Net | 74.88 | 91.81 | 84.64 |
| Ours | 82.16 | 94.90 | 89.65 |

Comparison of semantic segmentation performance on the Grass dataset.



| Method | mIoU ↑ | OA ↑ | mF1 ↑ |
|---|---|---|---|
| FCN | 47.47 | 67.85 | 61.99 |
| PSPNet | 47.95 | 69.12 | 62.55 |
| DeepLabV3+ | 47.95 | 68.97 | 62.50 |
| UNet | 48.17 | 69.77 | 62.34 |
| SegFormer | 48.29 | 68.93 | 62.82 |
| Mask2Former | 44.93 | 65.90 | 58.91 |
| DINOv2 | 47.57 | 71.54 | 61.70 |
| KTDA | 50.86 | 74.26 | 65.01 |
| SFA-Net | 51.21 | 71.41 | 65.76 |
| Ours | 51.96 | 72.26 | 66.27 |

BibTeX

@misc{zou2025dynamicdictionarylearningremote,
  title={Dynamic Dictionary Learning for Remote Sensing Image Segmentation},
  author={Xuechao Zou and Yue Li and Shun Zhang and Kai Li and Shiying Wang and Pin Tao and Junliang Xing and Congyan Lang},
  year={2025},
  eprint={2503.06683},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.06683},
}