Toward Stable Semi-Supervised Remote Sensing
Segmentation via Co-Guidance and Co-Fusion

¹Qinghai University, ²Beijing Jiaotong University, ³Tsinghua University

Overall Results Comparison

Abstract

Semi-supervised remote sensing (RS) image semantic segmentation offers a promising way to reduce the burden of exhaustive annotation, yet it struggles with pseudo-label drift, a phenomenon in which confirmation bias causes errors to accumulate during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models, initialized with pretrained CLIP and DINOv3 respectively, to mitigate error accumulation and pseudo-label drift. To incorporate these distinct priors, we introduce an explicit-implicit semantic co-guidance mechanism that uses text embeddings to provide explicit class-level guidance and learnable queries to provide implicit class-level guidance, jointly enhancing semantic consistency. Furthermore, we develop a global-local feature collaborative fusion strategy that fuses the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate precise segmentation results. Extensive experiments on six popular datasets demonstrate the strength of the proposed method, which achieves the best or second-best mIoU under every reported partition protocol and scenario.

Methodology

Co2S Framework Overview

Overview of the proposed Co2S framework. It integrates a CLIP-based student (top) using text embeddings for explicit semantic guidance and a DINOv3-based student (bottom) using learnable queries for implicit guidance. For unlabeled data, the global-local collaborative fusion strategy enforces training stability by arbitrating supervision based on pixel-wise confidence.
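
The caption above says that supervision on unlabeled data is arbitrated by pixel-wise confidence, but it does not give the exact rule. The following is only a minimal sketch of one plausible reading: for each pixel, the pseudo-label comes from whichever student is more confident, and pixels where neither student is confident are ignored. The function name `arbitrate`, the `threshold` value, the `IGNORE` label, and the tie-break are all assumptions made for illustration and are not taken from the paper.

```python
IGNORE = 255  # hypothetical "no supervision" label (assumption, not from the paper)


def arbitrate(probs_a, probs_b, threshold=0.9):
    """Pick, per pixel, the pseudo-label of the more confident student.

    probs_a, probs_b: one list of class probabilities per pixel, e.g. from the
    CLIP-based student (a) and the DINOv3-based student (b).
    Returns (labels, sources); sources[i] is "a", "b", or None if ignored.
    """
    labels, sources = [], []
    for pa, pb in zip(probs_a, probs_b):
        conf_a, conf_b = max(pa), max(pb)
        # Ties go to student "a"; this tie-break is an arbitrary choice.
        if conf_a >= conf_b:
            winner, probs, conf = "a", pa, conf_a
        else:
            winner, probs, conf = "b", pb, conf_b
        if conf < threshold:
            # Neither student is confident enough: skip this pixel.
            labels.append(IGNORE)
            sources.append(None)
        else:
            labels.append(probs.index(conf))  # argmax of the winning student
            sources.append(winner)
    return labels, sources


# Three pixels, three classes (made-up probabilities).
probs_clip = [[0.95, 0.03, 0.02], [0.40, 0.35, 0.25], [0.10, 0.60, 0.30]]
probs_dino = [[0.70, 0.20, 0.10], [0.05, 0.92, 0.03], [0.20, 0.50, 0.30]]
labels, sources = arbitrate(probs_clip, probs_dino)
print(labels, sources)
```

In this toy input the first pixel is labeled by student a, the second by student b, and the third is ignored because neither student reaches the assumed threshold.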

Attention Visualization

CLIP Guidance Detail

Visualization of attention maps from different heads of the CLIP image encoder (a-c) and DINOv3 backbone (d-f).

Experimental Results

Comparison with state-of-the-art methods across six datasets.
(Bold: Best, Underline: Second Best)
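
The tables in this section report intersection-over-union (IoU) per class and its mean (mIoU), in percent. The page does not describe the evaluation code, so the following is only a minimal sketch assuming the standard definition: per-class IoU is intersection divided by union of predicted and ground-truth pixels, and mIoU is the unweighted mean over classes. The helper names `iou_per_class` and `mean_iou` are hypothetical.

```python
def iou_per_class(pred, target, num_classes):
    """IoU per class over flat label lists; None if a class is absent from both."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious


def mean_iou(ious):
    """Unweighted mean over the classes that have a defined IoU."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)


# Toy example: six pixels, three classes (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
toy_ious = iou_per_class(pred, target, num_classes=3)
toy_miou = mean_iou(toy_ious)

# Consistency check against the Co2S row of the LoveDA 1/40 table:
# averaging its seven per-class IoUs reproduces the reported mIoU.
co2s_loveda_1_40 = [51.6, 59.7, 58.3, 76.5, 35.9, 61.5, 64.2]
row_miou = round(mean_iou(co2s_loveda_1_40), 1)
print(toy_miou, row_miou)
```

The last lines average the seven per-class values of the Co2S row for LoveDA with 1/40 labeled images, which gives 58.2 and matches the mIoU column of that row.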

Table I: Segmentation results (mIoU %) on WHDLD dataset.
Method       1/24     1/16     1/8      1/4
OnlySup      53.6     55.2     58.0     60.5
FixMatch     56.0     57.6     59.8     60.8
U2PL         57.1     58.3     59.9     61.1
WSCL         56.8     58.6     59.8     61.1
UniMatch     57.4     58.8     60.4     61.5
DWL          56.5     57.0     57.1     58.9
MUCA         56.5     58.2     60.0     60.5
Co2S (Ours)  61.1     61.5     62.2     62.6

Table II: LoveDA (1/40 Labeled Images)
Method       Backgr.  Build.   Road     Water    Barren   Forest   Agri.    mIoU
OnlySup      48.0     37.5     36.1     55.3     27.3     53.2     57.1     45.9
FixMatch     50.3     52.1     53.3     69.9     28.7     56.4     62.3     53.3
U2PL         47.4     47.5     50.9     69.8     30.0     56.5     63.8     52.3
WSCL         52.9     52.1     62.8     68.1     31.2     57.7     63.8     54.1
UniMatch     52.8     55.9     54.6     70.1     30.2     59.3     65.5     55.5
DWL          50.4     57.8     57.7     75.7     36.4     58.5     65.1     57.4
MUCA         49.2     50.8     51.0     73.5     33.9     58.9     64.7     54.6
Co2S (Ours)  51.6     59.7     58.3     76.5     35.9     61.5     64.2     58.2

Table II: LoveDA (1/16 Labeled Images)
Method       Backgr.  Build.   Road     Water    Barren   Forest   Agri.    mIoU
OnlySup      50.8     51.0     49.5     65.9     33.5     57.1     63.6     53.1
FixMatch     55.5     58.4     55.4     72.2     37.4     59.3     66.9     57.9
U2PL         53.5     58.3     56.6     72.6     38.0     59.9     67.3     58.0
WSCL         53.8     55.8     55.2     71.6     37.5     61.3     66.6     57.4
UniMatch     55.7     59.2     57.0     72.4     37.7     60.9     67.6     58.6
DWL          52.7     60.6     59.1     73.7     36.0     62.9     68.2     59.0
MUCA         52.6     54.1     54.5     75.0     34.1     61.2     65.0     56.7
Co2S (Ours)  54.0     62.2     59.0     76.3     40.0     63.6     67.6     60.4

Table II: LoveDA (1/8 Labeled Images)
Method       Backgr.  Build.   Road     Water    Barren   Forest   Agri.    mIoU
OnlySup      52.8     56.4     53.4     69.8     38.0     59.7     66.2     56.6
FixMatch     55.9     60.9     58.8     73.1     40.6     62.3     69.2     60.1
U2PL         54.8     60.9     59.4     72.0     40.3     61.6     68.4     59.6
WSCL         56.6     60.0     58.5     72.9     38.6     61.1     68.0     59.4
UniMatch     56.8     61.1     59.6     73.3     39.4     61.7     68.3     60.0
DWL          56.2     61.5     60.0     78.9     47.1     66.6     69.4     62.8
MUCA         55.5     57.4     58.2     77.3     38.3     65.4     69.2     60.2
Co2S (Ours)  57.2     63.8     60.4     79.3     42.3     66.7     69.2     62.7

Table II: LoveDA (1/4 Labeled Images)
Method       Backgr.  Build.   Road     Water    Barren   Forest   Agri.    mIoU
OnlySup      55.7     58.8     57.9     72.8     42.3     53.2     68.5     58.9
FixMatch     56.9     61.9     60.6     73.8     43.3     64.2     69.8     61.5
U2PL         57.4     61.5     60.6     74.6     42.9     64.3     69.2     61.5
WSCL         57.6     61.3     59.3     74.3     43.4     63.2     68.6     61.1
UniMatch     58.0     61.1     60.8     74.2     44.6     63.5     69.8     61.7
DWL          56.5     61.8     60.6     79.3     44.5     68.8     72.4     63.4
MUCA         55.9     61.8     61.1     78.0     44.8     68.3     71.6     63.1
Co2S (Ours)  57.0     63.4     61.0     79.1     48.3     67.8     71.2     64.0

Table III: Segmentation results (mIoU %) on Potsdam dataset.
Method       1/32     1/16     1/8      1/4
OnlySup      61.5     64.9     70.2     73.8
FixMatch     69.4     71.5     73.5     75.7
U2PL         67.2     71.3     73.3     75.9
WSCL         69.9     72.3     73.9     75.4
UniMatch     70.7     72.6     74.8     76.4
DWL          74.2     78.3     79.8     80.3
MUCA         66.9     71.5     76.0     79.2
Co2S (Ours)  74.3     76.6     79.8     80.2

Table IV: MER (1/8 Labeled Images)
Method       Soil     Sands    Gravel   Bedrock  Rocks    Tracks   Shadows  Unk.     Bg.      mIoU
OnlySup      4.7      55.1     71.4     61.7     38.0     49.5     34.2     73.0     87.0     52.8
FixMatch     2.0      53.6     73.5     65.8     40.0     54.8     31.5     72.4     90.4     53.8
U2PL         5.2      51.8     73.2     70.5     45.7     54.6     26.2     73.8     93.2     54.9
WSCL         3.5      54.8     69.9     62.8     36.0     56.9     26.1     76.0     80.9     51.9
UniMatch     5.5      54.4     72.7     68.5     44.7     55.2     38.0     73.8     91.1     56.0
DWL          5.9      51.0     74.9     67.2     41.4     61.9     28.3     69.7     90.4     54.5
MUCA         5.3      56.0     72.6     60.5     36.1     52.9     34.5     67.7     87.8     52.6
Co2S (Ours)  3.0      56.3     73.1     67.9     37.0     66.3     36.5     79.7     91.3     56.8

Table IV: MER (1/4 Labeled Images)
Method       Soil     Sands    Gravel   Bedrock  Rocks    Tracks   Shadows  Unk.     Bg.      mIoU
OnlySup      21.6     56.3     74.4     66.3     38.0     52.3     22.1     74.5     92.2     55.3
FixMatch     19.4     55.4     74.0     57.1     43.6     57.3     39.8     75.7     90.9     58.1
U2PL         7.1      56.4     73.7     69.7     41.9     60.8     23.3     81.4     90.3     56.1
WSCL         8.2      52.8     71.1     65.5     35.7     61.0     31.4     79.0     89.0     54.9
UniMatch     19.7     54.6     74.3     69.3     44.5     54.9     37.4     77.4     90.5     58.1
DWL          13.8     55.9     74.8     70.3     36.2     57.9     28.8     66.0     89.7     54.7
MUCA         1.0      55.1     73.8     65.6     38.4     56.3     28.6     73.7     84.9     53.0
Co2S (Ours)  13.1     63.3     76.1     69.4     37.1     63.1     41.1     77.1     91.4     59.1

Table V: MSL (1/8 Labeled Images)
Method       Soil     Sands    Gravel   Bedrock  Rocks    Tracks   Shadows  Unk.     Bg.      mIoU
OnlySup      26.8     61.6     66.8     72.2     37.9     45.5     37.0     49.1     85.1     53.6
FixMatch     31.0     65.4     69.6     75.0     40.4     49.6     39.8     54.2     86.5     56.8
U2PL         26.7     64.7     70.3     73.3     39.8     43.1     40.3     53.1     87.7     55.5
WSCL         25.8     65.0     68.3     73.9     38.8     69.6     39.6     57.0     86.0     58.2
UniMatch     29.3     66.7     69.8     75.4     39.2     50.8     43.0     57.9     89.2     57.9
DWL          33.5     65.9     71.2     74.7     44.7     48.8     44.8     56.3     90.9     59.0
MUCA         15.4     63.9     67.0     71.5     43.1     41.0     41.0     55.9     82.9     52.4
Co2S (Ours)  36.7     68.4     71.7     76.0     42.6     58.9     44.1     60.9     89.3     60.9

Table V: MSL (1/4 Labeled Images)
Method       Soil     Sands    Gravel   Bedrock  Rocks    Tracks   Shadows  Unk.     Bg.      mIoU
OnlySup      32.2     65.0     69.0     75.1     43.6     57.6     44.0     57.9     86.6     59.0
FixMatch     29.7     65.5     69.9     77.1     42.5     63.8     48.1     60.7     87.5     60.5
U2PL         36.1     62.7     72.7     76.3     43.3     63.7     44.2     65.6     90.1     61.6
WSCL         30.9     64.7     70.4     75.5     42.0     58.3     42.8     65.0     89.7     59.9
UniMatch     32.2     67.7     72.6     77.0     42.5     71.8     49.1     61.9     90.8     62.8
DWL          33.8     68.2     72.3     76.1     43.2     57.9     48.2     70.7     91.9     63.6
MUCA         33.7     69.3     72.9     76.9     46.2     40.4     49.9     55.6     89.2     59.3
Co2S (Ours)  36.0     67.7     71.8     77.8     45.5     77.0     49.2     76.7     91.7     65.9

Table VI: Segmentation results (mIoU %) on GID-15 dataset.
Method       1/8      1/4
OnlySup      67.5     74.1
FixMatch     70.0     74.8
U2PL         67.2     75.3
WSCL         72.8     75.4
UniMatch     73.9     75.9
DWL          72.6     76.9
MUCA         67.4     72.6
Co2S (Ours)  75.4     77.7

Citation

@article{Co2S,
  title={Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion},
  author={Zhou, Yi and Zou, Xuechao and Zhang, Shun and Li, Kai and Wang, Shiying and Chen, Jingming and Lang, Congyan and Cao, Tengfei and Tao, Pin and Shi, Yuanchun},
  year={2025},
}