Related Works

UDA-SST

In real life, annotated data is usually expensive to obtain, especially for dense prediction tasks like semantic segmentation. Therefore, synthetic images with automatically generated labels are often used to train segmentation models. However, models trained on synthetic data transfer poorly to unlabeled real data due to the large domain gap. Unsupervised Domain Adaptation for the Semantic Segmentation Task (UDA-SST) is a promising approach to this problem. Specifically, UDA-SST can be formulated as follows: given labeled source-domain samples ($X_{src}$, $Y_{src}$) and unlabeled target-domain samples ($X_{tgt}$), we want to train a model that produces high-quality segmentation on the target domain. Below we introduce some previous works in UDA-SST.
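To make the setting concrete, below is a minimal self-training sketch of one UDA-SST training step, not any particular method: a supervised loss on the labeled source batch plus a pseudo-label loss on the unlabeled target batch. The function name, the 0.9 confidence threshold, and the use of 255 as the ignore index are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def uda_sst_step(model, x_src, y_src, x_tgt, optimizer, conf_thresh=0.9):
    """One hypothetical UDA-SST training step (sketch, not a specific method)."""
    # Supervised loss on the labeled source domain.
    src_logits = model(x_src)                      # (B, C, H, W)
    loss_src = F.cross_entropy(src_logits, y_src)

    # Pseudo labels for the unlabeled target domain (no gradient through them).
    with torch.no_grad():
        tgt_probs = torch.softmax(model(x_tgt), dim=1)
        conf, pseudo = tgt_probs.max(dim=1)        # (B, H, W) each
    pseudo[conf < conf_thresh] = 255               # drop low-confidence pixels

    tgt_logits = model(x_tgt)
    loss_tgt = F.cross_entropy(tgt_logits, pseudo, ignore_index=255)

    loss = loss_src + loss_tgt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```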

ProDA

ProDA is one of the state-of-the-art methods in UDA-SST. It applies self-training, i.e., the model is trained with pseudo labels generated by itself. However, pseudo labels are inevitably inaccurate; the model may suffer severely from this noise, leading to performance degradation. To address this issue, ProDA utilizes prototypes, i.e., the feature centroid of each class, and rectifies the soft pseudo labels based on the distances to the prototypes. The idea is that if the feature of a pixel is close to a particular class centroid and far away from the others, the pixel is more likely to belong to that class. Hence, the soft pseudo label can be rectified by reweighting it according to the relative distance to each prototype. Meanwhile, the prototypes are updated with an exponential moving average, so the network learns from online-denoised pseudo labels throughout training. Moreover, ProDA conducts multiple distillation stages with self-supervised pretrained weights to further improve performance.
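The sketch below illustrates the prototype-based rectification idea in simplified form; the exact weighting used by ProDA differs, and the function names, the temperature `tau`, and the EMA momentum are assumptions for illustration.

```python
import torch

def rectify_soft_pseudo_label(soft_label, features, prototypes, tau=1.0):
    """Reweight a soft pseudo label by each pixel's relative distance to the
    class prototypes (feature centroids), in the spirit of ProDA.
    soft_label: (B, C, H, W) softmax output of the segmentation model
    features:   (B, D, H, W) pixel-wise features
    prototypes: (C, D)       running class centroids
    """
    B, C, H, W = soft_label.shape
    D = features.shape[1]
    feat = features.permute(0, 2, 3, 1).reshape(-1, D)    # (B*H*W, D)
    # Distance of every pixel feature to every class prototype.
    dist = torch.cdist(feat, prototypes)                  # (B*H*W, C)
    # Pixels relatively close to a prototype get a larger weight for that class.
    weight = torch.softmax(-dist / tau, dim=1)
    weight = weight.reshape(B, H, W, C).permute(0, 3, 1, 2)
    rectified = soft_label * weight
    return rectified / rectified.sum(dim=1, keepdim=True)  # renormalize over classes

def update_prototypes(prototypes, batch_centroids, momentum=0.999):
    """Exponential moving average update of the class prototypes."""
    return momentum * prototypes + (1.0 - momentum) * batch_centroids
```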

DACS

DACS tries to combat the class conflation problem, where the segmentation model cannot distinguish semantically similar classes in the target domain. For example, ‘sidewalk’ is likely to be predicted as ‘road’ since ‘road’ appears more frequently and is thus easier to adapt. DACS proposes a novel data augmentation that mixes target-domain images with source-domain images. Specifically, the pixels of some randomly selected classes in a source-domain image are cut out and pasted onto a target-domain image. The pseudo label of the target-domain image and the ground truth of the source-domain image are mixed in the same way to produce the mixed label. This cross-domain mixing augmentation is simple yet effective: it alleviates class conflation because parts of the mixed label come from the source-domain ground truth, so the network is not severely biased toward the target-domain pseudo labels, which suffer from class conflation.
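A simplified version of this cross-domain mixing is sketched below; the choice of "half of the classes present in the source label" follows the ClassMix-style selection DACS builds on, while the function name and tensor layout are illustrative assumptions.

```python
import torch

def cross_domain_mix(x_src, y_src, x_tgt, pseudo_tgt):
    """Paste the pixels of randomly selected source classes onto the target
    image, and mix the labels the same way (DACS-style mixing, simplified).
    x_src, x_tgt:      (3, H, W) source / target images
    y_src, pseudo_tgt: (H, W)    source ground truth / target pseudo label
    """
    classes = torch.unique(y_src)
    # Randomly select (roughly) half of the classes present in the source label.
    chosen = classes[torch.randperm(len(classes))[: max(1, len(classes) // 2)]]
    # Binary mask: 1 where the source pixel belongs to a chosen class.
    mask = torch.isin(y_src, chosen)

    x_mix = torch.where(mask.unsqueeze(0), x_src, x_tgt)     # mix the images
    y_mix = torch.where(mask, y_src, pseudo_tgt)             # mix the labels identically
    return x_mix, y_mix
```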

CorDA

CorDA takes guidance from self-supervised depth estimation to bridge the domain gap. Depth information is largely domain-invariant and can improve segmentation quality. CorDA adds an auxiliary branch that predicts the depth of the image and shares the encoder with the segmentation branch. The mid-level features of both tasks are fed into a domain-shared task feature correlation module containing two attention modules; by integrating the mid-level features of the other task, each module captures information from the other modality. Finally, the semantic pseudo labels are refined using the depth prediction discrepancy, which approximates the adaptation difficulty; since depth and semantics are correlated, the model benefits from the depth representation. In addition, depth estimation can be learned from image sequences or stereo images, which can be obtained in both the source and target domains.
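The skeleton below only illustrates the shared-encoder, two-head structure with cross-task feature exchange; the class name is hypothetical, and the attention-based task feature correlation module is reduced here to simple 1x1-convolution fusions for brevity, which is a deliberate simplification of the actual design.

```python
import torch
import torch.nn as nn

class SharedEncoderSegDepth(nn.Module):
    """Simplified sketch: a shared encoder with a segmentation head and an
    auxiliary depth head, where each task's features are fused with the
    other task's features (stand-in for the correlation module)."""
    def __init__(self, encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = encoder                                    # shared backbone
        self.seg_branch = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)
        self.depth_branch = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)
        self.seg_fuse = nn.Conv2d(2 * feat_dim, feat_dim, 1)      # cross-task fusion
        self.depth_fuse = nn.Conv2d(2 * feat_dim, feat_dim, 1)
        self.seg_head = nn.Conv2d(feat_dim, num_classes, 1)
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)

    def forward(self, x):
        shared = self.encoder(x)                                  # (B, feat_dim, h, w)
        f_seg = self.seg_branch(shared)                           # mid-level seg features
        f_dep = self.depth_branch(shared)                         # mid-level depth features
        seg_feat = self.seg_fuse(torch.cat([f_seg, f_dep], dim=1))
        dep_feat = self.depth_fuse(torch.cat([f_dep, f_seg], dim=1))
        return self.seg_head(seg_feat), self.depth_head(dep_feat)
```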

Rainbow UDA

Rainbow UDA addresses the certainty inconsistency and performance variation issues that arise in previous ensemble-distillation frameworks when combining multiple UDA-SST models. The proposed method introduces two operations: unification and combinatorial fusion. The unification operation unifies the certainty of the predictions from each teacher model, either by converting them to one-hot hard pseudo labels or by fine-tuning the teacher models, solving the certainty inconsistency issue. The combinatorial fusion operation fuses the predictions of different teachers by assigning each class to a teacher according to per-class performance, addressing the performance variation issue. A compact student model is then trained on the target-domain images and the corresponding fused pseudo labels to distill the knowledge of the teacher models into itself.
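The sketch below captures the flavor of the two operations under simplifying assumptions: unification is shown only as one-hot conversion, and fusion is reduced to "each class comes from one assigned teacher, conflicts resolved by argmax". The function names and the `class_to_teacher` mapping (e.g., derived from per-class IoU on a validation split) are hypothetical.

```python
import torch
import torch.nn.functional as F

def unify_to_hard_label(probs):
    """Unification (sketch): convert a teacher's soft output into a one-hot
    hard pseudo label so all teachers share the same certainty."""
    hard = probs.argmax(dim=1)                                    # (B, H, W)
    return F.one_hot(hard, probs.shape[1]).permute(0, 3, 1, 2).float()

def fuse_teacher_predictions(teacher_onehots, class_to_teacher):
    """Combinatorial fusion (sketch): class c is taken from its assigned teacher.
    teacher_onehots:  list of (B, C, H, W) unified teacher outputs
    class_to_teacher: list of length C mapping class index -> teacher index
    Returns a fused hard pseudo label of shape (B, H, W).
    """
    fused = torch.zeros_like(teacher_onehots[0])
    for c, t in enumerate(class_to_teacher):
        fused[:, c] = teacher_onehots[t][:, c]
    # Pixels claimed by several teachers (or by none) fall back to argmax.
    return fused.argmax(dim=1)
```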