Textual Query-Driven Mask Transformer for Domain Generalized Segmentation

Agency for Defense Development (ADD)
* Equal Contribution

Accepted to ECCV 2024

Trained on GTA5, our tqdm generalizes well to unseen game videos with extreme domain shifts.

Abstract

In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from the text embeddings of vision-language models. We employ the text embeddings as object queries within a transformer-based segmentation framework (textual object queries). These queries are regarded as a domain-invariant basis for pixel grouping in DGSS. To leverage the power of textual object queries, we introduce a novel framework named the textual query-driven mask transformer (tqdm). Our tqdm aims to (1) generate textual object queries that maximally encode domain-invariant semantics and (2) enhance the semantic clarity of dense visual features. Additionally, we suggest three regularization losses to improve the efficacy of tqdm by aligning visual and textual features. With our method, the model comprehends the inherent semantic information of the classes of interest, enabling it to generalize to extreme domains (e.g., sketch style). Our tqdm achieves 68.9 mIoU on GTA5→Cityscapes, outperforming the prior state-of-the-art method by 2.5 mIoU.

Visual Comparisons

Qualitative Results on GTA5→{Cityscapes, BDD100K, Mapillary}

(Columns: Input, Rein [Wei 2024], Ours, Ground Truth)

Qualitative Results under Extreme Domain Shifts (trained on GTA5)

(Columns: Input, Rein [Wei 2024], Ours)

Comparison with DGSS methods

The backbone symbols denote initialization with CLIP, EVA02-CLIP, and DINOv2 pre-training, respectively. The best and second-best results are highlighted and underlined, respectively. Results marked with † are both trained and tested at an input resolution of 1024×1024.

Domain-Invariant Semantic Knowledge

The text embeddings of the targeted classes (i.e., building, bicycle, and traffic sign) from pre-trained vision-language models (e.g., CLIP) consistently show strong activation in the corresponding regions of images across various domains. We leverage this domain-invariant semantic knowledge from the text embeddings of vision-language models.
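This idea can be probed with a simple similarity check. Below is a minimal PyTorch sketch (not the paper's code) that scores dense visual features against class text embeddings to obtain per-class activation maps; the tensor shapes and the assumption that the visual features are already projected into the text embedding space are illustrative.

    import torch
    import torch.nn.functional as F

    def class_activation_maps(pixel_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        """Cosine-similarity activation of class text embeddings over dense visual features.

        pixel_feats: (B, C, H, W) dense visual features, assumed projected to the text embedding space.
        text_embeds: (K, C) class text embeddings, e.g., from a CLIP text encoder.
        Returns (B, K, H, W) activation maps, one per class.
        """
        v = F.normalize(pixel_feats, dim=1)         # unit-norm visual features
        t = F.normalize(text_embeds, dim=1)         # unit-norm text embeddings
        return torch.einsum("bchw,kc->bkhw", v, t)  # per-pixel, per-class cosine similarity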

Overall Pipeline of tqdm

Step 1. We generate initial textual object queries \( \textbf{q}^0_\textbf{t} \) from the \( K \) class text embeddings \( \{\textbf{t}_k\}_{k=1}^K \).
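As a rough illustration of this step, the sketch below maps frozen class text embeddings to initial queries with a learnable projection; the module name, dimensions, and the use of a single linear layer are assumptions, not the paper's exact design.

    import torch
    import torch.nn as nn

    class TextualQueryInit(nn.Module):
        """Map K frozen class text embeddings to initial textual object queries (sketch)."""
        def __init__(self, text_dim: int = 512, query_dim: int = 256, num_classes: int = 19):
            super().__init__()
            # Frozen text embeddings t_k, e.g., from a CLIP text encoder (random placeholder here).
            self.register_buffer("text_embeds", torch.randn(num_classes, text_dim))
            # Learnable projection into the decoder's query space.
            self.proj = nn.Linear(text_dim, query_dim)

        def forward(self) -> torch.Tensor:
            # q_t^0: (K, query_dim) initial textual object queries.
            return self.proj(self.text_embeds)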

Step 2. To improve the segmentation capabilities of these queries, we incorporate text-to-pixel attention within the pixel decoder. This process enhances the semantic clarity of pixel features, while reconstructing high-resolution per-pixel embeddings \( \textbf{Z} \).
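A minimal sketch of such a text-to-pixel cross-attention block is given below, assuming flattened pixel features attend to the textual queries as keys and values; the residual-plus-norm layout and dimensions are illustrative choices, not the paper's exact module.

    import torch
    import torch.nn as nn

    class TextToPixelAttention(nn.Module):
        """Cross-attention letting pixel features attend to textual queries (sketch)."""
        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, pixel_feats: torch.Tensor, text_queries: torch.Tensor) -> torch.Tensor:
            # pixel_feats: (B, H*W, C) flattened pixel features inside the pixel decoder.
            # text_queries: (B, K, C) textual object queries.
            attended, _ = self.attn(query=pixel_feats, key=text_queries, value=text_queries)
            # Residual connection + norm keeps the original spatial detail while
            # injecting class-level semantics into each pixel feature.
            return self.norm(pixel_feats + attended)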

Step 3. The transformer decoder refines these queries for the final prediction. Each prediction output is then assigned to its corresponding ground truth (GT) through fixed matching, ensuring that each query consistently represents the semantic information of one class.
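The fixed matching can be pictured with the sketch below, where the k-th mask prediction is always supervised by the binary ground-truth mask of class k (in contrast to Hungarian matching); the binary cross-entropy loss and its normalization are illustrative choices, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def fixed_matching_loss(pred_mask_logits, gt_semantic, num_classes, ignore_index=255):
        """Fixed matching: the k-th query's mask is supervised by the binary GT mask of class k (sketch)."""
        # pred_mask_logits: (B, K, H, W) mask logits, one map per textual query.
        # gt_semantic: (B, H, W) semantic labels.
        valid = (gt_semantic != ignore_index).unsqueeze(1).float()           # (B, 1, H, W)
        gt_onehot = torch.stack(
            [(gt_semantic == k).float() for k in range(num_classes)], dim=1  # (B, K, H, W)
        )
        loss = F.binary_cross_entropy_with_logits(
            pred_mask_logits, gt_onehot, reduction="none"
        )
        # Average over valid pixels and classes.
        return (loss * valid).sum() / (valid.sum().clamp(min=1.0) * num_classes)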

BibTeX


    @article{pak2024textual,
      title     = {Textual Query-Driven Mask Transformer for Domain Generalized Segmentation},
      author    = {Pak, Byeonghyun and Woo, Byeongju and Kim, Sunghwan and Kim, Dae-hwan and Kim, Hoseong},
      journal   = {arXiv preprint arXiv:2407.09033},
      year      = {2024}
    }