OICL

Towards Omni-Implicit Counterfactual Learning in Audio-Visual Segmentation

Submitted to TPAMI (Extension of ICCV 2025)




Mingfeng Zha, Guoqing Wang*, Tianyu Li, Peng Wang, Yunqiang Pei, Jingcai Guo, Yang Yang, Heng Tao Shen

University of Electronic Science and Technology of China

The Hong Kong Polytechnic University

Tongji University





More resources will be released upon acceptance :)

Abstract

Audio-visual segmentation (AVS) requires precise pixel-level localization and classification guided by acoustic cues. However, existing methods often suffer from modality representation discrepancies and spurious correlations, where models tend to over-rely on dominant visual cues rather than genuine audio-visual causality. In this work, we propose Omni-Implicit Counterfactual Learning (OICL), a unified causal inference paradigm for unbiased and fine-grained AVS. To bridge the gap between sparse audio and dense visual features, we first introduce Multi-granularity Implicit Text (MIT). By establishing a shared embedding space across video, segment, and frame levels, MIT provides a semantically grounded medium for cross-modal interaction. Building on this, we develop a dual-pathway counterfactual intervention framework to disentangle causal dependencies. Specifically, Hierarchical Semantic Counterfactual (HSC) intervenes in the semantic space through a fast–slow mechanism, where implicit MIT cues enable rapid local adaptation while MLLM-driven descriptions provide stable contextual reasoning. Symmetrically, Hierarchical Perceptual Counterfactual (HPC) performs modality-aware interventions along the audio and visual pathways, using a similar fast–slow strategy to enhance robustness across diverse perceptual subspaces. These counterfactual constructions are optimized via Distribution-informed Cooperative Contrastive Learning (DCCL), a structured objective that jointly models factual–counterfactual, intra-modal, and inter-modal relationships, thereby suppressing spurious biases while promoting cohesive yet decoupled representations. Extensive experiments on six benchmarks demonstrate that OICL achieves state-of-the-art performance, with superior generalization in few-shot and open-vocabulary scenarios.
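
To make the DCCL objective described above more concrete, below is a minimal sketch of one way a cooperative contrastive loss over factual–counterfactual, intra-modal, and inter-modal pairs could be written in PyTorch. The function names, loss weights, and the InfoNCE/hinge formulation are illustrative assumptions for exposition only, not the authors' released implementation.

import torch
import torch.nn.functional as F


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE with one positive per anchor and a shared pool of negatives."""
    anchor = F.normalize(anchor, dim=-1)          # (B, D)
    positive = F.normalize(positive, dim=-1)      # (B, D)
    negatives = F.normalize(negatives, dim=-1)    # (K, D)
    pos = (anchor * positive).sum(-1, keepdim=True) / temperature   # (B, 1)
    neg = anchor @ negatives.t() / temperature                      # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)


def repel(factual, counterfactual, margin=0.5):
    """Hinge-style repulsion: penalize factual features that stay too close to
    their per-sample counterfactual variants within the same modality."""
    sim = F.cosine_similarity(factual, counterfactual, dim=-1)      # (B,)
    return F.relu(sim - margin).mean()


def cooperative_contrastive_loss(audio_fact, visual_fact, audio_cf, visual_cf,
                                 w_fc=1.0, w_intra=0.5, w_inter=1.0):
    """Weighted sum of three illustrative terms:
    (1) factual-counterfactual: factual audio-visual pairs attract while
        counterfactual features of either modality serve as negatives;
    (2) intra-modal: factual features are pushed away from their own
        counterfactual variants in each modality;
    (3) inter-modal: symmetric audio-to-visual and visual-to-audio alignment."""
    l_fc = info_nce(audio_fact, visual_fact,
                    torch.cat([audio_cf, visual_cf], dim=0))
    l_intra = repel(audio_fact, audio_cf) + repel(visual_fact, visual_cf)
    l_inter = info_nce(audio_fact, visual_fact, visual_cf) + \
              info_nce(visual_fact, audio_fact, audio_cf)
    return w_fc * l_fc + w_intra * l_intra + w_inter * l_inter


# Example usage with random features (batch of 4 clips, 256-dim embeddings).
if __name__ == "__main__":
    B, D = 4, 256
    loss = cooperative_contrastive_loss(torch.randn(B, D), torch.randn(B, D),
                                        torch.randn(B, D), torch.randn(B, D))
    print(loss.item())

In this hypothetical sketch, the weights w_fc, w_intra, and w_inter simply trade off the three relationship types; how the actual method balances or schedules them is described in the paper rather than here.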


Demo Videos

Bassoon

Marimba

Airplane

Harp

Keyboard

Guitar & Man

Accordion & Guitar

Cello & Violin

Cello & Bassoon & Violin


BibTeX

@article{zha2025oicl,
  title={Towards Omni-Implicit Counterfactual Learning in Audio-Visual Segmentation},
  author={Zha, Mingfeng and Wang, Guoqing and Li, Tianyu and Wang, Peng and Pei, Yunqiang and Guo, Jingcai and Yang, Yang and Shen, Heng Tao},
  journal={preprint},
  year={2026}
}