University of Electronic Science and Technology of China
The Hong Kong Polytechnic University
Tongji University
More resources will be released upon acceptance :)
Audio-visual segmentation (AVS) requires precise pixel-level localization and classification guided by acoustic cues. However, existing methods often suffer from modality representation discrepancies and spurious correlations, where the model may over-rely on visual dominance rather than authentic audio-visual causality. In this work, we propose Omni-Implicit Counterfactual Learning (OICL), a unified causal inference paradigm for unbiased and fine-grained AVS. To bridge the gap between sparse audio and dense visual features, we first introduce Multi-granularity Implicit Text (MIT). By establishing a shared embedding space across the video, segment, and frame levels, MIT provides a semantically grounded medium for cross-modal interaction. Building upon this, we develop a dual-pathway counterfactual intervention framework to disentangle causal dependencies. Specifically, the Hierarchical Semantic Counterfactual (HSC) intervenes in the semantic space through a fast–slow mechanism, where implicit MIT cues enable rapid local adaptation while MLLM-driven descriptions provide stable contextual reasoning. Symmetrically, the Hierarchical Perceptual Counterfactual (HPC) performs modality-aware interventions along the audio and visual pathways, using a similar fast–slow strategy to enhance robustness across diverse perceptual subspaces. These counterfactual constructions are optimized via Distribution-informed Cooperative Contrastive Learning (DCCL), a structured objective that jointly models factual–counterfactual, intra-modal, and inter-modal relationships, thereby suppressing spurious biases while promoting cohesive yet decoupled representations. Extensive experiments on six benchmarks demonstrate that OICL achieves state-of-the-art performance, with superior generalization in few-shot and open-vocabulary scenarios.
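The core idea behind the DCCL objective — contrasting factual audio-visual pairs against their counterfactual (intervened) counterparts — can be illustrated with a minimal InfoNCE-style loss. This is a hypothetical sketch for intuition only, not the paper's actual formulation: the function name, tensor shapes, and single-temperature design are assumptions, and the real DCCL additionally models intra-modal and inter-modal terms.

```python
import numpy as np

def counterfactual_contrastive_loss(factual, counterfactual, positive, temperature=0.1):
    """Toy InfoNCE-style objective (illustrative, not the paper's DCCL).

    Pulls each factual embedding toward its matching cross-modal positive
    and pushes it away from counterfactual (intervened) embeddings, which
    serve as hard negatives.
    """
    def norm(x):
        # L2-normalize each row so dot products become cosine similarities
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    f, p, c = norm(factual), norm(positive), norm(counterfactual)
    pos = np.sum(f * p, axis=-1, keepdims=True) / temperature  # (B, 1) positive logits
    neg = f @ c.T / temperature                                # (B, B) counterfactual negatives
    logits = np.concatenate([pos, neg], axis=1)                # positive sits at column 0
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()                              # cross-entropy toward column 0
```

Minimizing this loss increases the similarity of genuine audio-visual pairs relative to intervened ones, which is the mechanism by which counterfactual contrast suppresses spurious (e.g., visually dominant) correlations.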
Bassoon
Marimba
Airplane
Harp
Keyboard
Guitar & Man
Accordion & Guitar
Cello & Violin
Cello & Bassoon & Violin
@article{zha2025oicl,
title={Towards Omni-Implicit Counterfactual Learning in Audio-Visual Segmentation},
author={Zha, Mingfeng and Wang, Guoqing and Li, Tianyu and Wang, Peng and Pei, Yunqiang and Guo, Jingcai and Yang, Yang and Shen, Heng Tao},
journal={preprint},
year={2026}
}