OICL

Towards Omni-Implicit Counterfactual Learning in Audio-Visual Segmentation

ICCV 2025 Extension

University of Electronic Science and Technology of China

The Hong Kong Polytechnic University

Tongji University

More resources will be released upon acceptance :)

Abstract

Audio-visual segmentation (AVS) requires precise pixel-level localization and classification guided by acoustic cues. However, existing methods often suffer from modality representation discrepancies and spurious correlations, where models over-rely on dominant visual cues rather than authentic audio-visual causality. In this work, we propose Omni-Implicit Counterfactual Learning (OICL), a unified causal inference paradigm for unbiased and fine-grained AVS. To bridge the gap between sparse audio and dense visual features, we first introduce Multi-granularity Implicit Text (MIT). By establishing a shared embedding space across video, segment, and frame levels, MIT provides a semantically grounded medium for cross-modal interaction. Building on this, we develop a dual-pathway counterfactual intervention framework to disentangle causal dependencies. Specifically, Hierarchical Semantic Counterfactual (HSC) intervenes in the semantic space through a fast–slow mechanism, where implicit MIT cues enable rapid local adaptations while MLLM-driven descriptions provide stable contextual reasoning. Symmetrically, Hierarchical Perceptual Counterfactual (HPC) performs modality-aware interventions along the audio and visual pathways, using a similar fast–slow strategy to enhance robustness across diverse perceptual subspaces. These counterfactual constructions are optimized via Distribution-informed Cooperative Contrastive Learning (DCCL), a structured objective that jointly models factual–counterfactual, intra-modal, and inter-modal relationships, thereby suppressing spurious biases while promoting cohesive yet decoupled representations. Extensive experiments on six benchmarks demonstrate that OICL achieves state-of-the-art performance, with superior generalization in few-shot and open-vocabulary scenarios.
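
As a rough illustration of the counterfactual contrastive idea (a minimal sketch, not the actual OICL/DCCL implementation; all function names, shapes, and the temperature value below are assumptions for illustration), one can contrast factual audio-visual embeddings against intervention-perturbed counterfactual ones in an InfoNCE-style objective:

```python
import torch
import torch.nn.functional as F


def counterfactual_contrastive_loss(factual, counterfactual, positives, temperature=0.07):
    """Hypothetical InfoNCE-style objective: pull each factual audio-visual
    embedding toward its positive (e.g., matched semantic/text embedding)
    while pushing it away from counterfactual, intervention-perturbed
    embeddings. All tensors have shape (batch, dim)."""
    f = F.normalize(factual, dim=-1)
    cf = F.normalize(counterfactual, dim=-1)
    pos = F.normalize(positives, dim=-1)

    pos_sim = (f * pos).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    neg_sim = f @ cf.t() / temperature                            # (B, B)

    # Index 0 (the factual-positive pair) is the correct class for every sample.
    logits = torch.cat([pos_sim, neg_sim], dim=1)                 # (B, 1 + B)
    targets = torch.zeros(f.size(0), dtype=torch.long, device=f.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage with random embeddings.
    B, D = 8, 256
    loss = counterfactual_contrastive_loss(
        torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    )
    print(loss.item())
```

The full DCCL objective described in the paper additionally models intra-modal and inter-modal relationships; the sketch above only conveys the factual–counterfactual contrast at a high level.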

Demo Videos

Bassoon

Marimba

Airplane

Harp

Keyboard

Guitar & Man

Accordion & Guitar

Cello & Violin

Cello, Bassoon & Violin

BibTeX

@article{zha2026oicl,
    title={Towards Omni-Implicit Counterfactual Learning in Audio-Visual Segmentation},
    author={Zha, Mingfeng and Wang, Guoqing and Li, Tianyu and Wang, Peng and Pei, Yunqiang and Guo, Jingcai and Yang, Yang and Shen, Heng Tao},
    journal={preprint},
    year={2026},
    url={https://winter-flow.github.io/project/OICL}
}