OICL

Towards Omni-Implicit Counterfactual Learning in Audio-Visual Segmentation

ICCV 2025 Extension

University of Electronic Science and Technology of China

The Hong Kong Polytechnic University

Tongji University

More resources will be released upon acceptance :)

Abstract

Audio-visual segmentation (AVS) requires precise pixel-level localization and classification guided by acoustic cues. However, existing methods often suffer from modality representation discrepancies and spurious correlations, where models over-rely on dominant visual cues rather than authentic audio-visual causality. In this work, we propose Omni-Implicit Counterfactual Learning (OICL), a unified causal inference paradigm for unbiased and fine-grained AVS. To bridge the gap between sparse audio and dense visual features, we first introduce Multi-granularity Implicit Text (MIT). By establishing a shared embedding space across video, segment, and frame levels, MIT provides a semantically grounded medium for cross-modal interaction. Building on this, we develop a dual-pathway counterfactual intervention framework to disentangle causal dependencies. Specifically, Hierarchical Semantic Counterfactual (HSC) intervenes in the semantic space through a fast–slow mechanism, where implicit MIT cues enable rapid local adaptations while MLLM-driven descriptions provide stable contextual reasoning. Symmetrically, Hierarchical Perceptual Counterfactual (HPC) performs modality-aware interventions along the audio and visual pathways, using a similar fast–slow strategy to enhance robustness across diverse perceptual subspaces. These counterfactual constructions are optimized via Distribution-informed Cooperative Contrastive Learning (DCCL), a structured objective that jointly models factual–counterfactual, intra-modal, and inter-modal relationships, thereby suppressing spurious biases while promoting cohesive yet decoupled representations. Extensive experiments on six benchmarks demonstrate that OICL achieves state-of-the-art performance, with superior generalization in few-shot and open-vocabulary scenarios.
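
As a rough illustration of the counterfactual contrastive idea (a minimal sketch, not the actual OICL/DCCL implementation; all function names, shapes, and the temperature value below are assumptions for illustration), one can contrast factual audio-visual embeddings against intervention-perturbed counterfactual ones in an InfoNCE-style objective:

```python
import torch
import torch.nn.functional as F


def counterfactual_contrastive_loss(factual, counterfactual, positives, temperature=0.07):
    """Hypothetical InfoNCE-style objective: pull each factual audio-visual
    embedding toward its positive (e.g., matched semantic/text embedding)
    while pushing it away from counterfactual, intervention-perturbed
    embeddings. All tensors have shape (batch, dim)."""
    f = F.normalize(factual, dim=-1)
    cf = F.normalize(counterfactual, dim=-1)
    pos = F.normalize(positives, dim=-1)

    pos_sim = (f * pos).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    neg_sim = f @ cf.t() / temperature                            # (B, B)

    # Index 0 (the factual-positive pair) is the correct class for every sample.
    logits = torch.cat([pos_sim, neg_sim], dim=1)                 # (B, 1 + B)
    targets = torch.zeros(f.size(0), dtype=torch.long, device=f.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage with random embeddings.
    B, D = 8, 256
    loss = counterfactual_contrastive_loss(
        torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    )
    print(loss.item())
```

The full DCCL objective described in the paper additionally models intra-modal and inter-modal relationships; the sketch above only conveys the factual–counterfactual contrast at a high level.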

Demo Videos

Bassoon

Marimba

Airplane

Harp

Keyboard

Guitar & Man

Accordion & Guitar

Cello & Violin

Cello, Bassoon & Violin

BibTeX

@article{zha2026oicl,
    title={Towards Omni-Implicit Counterfactual Learning in Audio-Visual Segmentation},
    author={Zha, Mingfeng and Wang, Guoqing and Li, Tianyu and Wang, Peng and Pei, Yunqiang and Guo, Jingcai and Yang, Yang and Shen, Heng Tao},
    journal={preprint},
    year={2026},
    url={https://winter-flow.github.io/project/OICL}
}