Four papers from the IRIP Lab have been accepted to ACM Multimedia 2023, a premier conference in multimedia! The conference received 3,072 valid submissions and accepted 902 papers, for an acceptance rate of about 29.3%. It will be held in Ottawa, Canada.
Brief introductions to the accepted papers follow:
1. Multi-Modal and Multi-Scale Temporal Fusion Architecture Search for Audio-Visual Video Parsing (Jiayi Zhang, Weixin Li)
The weakly supervised audio-visual video parsing (AVVP) task aims to parse a video into a set of modality-wise events (i.e., audible, visible, or both), recognize the categories of these events, and localize their temporal boundaries. Given the prevalence of both synchronous and asynchronous audio-visual content in multi-modal videos, it is crucial to capture and integrate contextual events occurring at different moments and temporal scales. Although some researchers have made preliminary attempts at modeling event semantics of various temporal lengths, they mostly perform only a late fusion of multi-scale features across modalities; a comprehensive cross-modal, multi-scale temporal fusion strategy remains largely unexplored in the literature. To address this gap, we propose a novel framework named Audio-Visual Fusion Architecture Search (AVFAS) that automatically finds the optimal multi-scale temporal fusion strategy within and between modalities. Our framework generates a set of audio and visual features at distinct temporal scales and employs three modality-wise modules to search for multi-scale feature selection and fusion strategies, jointly modeling modality-specific discriminative information. Furthermore, to better align asynchronous audio-visual content, we introduce a Position- and Length-Adaptive Temporal Attention (PLATA) mechanism for cross-modal feature fusion. Extensive quantitative and qualitative experiments demonstrate the effectiveness and efficiency of our framework.
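To make the PLATA idea concrete, below is a minimal PyTorch sketch of position- and length-adaptive cross-modal temporal attention, in which each audio query predicts the center offset and length of a Gaussian temporal window over the visual sequence. The module name, variable names, and the Gaussian-window formulation are our illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of position- and length-adaptive cross-modal temporal
# attention, loosely following the PLATA idea described above. All names
# here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionLengthAdaptiveAttention(nn.Module):
    """Audio queries attend to visual keys through a Gaussian temporal
    window whose center offset and length are predicted per query."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Predict a center offset (in timesteps) and a window length per query.
        self.window_head = nn.Linear(dim, 2)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, D), visual: (B, Tv, D)
        B, Ta, D = audio.shape
        Tv = visual.size(1)
        q, k, v = self.q_proj(audio), self.k_proj(visual), self.v_proj(visual)
        scores = torch.einsum("btd,bsd->bts", q, k) / D ** 0.5   # (B, Ta, Tv)

        # Per-query window: center = own position + predicted offset,
        # width = predicted length (kept positive with softplus).
        offset, length = self.window_head(audio).unbind(-1)      # (B, Ta) each
        pos = torch.arange(Ta, device=audio.device).float()      # query positions
        center = pos.unsqueeze(0) + offset                       # (B, Ta)
        sigma = F.softplus(length) + 1e-3                        # (B, Ta)
        t = torch.arange(Tv, device=audio.device).float()        # key positions
        # Gaussian bias favoring keys inside the adaptive window, (B, Ta, Tv).
        bias = -((t.view(1, 1, Tv) - center.unsqueeze(-1)) ** 2) / (
            2 * sigma.unsqueeze(-1) ** 2)
        attn = (scores + bias).softmax(dim=-1)
        return attn @ v  # (B, Ta, D): visual context aligned to each audio step

# Example: align 10 visual timesteps to 10 audio timesteps.
plata = PositionLengthAdaptiveAttention(dim=256)
ctx = plata(torch.randn(2, 10, 256), torch.randn(2, 10, 256))  # (2, 10, 256)
```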
2. MIEP: Channel Pruning with Multi-granular Importance Estimation for Object Detection (Liangwei Jiang, Jiaxin Chen, Di Huang, and Yunhong Wang)
Computationally efficient object detection has recently received increasing attention, especially for deployment on resource-constrained devices. This paper investigates compressing a pre-trained deep object detector into a lightweight one via channel pruning, an approach that has proved effective and flexible for improving efficiency. However, most existing methods prune channels based on a single general-purpose criterion, i.e., importance to the task-specific loss. They are prone to over-pruning intermediate layers while leaving large intra-layer redundancy, which severely degrades detection accuracy. To address these issues, we propose a novel channel pruning approach with Multi-granular Importance Estimation (MIEP), consisting of Feature-level Object-sensitive Importance (FOI) and Intra-layer Redundancy-aware Importance (IRI). The former, guided by object features from the pre-trained model, assigns large weights to channels critical for object representation and mitigates over-pruning when combined with the task-specific loss. The latter groups highly correlated channels via clustering and prunes them with priority to reduce redundancy. Extensive experiments on the COCO and VOC benchmarks demonstrate that MIEP clearly outperforms state-of-the-art channel pruning approaches, achieves a better accuracy-efficiency trade-off than lightweight object detectors, and generalizes well to various detection frameworks (e.g., Faster R-CNN and FSAF) and tasks (e.g., classification).
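As a rough illustration of the intra-layer redundancy idea (IRI), the PyTorch sketch below groups highly correlated filters of a convolutional layer by cosine similarity and keeps only the most important channel in each group. The greedy grouping and the L1-norm importance are simplified stand-ins (the paper's FOI would instead weight importance using object-feature guidance), and the function name is hypothetical.

```python
# A minimal sketch of redundancy-aware channel grouping in the spirit of
# MIEP's IRI: cluster correlated filters, keep one channel per cluster.
# Greedy cosine grouping and L1 importance are illustrative stand-ins,
# not the paper's method.
import torch
import torch.nn.functional as F

def redundancy_aware_prune_mask(weight: torch.Tensor,
                                sim_threshold: float = 0.9) -> torch.Tensor:
    """weight: conv kernel of shape (C_out, C_in, kH, kW).
    Returns a boolean mask over output channels (True = keep)."""
    flat = weight.flatten(1)                      # (C_out, C_in*kH*kW)
    sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)
    importance = flat.abs().sum(dim=1)            # stand-in channel importance
    keep = torch.ones(weight.size(0), dtype=torch.bool)
    assigned = torch.zeros(weight.size(0), dtype=torch.bool)
    # Greedily form groups of mutually similar channels; in each group,
    # only the channel with the highest importance survives.
    for c in importance.argsort(descending=True):
        if assigned[c]:
            continue
        group = (sim[c] >= sim_threshold) & ~assigned
        assigned |= group
        keep &= ~group          # drop the whole group ...
        keep[c] = True          # ... except its most important member
    return keep

# Example: prune a random 64-channel 3x3 conv layer.
w = torch.randn(64, 32, 3, 3)
mask = redundancy_aware_prune_mask(w)
print(f"kept {int(mask.sum())}/64 channels")
```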