Seven papers from the IRIP Lab have been accepted to AAAI 2025 this year. AAAI (the AAAI Conference on Artificial Intelligence), organized by the Association for the Advancement of Artificial Intelligence, is one of the top international academic conferences in the field of artificial intelligence. This year's conference received 12,957 valid submissions and accepted only 3,032 papers, for an acceptance rate of 23.4%.
Brief introductions to the accepted papers are as follows:
1.Unified Knowledge Maintenance Pruning and Progressive Recovery with Weight Recalling for Large Vision-Language Models. (Zimeng Wu, Jiaxin Chen, and Yunhong Wang)
The continual expansion of both modality branches in large vision-language models (LVLMs) has significantly increased their storage and computational burdens. This paper focuses on developing a unified structured pruning method for LVLMs to facilitate their deployment in resource-constrained environments. However, existing methods are primarily designed for large language models (LLMs) and yield suboptimal results when applied to LVLMs. We identify three key limitations behind this issue: 1) the absence of handling for the global imbalance in parameter importance, 2) the neglect of calibration state changes in LVLMs, and 3) the oversight of the evolving capability requirements of large models. To address these issues, we propose a novel structured pruning method comprising a unified knowledge-maintenance importance metric for pruning and a LoRA-based progressive distillation process for recovery. In the pruning stage, the parameter importance metric employs adaptive normalization to balance both block-wise and modality-wise discrepancies, refines the gradient-based Taylor importance criterion via sub-task selection, and incorporates angle distribution entropy to maintain knowledge capacity. During the recovery stage, we introduce a weight recalling module to reuse the knowledge embedded in pruned parameters and integrate a progressive distillation strategy to achieve more efficient and comprehensive recovery with limited data. Extensive experiments against state-of-the-art structured pruning methods demonstrate the effectiveness of our approach; in particular, it effectively mitigates model collapse under high pruning ratios.
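As a rough illustration of the kind of importance scoring the pruning stage builds on, the sketch below computes a first-order Taylor importance per output channel of a linear layer and then normalizes scores per module, so that modules (or modalities) with very different score magnitudes become comparable before global ranking. The module names, the z-score normalization, and the toy calibration pass are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of first-order Taylor channel importance with per-module
# normalization; loosely inspired by the abstract, not the authors' code.
import torch
import torch.nn as nn

def taylor_channel_importance(linear: nn.Linear) -> torch.Tensor:
    """|w * grad| summed over the input dim gives a per-output-channel score."""
    assert linear.weight.grad is not None, "run a backward pass on calibration data first"
    return (linear.weight * linear.weight.grad).abs().sum(dim=1)

def normalized_importance(modules: dict[str, nn.Linear]) -> dict[str, torch.Tensor]:
    """Normalize each module's scores to zero mean / unit variance so that
    modules with very different magnitudes can share one global threshold."""
    scores = {}
    for name, m in modules.items():
        s = taylor_channel_importance(m)
        scores[name] = (s - s.mean()) / (s.std() + 1e-8)
    return scores

if __name__ == "__main__":
    # Toy calibration pass: a forward/backward on random data populates gradients.
    vision_proj, text_proj = nn.Linear(16, 32), nn.Linear(16, 32)
    x = torch.randn(4, 16)
    (vision_proj(x).sum() + text_proj(x).sum()).backward()
    scores = normalized_importance({"vision_proj": vision_proj, "text_proj": text_proj})
    print({k: v.shape for k, v in scores.items()})
```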
2.Micro-macro Wavelet-based Gaussian Splatting for 3D Reconstruction from Unconstrained Images. (Yihui Li, Chengxin Lv, Hongyu Yang, and Di Huang)
3D reconstruction from unconstrained image collections is highly challenging, because appearance variations and transient occlusions contradict the static-scene assumption commonly adopted by existing methods. This paper proposes a novel approach called Micro-macro Wavelet-based Gaussian Splatting (MW-GS), which enhances 3D reconstruction by decomposing the scene representation into global, fine-grained, and intrinsic components. The proposed method introduces two key innovations: micro-macro projection, which allows Gaussian points to capture details and enhance diversity from multi-scale feature maps; and wavelet-based multi-scale sampling, which leverages frequency-domain information to refine feature representations and significantly improve the modeling of scene appearance. Additionally, a hierarchical residual fusion network is incorporated to seamlessly integrate these components. Extensive experiments demonstrate that MW-GS achieves state-of-the-art rendering performance, surpassing existing methods.
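As a toy illustration of the frequency split that wavelet-based multi-scale sampling draws on, the snippet below performs a single-level Haar decomposition of a feature map into one approximation band and three detail bands; it is a generic sketch with a random stand-in feature map, not the MW-GS code.

```python
# Minimal Haar wavelet decomposition of a (B, C, H, W) feature map into
# low-frequency (approximation) and high-frequency (detail) bands.
import torch

def haar_dwt2(x: torch.Tensor):
    """x: (B, C, H, W) with even H, W. Returns four (B, C, H/2, W/2) bands."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2   # approximation (coarse appearance)
    lh = (a + b - c - d) / 2   # detail band
    hl = (a - b + c - d) / 2   # detail band
    hh = (a - b - c + d) / 2   # detail band
    return ll, lh, hl, hh

if __name__ == "__main__":
    feat = torch.randn(1, 8, 64, 64)   # random stand-in for an encoder feature map
    ll, lh, hl, hh = haar_dwt2(feat)
    print(ll.shape, hh.shape)          # torch.Size([1, 8, 32, 32]) twice
```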
3.Unveiling the Knowledge of CLIP for Training-Free Open-Vocabulary Semantic Segmentation. (Yajie Liu, Guodong Wang, Jinjin Zhang, Qingjie Liu, and Di Huang)
Training-free open-vocabulary semantic segmentation aims to explore the potential of frozen vision-language models for segmentation tasks. Recent works reformulate the inference process of CLIP and utilize features from the final layer to reconstruct dense representations for segmentation, demonstrating promising performance. However, the final layer tends to prioritize global components over local representations, leading to suboptimal robustness and effectiveness in existing methods. In this paper, we propose CLIPSeg, a novel training-free framework that fully exploits the diverse knowledge across layers in CLIP for dense prediction. Our study unveils two key findings. First, compared with the final layer, features in the middle layers exhibit higher locality awareness and feature coherence, based on which we propose a coherence-enhanced residual attention module that generates semantic-aware attention. Second, despite not being directly aligned with the text, the deep layers capture valid local semantics that complement those in the final layer. Leveraging this insight, we introduce a deep semantic integration module to boost the patch semantics in the final block. Experiments conducted on 9 segmentation benchmarks with various CLIP models demonstrate that CLIPSeg consistently outperforms all training-free methods by substantial margins, e.g., a 7.8% improvement in average mIoU for CLIP with a ViT-L backbone, and competes with learning-based counterparts in generalizing to novel concepts in an efficient way.
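The sketch below gives a rough, simplified picture of the general idea of reusing a middle transformer layer's patch-to-patch affinities to re-aggregate the final layer's patch features for dense prediction. The layer index, the cosine-affinity attention, and the temperature are illustrative assumptions rather than the paper's actual modules, and the snippet requires the transformers package (it downloads openai/clip-vit-base-patch32).

```python
# Illustrative only: middle-layer patch affinities refine final-layer patch features.
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()
pixel_values = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image

with torch.no_grad():
    out = model(pixel_values=pixel_values, output_hidden_states=True)

mid = out.hidden_states[6][:, 1:]     # patch tokens of a middle layer (index is an assumption)
last = out.last_hidden_state[:, 1:]   # patch tokens of the final layer

# Cosine affinity between middle-layer patches, used as semantically coherent attention.
mid_n = torch.nn.functional.normalize(mid, dim=-1)
affinity = torch.softmax(mid_n @ mid_n.transpose(1, 2) / 0.07, dim=-1)

# Re-aggregate final-layer patch features with the middle-layer affinity.
refined = affinity @ last             # (1, num_patches, dim)
print(refined.shape)
```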
4.GeoBEV: Learning Geometric BEV Representation for Multi-view 3D Object Detection. (Jinqing Zhang, Yanan Zhang, Yunlong Qi, Zehua Fu, Qingjie Liu, and Yunhong Wang)
Bird's-Eye-View (BEV) representation has emerged as a mainstream paradigm for multi-view 3D object detection, demonstrating impressive perceptual capabilities. However, existing methods overlook the geometric quality of the BEV representation, leaving it in a low-resolution state and failing to restore the authentic geometric information of the scene. In this paper, we identify the drawbacks of previous approaches that limit the geometric quality of the BEV representation and propose Radial-Cartesian BEV Sampling (RC-Sampling), which outperforms other feature transformation methods in efficiently generating high-resolution dense BEV representations to restore fine-grained geometric information. Additionally, we design a novel In-Box Label to replace the traditional depth label generated from LiDAR points. This label reflects the actual geometric structure of objects rather than just their surfaces, injecting real-world geometric information into the BEV representation. In conjunction with the In-Box Label, a Centroid-Aware Inner Loss (CAI Loss) is developed to capture the inner geometric structure of objects. Finally, we integrate the aforementioned modules into a novel multi-view 3D object detection framework, dubbed GeoBEV, which achieves state-of-the-art performance on the nuScenes test set.
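To make the radial-to-Cartesian resampling idea concrete, the toy function below converts Cartesian BEV grid coordinates into range-azimuth coordinates and samples a radial feature map with grid_sample; the grid sizes, ranges, and random feature map are arbitrary assumptions, and this is not the released GeoBEV code.

```python
# Toy radial (range-azimuth) feature map resampled onto a Cartesian BEV grid.
import torch
import torch.nn.functional as F

def radial_to_cartesian(radial_feat, bev_size=128, max_range=50.0):
    """radial_feat: (B, C, R, A) over radius x azimuth; returns (B, C, bev_size, bev_size)."""
    B = radial_feat.shape[0]
    coords = torch.linspace(-max_range, max_range, bev_size)
    y, x = torch.meshgrid(coords, coords, indexing="ij")
    r = torch.sqrt(x**2 + y**2)
    theta = torch.atan2(y, x)                      # in [-pi, pi]
    # Normalize to grid_sample's [-1, 1] sampling coordinates.
    r_norm = r / max_range * 2 - 1                 # radius axis (height of radial_feat)
    a_norm = theta / torch.pi                      # azimuth axis (width of radial_feat)
    grid = torch.stack((a_norm, r_norm), dim=-1)   # (H, W, 2): x = azimuth, y = radius
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(radial_feat, grid, align_corners=False)

if __name__ == "__main__":
    radial = torch.randn(1, 64, 100, 360)          # toy radius x azimuth feature map
    bev = radial_to_cartesian(radial)
    print(bev.shape)                               # torch.Size([1, 64, 128, 128])
```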
5.TCAQ-DM: Timestep-Channel Adaptive Quantization for Diffusion Models. (Haocheng Huang, Jiaxin Chen, Jinyang Guo, Ruiyi Zhan, and Yunhong Wang)
Diffusion models have achieved remarkable success in image and video generation tasks. Nevertheless, they often incur large memory and time overheads during inference, due to their complex network architectures and the considerable number of timesteps required for iterative diffusion. Recently, post-training quantization (PTQ) has proven to be a promising way to reduce the inference cost by quantizing floating-point operations to low-bit ones. However, most existing methods fail to handle the large variations in activation distributions across distinct channels and timesteps, as well as the input inconsistency between quantization and inference in diffusion models, leaving considerable room for improvement. To address these issues, we propose a novel method dubbed Timestep-Channel Adaptive Quantization for Diffusion Models (TCAQ-DM). Specifically, we develop a timestep-channel joint reparameterization (TCR) module to balance the activation range along both timesteps and channels, facilitating the subsequent reconstruction procedure. We further employ a dynamically adaptive quantization (DAQ) module that mitigates the quantization error by selecting an optimal quantizer for each post-Softmax layer according to its specific distribution type. Moreover, we present a progressively aligned reconstruction (PAR) strategy to mitigate the bias caused by the input mismatch. Extensive experiments on various benchmarks and distinct diffusion models demonstrate that the proposed method substantially outperforms state-of-the-art approaches in most cases.
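As a hedged illustration of distribution-aware quantizer selection for post-Softmax activations, the snippet below compares a uniform quantizer against a log2 quantizer by calibration MSE and keeps the better one; both quantizers and the selection rule are simplified stand-ins, not the paper's DAQ module.

```python
# Toy quantizer selection on calibration post-softmax activations.
import torch

def uniform_quant(x, bits=8):
    scale = x.max() / (2**bits - 1)
    return torch.round(x / scale).clamp(0, 2**bits - 1) * scale

def log2_quant(x, bits=8, eps=1e-10):
    # Quantize the exponent; softmax outputs lie in (0, 1], so exponents are <= 0.
    exp = torch.clamp(torch.round(-torch.log2(x + eps)), 0, 2**bits - 1)
    return torch.pow(2.0, -exp)

def pick_quantizer(calib_act, bits=8):
    candidates = {"uniform": uniform_quant, "log2": log2_quant}
    errs = {name: torch.mean((fn(calib_act, bits) - calib_act) ** 2).item()
            for name, fn in candidates.items()}
    return min(errs, key=errs.get), errs

if __name__ == "__main__":
    attn = torch.softmax(torch.randn(8, 16, 64, 64), dim=-1)  # toy post-softmax activations
    print(pick_quantizer(attn))
```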
6.3D2-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling. (Zichen Tang, Hongyu Yang, Hanchen Zhang, Jiaxin Chen, and Di Huang)
Advancements in neural implicit representations and differentiable rendering have markedly improved the ability to learn animatable 3D avatars from sparse multi-view RGB videos. However, current methods that map observation space to canonical space often struggle to capture pose-dependent details and to generalize to novel poses. While diffusion models have demonstrated remarkable zero-shot capabilities in 2D image generation, their potential for creating animatable 3D avatars from 2D inputs remains underexplored. In this work, we introduce 3D2-Actor, a pose-conditioned 3D-aware human modeling pipeline that alternates between 2D denoising and 3D rectifying steps. Guided by pose cues, the 2D denoiser generates detailed multi-view images that provide the rich features necessary for high-fidelity 3D reconstruction, while also refining the output of the preceding 3D rectifying step. Complementing this, our Gaussian-based 3D rectifier renders images with enhanced 3D consistency through a two-stage projection strategy and a novel local coordinate representation. Additionally, we propose a new inter-frame sampling strategy to ensure smooth temporal continuity in video synthesis. Our method effectively addresses the limitations of traditional numerical solutions in handling ill-posed mappings, producing realistic and animatable 3D human avatars. Experimental results demonstrate that 3D2-Actor excels in high-fidelity avatar modeling and generalizes robustly to novel poses.
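The schematic below only mirrors the alternating structure described above, with dummy modules standing in for the pose-conditioned 2D denoiser and the Gaussian-based 3D rectifier; every class name, shape, and update rule here is a hypothetical placeholder rather than the 3D2-Actor implementation.

```python
# Schematic alternation of 2D denoising and 3D rectifying steps with dummy modules.
import torch
import torch.nn as nn

class Dummy2DDenoiser(nn.Module):
    def forward(self, views, pose):
        # Placeholder for pose-guided multi-view denoising.
        return views + 0.1 * torch.tanh(pose.mean()) * torch.randn_like(views)

class Dummy3DRectifier(nn.Module):
    def forward(self, views):
        # Placeholder for Gaussian-based re-rendering that enforces cross-view consistency.
        return views.mean(dim=0, keepdim=True).expand_as(views)

def alternate(views, pose, rounds=3):
    denoiser, rectifier = Dummy2DDenoiser(), Dummy3DRectifier()
    for _ in range(rounds):
        views = denoiser(views, pose)   # 2D denoising step
        views = rectifier(views)        # 3D rectifying step
    return views

if __name__ == "__main__":
    multi_view = torch.randn(4, 3, 64, 64)   # four toy views
    pose = torch.randn(24, 3)                # toy SMPL-like pose parameters
    print(alternate(multi_view, pose).shape)
```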
7.Multi-modal Deepfake Detection via Multi-task Audio-Visual Prompt Learning. (Hui Miao, Yuanfang Guo, Zeming Liu, and Yunhong Wang)
With the malicious use and dissemination of multi-modal deepfake videos in recent years, researchers have started to investigate multi-modal deepfake detection. Unfortunately, most existing methods tune all parameters on limited speech video datasets and are trained with only coarse-grained consistency supervision, which hinders their generalization ability in practical scenarios. To address these problems, we propose the first multi-task audio-visual prompt learning method for multi-modal deepfake video detection, built upon audio and visual foundation models. Specifically, we construct a two-stream multi-task learning architecture and propose sequential visual prompts and short-time audio prompts to extract multi-modal features, which can be aligned at the frame level and utilized for subsequent fine-grained cross-modal feature matching and fusion. Since visual content and audio signals are naturally aligned in real data, we propose a frame-level cross-modal feature matching loss to learn fine-grained audio-visual consistency. Comprehensive experiments demonstrate the effectiveness and generalization ability of our method compared with state-of-the-art methods.
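As a minimal sketch of what a frame-level cross-modal matching loss could look like, the snippet below computes cosine similarity between per-frame visual and audio embeddings and penalizes misalignment only for real videos, where natural audio-visual correspondence is expected; the exact formulation is a simplified construction, not the paper's loss.

```python
# Toy frame-level audio-visual matching loss applied to real samples only.
import torch
import torch.nn.functional as F

def frame_matching_loss(visual, audio, is_real):
    """visual, audio: (B, T, D) frame-level features; is_real: (B,) float mask in {0, 1}."""
    sim = F.cosine_similarity(visual, audio, dim=-1)      # (B, T) per-frame similarity
    per_video = (1.0 - sim).mean(dim=1)                   # mean mismatch per video
    # Only real videos are pushed toward tight audio-visual alignment.
    return (per_video * is_real).sum() / is_real.sum().clamp(min=1.0)

if __name__ == "__main__":
    v = torch.randn(2, 16, 128)
    a = torch.randn(2, 16, 128)
    labels = torch.tensor([1.0, 0.0])                     # first video real, second fake
    print(frame_matching_loss(v, a, labels))
```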