Five Papers from the Lab Accepted to the European Conference on Computer Vision (ECCV)

The IRIP Lab has had a total of five papers accepted to the European Conference on Computer Vision (ECCV) 2024 this year! Together with CVPR and ICCV, ECCV is regarded as one of the three top conferences in computer vision and is held every two years. This edition accepted 2,395 papers in total. The conference will be held in Milan, Italy, from September 29 to October 4, 2024.

Brief introductions to the accepted papers follow:

1. Multi-modal Relation Distillation for Unified 3D Representation Learning. (Huiqun Wang, Yiping Bao, Zeming Li, Panwang Pan, Xiao Liu, Ruijie Yang, and Di Huang)

Multi-modal pre-training for 3D point clouds builds cross-modal 3D representations by aligning each sample's representations across the point cloud, image, and text modalities. Existing methods mainly focus on sample-level multi-modal alignment while ignoring the complex structural relationships among different samples in the pre-trained vision-language space. To address this problem, we propose a multi-modal relation distillation framework that explores intra-modal and cross-modal relation representations together with a dynamic relation distillation scheme, yielding more discriminative 3D point cloud representations. The proposed method delivers substantial gains on downstream zero-shot 3D classification and cross-modal retrieval tasks, achieving state-of-the-art performance under the same parameter budget.

Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning multi-modal features across 3D shapes, corresponding 2D images, and language descriptions. However, this straightforward alignment often overlooks the intricate structural relationships among the samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pretraining framework designed to effectively distill state-of-the-art large multi-modal models into 3D backbones. MRD focuses on distilling both the intra-relations within each modality and the cross-relations between different modalities, aiming to produce more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality retrieval tasks, delivering state-of-the-art performance.
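To make the relation-distillation idea more concrete, the following is a minimal PyTorch sketch of matching batch-level relation matrices between a 3D student encoder and frozen image/text teachers. The similarity measure, temperature, and KL objective are illustrative assumptions and not MRD's released implementation.

```python
# Illustrative sketch only: relation distillation over a batch of paired features.
# Shapes, temperature, and the KL objective are assumptions, not MRD's actual code.
import torch
import torch.nn.functional as F

def relation(a: torch.Tensor, b: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    """Row-normalized pairwise similarity distribution between two feature sets."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return F.softmax(a @ b.t() / t, dim=-1)            # (B, B) relation matrix

def relation_distill_loss(pc_feats, img_feats, txt_feats):
    """Match the student's 3D relation structure to the teachers' relation structure."""
    pairs = [
        # Intra-modal: 3D-vs-3D relations should mimic image-vs-image and text-vs-text.
        (relation(pc_feats, pc_feats), relation(img_feats, img_feats)),
        (relation(pc_feats, pc_feats), relation(txt_feats, txt_feats)),
        # Cross-modal: 3D-vs-text relations should mimic image-vs-text.
        (relation(pc_feats, txt_feats), relation(img_feats, txt_feats)),
    ]
    return sum(F.kl_div(s.log(), t.detach(), reduction="batchmean") for s, t in pairs)

if __name__ == "__main__":
    pc = torch.randn(8, 512, requires_grad=True)       # 3D point cloud encoder output
    img = torch.randn(8, 512)                          # frozen image encoder output
    txt = torch.randn(8, 512)                          # frozen text encoder output
    relation_distill_loss(pc, img, txt).backward()
    print(pc.grad.shape)                               # torch.Size([8, 512])
```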

2. Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes. (Zhi Cai, Yingjie Gao, Yaoyan Zheng, Nan Zhou, and Di Huang)

Mainstream pedestrian detection methods typically rely on large amounts of annotated data, yet annotating crowded scenes is time-consuming and labor-intensive. To address this problem, this paper proposes Crowd-SAM, a few-shot pedestrian detection/segmentation method built on the general-purpose segmentation model SAM. First, to localize pedestrians, Crowd-SAM extracts semantically rich features with DINO and produces a coarse pedestrian distribution heatmap through a learnable segmentation head. Second, an efficient prompt sampler is designed to handle dense point prompts, greatly speeding up SAM decoding while preserving accuracy. Finally, to cope with heavy occlusions in crowded scenes, a part-whole discrimination network selects among the semantically ambiguous masks decoded by SAM, yielding high-quality mask predictions. With only a small number of labeled samples and an efficient fine-tuning strategy, Crowd-SAM achieves performance comparable to fully supervised methods on mainstream pedestrian detection benchmarks such as CrowdHuman and CityScapes, and substantially outperforms state-of-the-art few-shot detectors.

In computer vision, pedestrian detection is an important task with applications in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Current solutions to this problem often rely on massive unlabeled data and extensive training to achieve promising results. In this paper, we propose Crowd-SAM, a novel few-shot object detection/segmentation framework that utilizes SAM for pedestrian detection. Crowd-SAM combines the advantages of DINO and SAM to first localize the foreground and then generate dense prompts conditioned on the resulting heatmap. Specifically, we design an efficient prompt sampler to deal with the dense point prompts and a Part-Whole Discrimination Network that re-scores the decoded masks and selects the proper ones. Our experiments are conducted on the CrowdHuman and CityScapes benchmarks. On CrowdHuman, Crowd-SAM achieves 78.4 AP, which is comparable to fully supervised object detectors and state-of-the-art among few-shot object detectors, using 10-shot labeled images and fewer than 1M learnable parameters.
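As a rough illustration of what an efficient prompt sampler might look like, the sketch below thins a dense foreground heatmap into a bounded set of point prompts by keeping only local maxima above a threshold. The kernel size, threshold, and point budget are assumptions chosen for illustration and do not reflect Crowd-SAM's actual sampler.

```python
# Hypothetical sketch: turn a coarse foreground heatmap into sparse point prompts.
# Threshold, NMS kernel, and point budget are illustrative assumptions.
import torch
import torch.nn.functional as F

def sample_point_prompts(heatmap: torch.Tensor, threshold: float = 0.5,
                         nms_kernel: int = 7, max_points: int = 256) -> torch.Tensor:
    """Keep local maxima of the heatmap above `threshold` as (x, y) point prompts."""
    h = heatmap[None, None]                                       # (1, 1, H, W)
    local_max = F.max_pool2d(h, nms_kernel, stride=1, padding=nms_kernel // 2)
    keep = (h == local_max) & (h > threshold)                     # peak positions only
    ys, xs = torch.nonzero(keep[0, 0], as_tuple=True)
    order = heatmap[ys, xs].argsort(descending=True)[:max_points]
    return torch.stack([xs[order], ys[order]], dim=-1)            # (N, 2), N <= max_points

if __name__ == "__main__":
    hm = torch.rand(256, 256)        # stand-in for a predicted pedestrian heatmap
    prompts = sample_point_prompts(hm)
    print(prompts.shape)             # at most (256, 2) prompts to feed a mask decoder
```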

3. AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer. (Zhuguanyu Wu, Jiaxin Chen, Hanwen Zhong, Di Huang, and Yunhong Wang)

In post-training quantization of Vision Transformers, quantizing activations that follow power-law distributions, such as attention maps, has long been an important topic. Existing methods for such distributions mostly rely on log-based non-uniform quantizers. However, existing logarithmic quantizers can hardly adjust to the data distribution and the quantization bit-width dynamically, leading to severe accuracy degradation under low-bit quantization. To address this, we introduce AdaLog, a hardware-friendly base-adaptive logarithmic quantizer that adaptively selects a suitable logarithmic base according to the data distribution and enables hardware-friendly inference through lookup table operations. By introducing a bias reparameterization technique, we further apply the AdaLog quantizer to post-GELU activation quantization. In addition, we propose a fast joint hyper-parameter search algorithm to determine the quantization hyper-parameters accurately. We extensively validate the effectiveness of our method on classification, detection, and segmentation tasks.

In post-training quantization (PTQ) of Vision Transformers, the quantization of activations that follow power-law distributions, such as attention maps, has long been a significant topic. Existing methods primarily employ non-uniform logarithmic quantizers for such distributions. However, these logarithmic quantizers struggle to adjust dynamically to the data distribution and the quantization bit-width, resulting in severe accuracy loss in low-bit quantization. To address these issues, we introduce a hardware-friendly base-adaptive logarithmic quantizer, dubbed AdaLog, which adapts the logarithmic base to accommodate the power-law-like distribution of activations while allowing hardware-friendly quantization and de-quantization through lookup table operations. By incorporating a bias reparameterization technique, we successfully apply the AdaLog quantizer to post-GELU activation quantization. Additionally, we propose a Fast Progressive Combining Search (FPCS) strategy to accurately determine the quantization parameters. We extensively verify the effectiveness of our method across various tasks, including classification, detection, and segmentation.
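For intuition, the toy sketch below shows base-adaptive logarithmic quantization of attention-like activations, with de-quantization performed through a small lookup table. The base value, clipping range, and table layout are assumptions and not AdaLog's actual algorithm.

```python
# Toy sketch of base-adaptive logarithmic quantization with a lookup table.
# The base, bit-width handling, and clipping are illustrative assumptions.
import math
import torch

def log_quantize(x: torch.Tensor, base: float = 1.7, bits: int = 4, eps: float = 1e-8):
    """Quantize activations in (0, 1] to integer exponents of `base`, then de-quantize."""
    levels = 2 ** bits
    # Quantization: index = round(-log_base(x)), clipped to the available codebook.
    idx = torch.round(-torch.log(x.clamp(min=eps)) / math.log(base))
    idx = idx.clamp(0, levels - 1).long()
    # De-quantization via a lookup table of base ** (-index), amenable to hardware LUTs.
    table = base ** (-torch.arange(levels, dtype=x.dtype))
    return table[idx], idx

if __name__ == "__main__":
    attn = torch.softmax(torch.randn(2, 8), dim=-1)    # power-law-like attention values
    x_hat, idx = log_quantize(attn, base=1.7, bits=4)
    print((attn - x_hat).abs().mean())                 # mean absolute quantization error
```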

4. MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection. (Ziyue Huang, Yongchao Feng, Qingjie Liu, and Yunhong Wang)

Detection pre-training for DETR-style detectors has been studied extensively in natural scenes, yet it remains unexplored in remote sensing scenes. In existing pre-training methods, aligning the object embeddings extracted from a pre-trained backbone with the detector features is crucial. However, because the two are extracted in different ways, a pronounced feature discrepancy exists between them and degrades the pre-training effect, and remote sensing images, with their complex environments and more densely distributed objects, further exacerbate this discrepancy. In this work, we propose MutDet, a novel mutually optimizing pre-training framework for remote sensing object detection, which offers a systematic solution to this challenge. First, we propose a mutual enhancement module that fuses object embeddings and detector features bidirectionally in the last encoder layer, strengthening the information interaction between them. Second, we adopt a contrastive alignment loss to softly guide this alignment process while simultaneously enhancing the discriminability of detector features. Finally, we design an auxiliary siamese head to mitigate the task gap introduced by the enhancement module. Comprehensive experiments across various settings show that the framework achieves the best transfer performance, and the improvement is particularly pronounced when data are limited.

Detection pre-training methods for DETR-series detectors have been extensively studied in natural scenes, e.g., DETReg. However, detection pre-training remains unexplored in remote sensing scenes. In existing pre-training methods, the alignment between object embeddings extracted from a pre-trained backbone and detector features is significant. However, due to differences in how the features are extracted, a pronounced feature discrepancy still exists and hinders pre-training performance. Remote sensing images, with their complex environments and more densely distributed objects, exacerbate this discrepancy. In this work, we propose a novel Mutually optimizing pre-training framework for remote sensing object Detection, dubbed MutDet, which offers a systematic solution to this challenge. Firstly, we propose a mutual enhancement module, which fuses object embeddings and detector features bidirectionally in the last encoder layer, enhancing their information interaction. Secondly, a contrastive alignment loss is employed to softly guide this alignment process while simultaneously enhancing the discriminability of detector features. Finally, we design an auxiliary siamese head to mitigate the task gap arising from the introduction of the enhancement module. Comprehensive experiments on various settings show new state-of-the-art transfer performance, and the improvement is particularly pronounced when data quantity is limited.
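To give a concrete feel for the soft alignment step, here is a minimal sketch of a symmetric InfoNCE-style contrastive alignment loss between detector features and backbone object embeddings. The temperature and the symmetric formulation are assumptions rather than MutDet's exact loss.

```python
# Minimal sketch: symmetric InfoNCE-style alignment between index-paired features.
# The temperature and symmetric form are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(detector_feats: torch.Tensor,
                               object_embeds: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """Pull each detector feature toward its paired object embedding, push apart the rest."""
    q = F.normalize(detector_feats, dim=-1)            # (N, D) detector features
    k = F.normalize(object_embeds, dim=-1)             # (N, D) index-aligned embeddings
    logits = q @ k.t() / temperature                   # (N, N) similarity logits
    targets = torch.arange(q.size(0), device=q.device)
    # Matching pairs sit on the diagonal; score them against all other candidates.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    det = torch.randn(32, 256, requires_grad=True)     # detector features (e.g., queries)
    obj = torch.randn(32, 256)                         # embeddings from a frozen backbone
    print(contrastive_alignment_loss(det, obj))
```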

5. FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection. (Zheng Jiang, Jinqing Zhang, Yanan Zhang, Qingjie Liu, Zhenghui Hu, Baohui Wang, and Yunhong Wang)

Although multi-view 3D object detection based on the bird's-eye-view (BEV) paradigm has attracted wide attention as an economical and easily deployable perception solution for autonomous driving, its performance still lags behind LiDAR-based methods. In recent years, several cross-modal knowledge distillation methods have been proposed to transfer useful information from a teacher model to a student model to improve performance. However, these methods struggle with distribution discrepancies caused by different data modalities and network structures, which make knowledge transfer extremely difficult. In this paper, we propose a foreground self-distillation (FSD) scheme that effectively sidesteps the distribution discrepancy problem and maintains strong distillation effects without a pre-trained teacher model or complex distillation strategies. In addition, we design two point cloud intensification (PCI) strategies that compensate for point cloud sparsity through frame combination and pseudo point assignment. Finally, we develop a multi-scale foreground enhancement (MSFE) module that extracts and fuses multi-scale foreground features via predicted elliptical Gaussian heatmaps, further improving performance. We integrate all of these innovations into a unified framework named FSD-BEV. Extensive experiments on the nuScenes dataset demonstrate that FSD-BEV achieves state-of-the-art performance, highlighting its effectiveness.

Although multi-view 3D object detection based on the Bird's-Eye-View (BEV) paradigm has garnered widespread attention as an economical and deployment-friendly perception solution for autonomous driving, there is still a performance gap compared to LiDAR-based methods. In recent years, several cross-modal distillation methods have been proposed to transfer beneficial information from teacher models to student models, with the aim of enhancing performance. However, these methods face challenges due to discrepancies in feature distribution originating from different data modalities and network structures, making knowledge transfer exceptionally difficult. In this paper, we propose a Foreground Self-Distillation (FSD) scheme that effectively avoids the issue of distribution discrepancies, maintaining remarkable distillation effects without the need for pre-trained teacher models or cumbersome distillation strategies. Additionally, we design two Point Cloud Intensification (PCI) strategies to compensate for the sparsity of point clouds through frame combination and pseudo point assignment. Finally, we develop a Multi-Scale Foreground Enhancement (MSFE) module to extract and fuse multi-scale foreground features using predicted elliptical Gaussian heatmaps, further improving the model's performance. We integrate all of the above innovations into a unified framework named FSD-BEV. Extensive experiments on the nuScenes dataset demonstrate that FSD-BEV achieves state-of-the-art performance, highlighting its effectiveness.
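For intuition on the elliptical Gaussian heatmaps used in the MSFE module, the sketch below renders one such heatmap around a single object center with separate spreads along x and y. The sigma values and the max-splatting rule are assumptions and not FSD-BEV's implementation.

```python
# Hypothetical sketch: render an elliptical Gaussian heatmap around an object center.
# Sigma values and the max-splatting rule are illustrative assumptions.
import torch

def draw_elliptical_gaussian(heatmap: torch.Tensor, center, sigma_x: float, sigma_y: float):
    """Splat exp(-(dx^2 / (2*sx^2) + dy^2 / (2*sy^2))) around `center` onto `heatmap` in place."""
    H, W = heatmap.shape
    cx, cy = center
    ys = torch.arange(H, dtype=heatmap.dtype).view(-1, 1)    # column of row coordinates
    xs = torch.arange(W, dtype=heatmap.dtype).view(1, -1)    # row of column coordinates
    g = torch.exp(-(((xs - cx) ** 2) / (2 * sigma_x ** 2) +
                    ((ys - cy) ** 2) / (2 * sigma_y ** 2)))
    torch.maximum(heatmap, g, out=heatmap)                    # keep the stronger response
    return heatmap

if __name__ == "__main__":
    hm = torch.zeros(128, 128)
    draw_elliptical_gaussian(hm, center=(40.0, 60.0), sigma_x=6.0, sigma_y=3.0)
    print(hm[60, 40].item())    # 1.0 at the center (row 60, column 40)
```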