
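Five papers from the IRIP Lab have been accepted this year to ECCV 2024, the European Conference on Computer Vision. Together with CVPR and ICCV, ECCV is regarded as one of the three top conferences in computer vision, and it is held every two years. ECCV 2024 accepted 2,395 papers in total. The conference will take place in Milan, Italy, from September 29 to October 4, 2024.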


1. Multi-modal Relation Distillation for Unified 3D Representation Learning. (Huiqun Wang, Yiping Bao, Zeming Li, Panwang Pan, Xiao Liu, Ruijie Yang, and Di Huang)


Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning multi-modal features across 3D shapes, corresponding 2D images, and language descriptions. However, this straightforward alignment often overlooks the intricate structural relationships among the samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework designed to effectively distill state-of-the-art large multi-modal models into 3D backbones. MRD focuses on distilling both the intra-relations within each modality and the cross-relations between different modalities, aiming to produce more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality retrieval tasks, delivering state-of-the-art performance.
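As an illustration of what distilling relations (rather than individual features) can mean, here is a minimal sketch in plain Python. It is not the MRD loss from the paper: the function names, the cosine-similarity relation, the softmax temperature, and the KL objective are all assumptions chosen for illustration. The point it shows is that relations are computed among the samples of a batch, so teacher and student features may have different dimensions.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def relation_matrix(feats_a, feats_b, temperature=0.1):
    """Row-wise softmax over cosine similarities between two feature sets.

    Passing the same set twice gives an intra-modality relation;
    passing two modalities gives a cross-modality relation.
    """
    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    rows = []
    for u in a:
        logits = [sum(x * y for x, y in zip(u, w)) / temperature for w in b]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def relation_kl(teacher_rel, student_rel, eps=1e-12):
    """Mean KL(teacher || student) over the rows of two relation matrices."""
    total = 0.0
    for t_row, s_row in zip(teacher_rel, student_rel):
        total += sum(t * math.log((t + eps) / (s + eps))
                     for t, s in zip(t_row, s_row))
    return total / len(teacher_rel)

# Hypothetical toy batch: 2-D teacher (image) features, 3-D student (point cloud) features.
teacher_feats = [[1.0, 0.0], [0.0, 1.0]]
student_feats = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
loss = relation_kl(relation_matrix(teacher_feats, teacher_feats),
                   relation_matrix(student_feats, student_feats))
```

In this toy batch the teacher treats the two samples as dissimilar while the student places them close together, so the relation loss is positive even though no feature-to-feature alignment is computed.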

2. Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes. (Zhi Cai, Yingjie Gao, Yaoyan Zheng, Nan Zhou, and Di Huang)


In computer vision, pedestrian detection is an important task with applications in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Current solutions to this problem often rely on massive unlabeled data and extensive training to achieve promising results. In this paper, we propose Crowd-SAM, a novel few-shot object detection/segmentation framework that utilizes SAM for pedestrian detection. Crowd-SAM combines the advantages of DINO and SAM to localize the foreground first and then generate dense prompts conditioned on the heatmap. Specifically, we design an efficient prompt sampler to deal with the dense point prompts and a Part-Whole Discrimination Network that re-scores the decoded masks and selects the proper ones. Our experiments are conducted on the CrowdHuman and CityScapes benchmarks. On CrowdHuman, Crowd-SAM achieves 78.4 AP, which is comparable to fully supervised object detectors and state of the art among few-shot object detectors, with 10-shot labeled images and fewer than 1M learnable parameters.

3. AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer. (Zhuguanyu Wu, Jiaxin Chen, Hanwen Zhong, Di Huang, and Yunhong Wang)


In post-training quantization (PTQ) of Vision Transformers, the quantization of activations that follow power-law distributions, such as attention maps, has long been a significant topic. Existing methods primarily employ non-uniform logarithmic quantizers for quantizing power-law distributions. However, these logarithmic quantizers struggle to adjust dynamically to the data distribution and quantization bit-width, resulting in severe accuracy loss in low-bit quantization. To address these issues, we introduce a hardware-friendly base-adaptive logarithmic quantizer, dubbed AdaLog, which adapts the logarithmic base to accommodate the power-law-like distribution of activations and simultaneously allows for hardware-friendly quantization and de-quantization through lookup table operations. By incorporating bias reparameterization techniques, we successfully apply the AdaLog quantizer to post-GELU activation quantization. Additionally, we propose a Fast Progressive Combining Search (FPCS) strategy to accurately determine quantization parameters. We extensively verify the effectiveness of our method across various tasks, including classification, detection, and segmentation.
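To make the idea of a logarithmic quantizer with an adjustable base concrete, the following is a minimal sketch in plain Python. It is a generic textbook-style log quantizer, not the AdaLog formulation from the paper: the function names, the rounding rule, and the assumption that inputs lie in (0, 1] (as attention probabilities do) are illustrative choices, and the bias reparameterization and FPCS search are not covered.

```python
import math

def log_quantize(x, base=2.0, bits=4):
    """Map x in (0, 1] to an integer code q such that x is roughly base**(-q)."""
    levels = 2 ** bits
    if x <= 0.0:
        return levels - 1  # map non-positive inputs to the smallest magnitude
    q = round(-math.log(x) / math.log(base))
    return max(0, min(levels - 1, q))  # clamp to the representable codes

def build_dequant_table(base=2.0, bits=4):
    """Precompute base**(-q) for every code so de-quantization is a table lookup."""
    return [base ** (-q) for q in range(2 ** bits)]

# A smaller base spends more codes near 1.0; a larger base reaches smaller values.
table = build_dequant_table(base=math.sqrt(2.0), bits=4)
code = log_quantize(0.3, base=math.sqrt(2.0), bits=4)
approx = table[code]
```

With a fixed base of 2 and few bits, the codes are spread coarsely over the value range; changing the base trades resolution near 1.0 against coverage of small values, which is the degree of freedom a base-adaptive quantizer tunes.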

4. MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection. (Ziyue Huang, Yongchao Feng, Qingjie Liu, and Yunhong Wang)


Detection pre-training methods for the DETR series of detectors, e.g., DETReg, have been extensively studied in natural scenes. However, detection pre-training remains unexplored in remote sensing scenes. In existing pre-training methods, alignment between object embeddings extracted from a pre-trained backbone and detector features is important. However, because the two are extracted in different ways, a pronounced feature discrepancy still exists and hinders pre-training performance. Remote sensing images, with complex environments and more densely distributed objects, exacerbate the discrepancy. In this work, we propose a novel Mutually optimizing pre-training framework for remote sensing object Detection, dubbed MutDet. In MutDet, we propose a systematic solution to this challenge. Firstly, we propose a mutual enhancement module, which fuses the object embeddings and detector features bidirectionally in the last encoder layer, enhancing their information interaction. Secondly, a contrastive alignment loss is employed to softly guide this alignment process and simultaneously enhance the discriminability of the detector features. Finally, we design an auxiliary siamese head to mitigate the task gap arising from the introduction of the enhancement module. Comprehensive experiments in various settings show new state-of-the-art transfer performance. The improvement is particularly pronounced when data quantity is limited.

5. FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection. (Zheng Jiang, Jinqing Zhang, Yanan Zhang, Qingjie Liu, Zhenghui Hu, Baohui Wang, and Yunhong Wang)


Although multi-view 3D object detection based on the Bird's-Eye-View (BEV) paradigm has garnered widespread attention as an economical and deployment-friendly perception solution for autonomous driving, there is still a performance gap compared to LiDAR-based methods. In recent years, several cross-modal distillation methods have been proposed to transfer beneficial information from teacher models to student models, with the aim of enhancing performance. However, discrepancies in feature distribution originating from different data modalities and network structures make knowledge transfer exceptionally challenging for these methods. In this paper, we propose a Foreground Self-Distillation (FSD) scheme that effectively avoids the issue of distribution discrepancies, maintaining remarkable distillation effects without the need for pre-trained teacher models or cumbersome distillation strategies. Additionally, we design two Point Cloud Intensification (PCI) strategies to compensate for the sparsity of point clouds by frame combination and pseudo point assignment. Finally, we develop a Multi-Scale Foreground Enhancement (MSFE) module to extract and fuse multi-scale foreground features using a predicted elliptical Gaussian heatmap, further improving the model's performance. We integrate all the above innovations into a unified framework named FSD-BEV. Extensive experiments on the nuScenes dataset show that FSD-BEV achieves state-of-the-art performance, highlighting its effectiveness.
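For readers unfamiliar with elliptical Gaussian heatmaps, the sketch below renders one on a small grid in plain Python. It only illustrates what such a heatmap looks like: in FSD-BEV the heatmap is predicted by the network, and the function name, the axis-aligned form, and the grid and sigma values here are assumptions for illustration rather than details from the paper.

```python
import math

def elliptical_gaussian_heatmap(height, width, cx, cy, sigma_x, sigma_y):
    """Render an axis-aligned elliptical Gaussian peaked at (cx, cy).

    Separate sigmas along x and y let the response follow an elongated
    object footprint instead of a circular blob.
    """
    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            dx = (x - cx) ** 2 / (2.0 * sigma_x ** 2)
            dy = (y - cy) ** 2 / (2.0 * sigma_y ** 2)
            row.append(math.exp(-(dx + dy)))
        heatmap.append(row)
    return heatmap

# Hypothetical 9x9 BEV patch with an object that is longer along x than along y.
hm = elliptical_gaussian_heatmap(9, 9, cx=4, cy=4, sigma_x=3.0, sigma_y=1.0)
```

The response decays slowly along x and quickly along y, so weighting a feature map by such a heatmap would emphasize cells inside the elongated footprint and suppress the background.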