1. Denoising Diffusion Autoencoders are Unified Self-supervised Learners (Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang) oral
Inspired by generative pre-training and denoising autoencoders, this paper investigates whether diffusion models can acquire discriminative capabilities by pre-training on image generation. The paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), have already learned strongly linearly separable representations at their intermediate layers, without auxiliary objectives or modifications. To verify this, we perform linear-probe and fine-tuning evaluations on image classification datasets. Our diffusion-based approach achieves 95.9% and 50.0% linear-probe accuracies on CIFAR-10 and Tiny-ImageNet, respectively, and is comparable to masked autoencoders and contrastive learning for the first time.
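The linear-probe protocol above can be sketched in a few lines. This is an illustrative stand-in, not the paper's pipeline: the synthetic `feats` array plays the role of frozen intermediate DDAE activations (which in practice would be pooled feature maps extracted from the denoising network at a chosen layer and timestep), and a closed-form ridge classifier substitutes for the usual logistic-regression probe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for frozen intermediate DDAE activations:
# 600 samples, 64-dim features, 10 classes with separable class centers.
n, d, k = 600, 64, 10
centers = rng.normal(size=(k, d)) * 3.0
labels = rng.integers(0, k, size=n)
feats = centers[labels] + rng.normal(size=(n, d))

# Linear probe: fit a ridge-regularized linear classifier on the frozen
# features (closed form on one-hot targets); the encoder is never updated.
onehot = np.eye(k)[labels]
W = np.linalg.solve(feats.T @ feats + 1e-2 * np.eye(d), feats.T @ onehot)
pred = (feats @ W).argmax(axis=1)
acc = (pred == labels).mean()
```

High probe accuracy on such features is exactly what "linearly separable representations" means: a single linear layer suffices to classify them.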
2. Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection (Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, and Di Huang)
Anomaly detection (AD), aiming to detect samples that deviate from the training distribution, is essential in safety-critical applications. Due to the intractability of collecting all kinds of anomalies, it is practical to study the setting where outlier detectors are developed solely based on in-distribution data. Though recent self-supervised learning-based attempts achieve promising results by creating virtual outliers, their training objectives are less faithful to AD, which requires both a concentrated inlier distribution and a dispersive outlier distribution. In this paper, we propose Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation (UniCon-HA), in consideration of both the above requirements. Specifically, we explicitly encourage the concentration of inliers and the dispersion of virtual outliers via supervised and unsupervised contrastive losses, respectively. Considering that standard contrastive data augmentation for generating positive views may itself induce outliers, we additionally introduce a soft mechanism to re-weight each augmented inlier according to its deviation from the inlier distribution, ensuring a purified concentration. Moreover, to promote a higher concentration, inspired by curriculum learning, we adopt an easy-to-hard hierarchical augmentation strategy and perform contrastive aggregation at different depths of the network based on the strengths of data augmentation. Our method is evaluated under three AD settings, including unlabeled one-class, unlabeled multi-class, and labeled multi-class, demonstrating its consistent superiority over other competitors.
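The soft re-weighting idea can be illustrated with a minimal sketch. The details here (prototype as the mean inlier embedding, cosine similarity, softmax weighting with temperature `tau`) are assumptions for illustration, not the paper's exact formulation: views that deviate more from the inlier prototype receive smaller weights, so strongly distorted augmentations contribute less to the concentration objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical L2-normalized embeddings of augmented inlier views.
views = rng.normal(size=(8, 32))
views /= np.linalg.norm(views, axis=1, keepdims=True)

# The inlier prototype: mean embedding, re-normalized.
prototype = views.mean(axis=0)
prototype /= np.linalg.norm(prototype)

# Soft re-weighting (illustrative variant): weight each augmented view
# by its cosine similarity to the prototype via a softmax, so views far
# from the inlier distribution are down-weighted.
tau = 0.1
sims = views @ prototype
weights = np.exp(sims / tau)
weights /= weights.sum()
```

The resulting `weights` would then scale each view's contribution to the supervised contrastive (aggregation) loss.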
3. DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration (Nan Zhou, Jiaxin Chen, and Di Huang)
The visual models pretrained on large-scale benchmarks encode general knowledge and prove effective in building more powerful representations for downstream tasks.
This paper proposes a novel fine-tuning framework, namely Distribution Regularization with Semantic Calibration (DR-Tune), which aims to mitigate over-fitting during fine-tuning. It employs distribution regularization by enforcing the downstream task head to decrease its classification error on the pretrained feature distribution, which prevents over-fitting while enabling sufficient training of the downstream encoder. Furthermore, to alleviate the interference of semantic drift, we develop the Semantic Calibration (SC) module to align the global shape and the class centers of the pretrained and downstream feature distributions. Extensive experiments on widely used image classification datasets show that DR-Tune consistently improves performance when combined with various backbones under different pretraining strategies.
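The distribution-regularization idea can be sketched as a combined objective. This is a schematic under stated assumptions, not the paper's implementation: `feat_down` stands for downstream encoder features, `feat_pre` for (calibrated) pretrained features of the same batch, and `lambda_reg` is an assumed trade-off weight; the single head `W` is trained to classify both distributions.

```python
import numpy as np

def softmax_ce(logits, labels):
    # Numerically stable cross-entropy for integer labels.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4)) * 0.1        # shared downstream task head
feat_down = rng.normal(size=(32, 16))     # downstream encoder features
feat_pre = rng.normal(size=(32, 16))      # calibrated pretrained features
y = rng.integers(0, 4, size=32)

# DR-Tune-style objective sketch: the task head fits the downstream
# features AND is regularized to stay accurate on the pretrained
# feature distribution, discouraging over-fitting to the former.
lambda_reg = 0.5
loss = softmax_ce(feat_down @ W, y) + lambda_reg * softmax_ce(feat_pre @ W, y)
```

Because the head must remain a good classifier for the pretrained distribution, it cannot drift arbitrarily toward the small downstream training set.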
4. SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection (Jinqing Zhang, Yanan Zhang, Qingjie Liu, and Yunhong Wang)
Previous BEV 3D object detection methods convert all image information to BEV space, so the large proportion of background information easily submerges the valid foreground information. In this paper, we propose a method that preserves only the foreground information. First, the image features are segmented into foreground and background; the background information is then filtered out when generating BEV features. On top of this, the BEV-Paste data augmentation strategy and the multi-scale cross-task head further improve the generalization ability and detection accuracy of the model.
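The foreground-filtering step can be sketched as a simple mask applied to image features before they are lifted to BEV. This is an illustrative simplification: the threshold `0.5` and the random `fg_prob` map are assumptions, where in SA-BEV the foreground probabilities would come from the semantic segmentation branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel image features and predicted foreground
# probabilities (in SA-BEV the latter come from a segmentation head).
H, W, C = 4, 6, 8
feats = rng.normal(size=(H, W, C))
fg_prob = rng.uniform(size=(H, W))

# Zero out pixels predicted as background before lifting to BEV, so
# background features do not dilute the BEV representation.
mask = fg_prob > 0.5
fg_feats = feats * mask[..., None]
```

Only the surviving foreground features would then be splatted into the BEV grid, keeping the representation semantic-aware.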