Three papers from the lab have been accepted by the International Joint Conference on Artificial Intelligence, IJCAI-ECAI 2022.

Three papers from the IRIP Lab have been accepted by the International Joint Conference on Artificial Intelligence, IJCAI-ECAI 2022, this year! IJCAI is one of the premier academic conferences in artificial intelligence. It was originally held in odd-numbered years and has been held annually since 2015; this year, IJCAI and ECAI are being held jointly. According to the official IJCAI website, the conference received 4,535 main-track submissions, with an acceptance rate of only 15%. The conference will be held in Vienna in July 2022.


1. SparseTT: Visual Tracking with Sparse Transformers. (Zhihong Fu, Zehua Fu, Qingjie Liu, Wenrui Cai, Yunhong Wang) (long oral presentation)


   Transformers have been successfully applied to the visual tracking task and have significantly improved tracking performance. The self-attention mechanism, designed to model long-range dependencies, is the key to the success of Transformers. However, self-attention does not focus on the most relevant information in the search regions, making it prone to distraction by the background. This paper alleviates this issue with a sparse attention mechanism that concentrates on the most relevant information in the search regions, enabling much more accurate tracking. Furthermore, this paper introduces a double-head predictor to boost the accuracy of foreground-background classification and the regression of target bounding boxes, which further improves tracking performance. Extensive experiments show that, without bells and whistles, our method significantly outperforms state-of-the-art approaches on LaSOT, GOT-10k, TrackingNet, and UAV123 while running at 40 FPS. Notably, the training time of our method is reduced by 75% compared to that of TransT.
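   The sparse attention idea above can be illustrated with a generic top-k sketch: each query keeps only its k highest-scoring keys and masks out the rest before the softmax. This is a minimal illustration of sparse attention in general, not SparseTT's exact formulation; the function name, shapes, and `topk` value here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, topk=2):
    """Generic top-k sparse attention sketch (not SparseTT's exact module):
    for each query, keep only the top-k attention scores and mask the rest
    to -inf before the softmax, so low-relevance keys get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (Nq, Nk) scaled dot products
    kth = np.sort(scores, axis=-1)[:, -topk][:, None]  # k-th largest score per query
    masked = np.where(scores >= kth, scores, -np.inf)  # drop everything below it
    return softmax(masked, axis=-1) @ v
```

   With dense attention every key receives some weight; with the mask above, background keys scoring below the k-th largest contribute exactly zero to the output.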

2. PACE: Predictive and Contrastive Embedding for Unsupervised Action Segmentation. (Jiahao Wang, Jie Qin, Yunhong Wang, Annan Li)


 Action segmentation, i.e., inferring the temporal positions of human actions in an untrimmed video, is an important prerequisite for various video understanding tasks. Recently, unsupervised action segmentation (UAS) has emerged as a more challenging task due to the unavailability of frame-level annotations. Existing clustering- or prediction-based UAS approaches suffer from either over-segmentation or overfitting, leading to unsatisfactory results. To address these problems, we propose Predictive And Contrastive Embedding (PACE), a unified UAS framework leveraging both predictability and similarity information for more accurate action segmentation. On the basis of an auto-regressive transformer encoder, predictive embeddings are learned by exploiting the predictability of video context, while contrastive embeddings are generated by leveraging the similarity of adjacent short video clips. Extensive experiments on three challenging benchmarks demonstrate the superiority of our method, with up to 26.9% improvement in F1-score over the state of the art.
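 The contrastive side of PACE can be illustrated with a standard InfoNCE-style loss, under the assumption that embeddings of adjacent short clips act as positive pairs while other clips in the batch act as negatives. This is a generic sketch of contrastive embedding learning, not the paper's exact objective; the function name and temperature are hypothetical.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Simplified InfoNCE sketch (not PACE's exact loss): each anchor's
    positive is the embedding of an adjacent clip; all other clips in
    the batch serve as negatives. Embeddings are L2-normalized so the
    logits are cosine similarities scaled by a temperature."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the matching (diagonal) entries are the positive pairs
    return -np.mean(np.diag(log_prob))
```

 Minimizing this loss pulls adjacent-clip embeddings together and pushes the rest of the batch apart, which is the "similarity of adjacent short video clips" signal described above.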

3. Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement. (Bing Li, Jiaxin Chen, Dongming Zhang, Xiuguo Bao, Di Huang)


    This paper proposes a feature representation learning method for compressed video action recognition based on motion cue enhancement, denoising, and multi-modality interaction at multiple scales. Richer motion details are introduced through a multi-scale block design, and a denoising module embedded within the multi-scale blocks denoises the coarse compressed motion modalities, thereby enhancing them. Finally, the static features (I-frames) and dynamic features (motion vectors and residuals) of compressed videos at different levels are interactively fused by a global multi-modality attention module and a local spatio-temporal attention module, which adjust the relative importance of the modalities for different actions and thus improve the final performance of the model. Experiments on the Kinetics-400, HMDB-51, and UCF-101 datasets demonstrate its superiority and effectiveness.
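    The global modality-attention fusion described above can be sketched generically: pooled features from each modality (I-frames, motion vectors, residuals) are scored against a query vector, the scores are softmax-normalized, and the modalities are fused by a weighted sum. In the paper this scoring is learned; in the sketch below a fixed context vector stands in for that learned query, and all names and shapes are hypothetical.

```python
import numpy as np

def modality_attention_fusion(feats, context):
    """Generic attention-based multi-modal fusion sketch (not the paper's
    exact module). feats: (M, D) pooled features, one row per modality;
    context: (D,) query vector standing in for a learned scorer.
    Returns the fused (D,) feature and the per-modality weights."""
    scores = feats @ context / np.sqrt(feats.shape[1])  # one score per modality
    w = np.exp(scores - scores.max())
    w = w / w.sum()                                     # softmax over modalities
    fused = (w[:, None] * feats).sum(axis=0)            # weighted sum
    return fused, w
```

    The softmax weights play the role of "adjusting the relative importance of the modalities": for an action dominated by motion, a learned scorer would push weight toward the motion-vector row.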