Y. Lou et al., "Compact Deep Invariant Descriptors for Video Retrieval," 2017 Data Compression Conference (DCC), Snowbird, UT, 2017, pp. 420-429. doi: 10.1109/DCC.2017.31
Abstract:
With emerging demand for large-scale video analysis, the Moving Picture Experts Group (MPEG) initiated the Compact Descriptor for Video Analysis (CDVA) standardization in 2014. In this work, we develop novel deep-learning features and incorporate them into the well-established CDVA evaluation framework to study their effectiveness in video analysis. In particular, we propose a Nested Invariance Pooling (NIP) method to obtain compact and robust Convolutional Neural Network (CNN) descriptors. The CNN descriptors are generated by applying three different pooling operations to the CNN feature maps in a nested way, toward a rotation- and scale-invariant feature representation. We also present the rationale, advantages, and performance of combining CNN and handcrafted descriptors, to better investigate the complementary effects of deep-learned and handcrafted features. Extensive experimental results show that the proposed CNN descriptors outperform both state-of-the-art CNN descriptors and the canonical handcrafted descriptors adopted in the CDVA Experimental Model (CXM), with significant mAP gains of 11.3% and 4.7%, respectively. Moreover, the combination of NIP-derived deep invariant descriptors and handcrafted descriptors not only fulfills the lowest bitrate budget of CDVA, but also significantly advances the performance of CDVA core techniques.
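The nested pooling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the three pooling operations or the order in which they are nested, so everything specific here is an assumption made for illustration: root-mean-square pooling over spatial positions, then average pooling over scales, then max pooling over rotations. The function names (`nested_pool`, `descriptor`) and the input layout are hypothetical, and the feature maps are plain Python lists rather than real CNN activations.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of numbers (assumed stage-1 pooling)."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


def mean(values):
    """Arithmetic mean of a sequence of numbers (assumed stage-2 pooling)."""
    values = list(values)
    return sum(values) / len(values)


def nested_pool(feature_maps):
    """Pool one channel's activations into a single value.

    feature_maps[r][s] is a 2D activation map (list of rows) computed from
    the input rotated by rotation index r and rescaled by scale index s.
    The choice and order of the three pooling operations is an assumption.
    """
    per_rotation = []
    for maps_for_rotation in feature_maps:
        per_scale = []
        for fmap in maps_for_rotation:
            # Stage 1: pool over spatial positions (translation).
            per_scale.append(rms(v for row in fmap for v in row))
        # Stage 2: pool over scales.
        per_rotation.append(mean(per_scale))
    # Stage 3: pool over rotations.
    return max(per_rotation)


def descriptor(channels):
    """Build a descriptor with one pooled value per channel."""
    return [nested_pool(feature_maps) for feature_maps in channels]


# Toy input: one channel, 2 rotations x 2 scales, each a 2x2 map.
toy_channel = [
    [[[3.0, 4.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]],
    [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]],
]
toy_descriptor = descriptor([toy_channel])
```

Because each stage pools over the whole set of transformed inputs, the pooled value does not depend on the order in which the rotated or rescaled copies are listed, which is the sense in which such a nested scheme pushes the descriptor toward invariance. The resulting vector has one entry per channel, regardless of how many rotations, scales, or spatial positions were pooled.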