Entropy guided attention network for weakly-supervised action localization

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/18312

Title:

Entropy guided attention network for weakly-supervised action localization

Journal Title:

Pattern Recognition

DOI:

10.1016/j.patcog.2022.108718

Publication URL:

http://dx.doi.org/10.1016/j.patcog.2022.108718

Authors:

Yi Cheng, Ying Sun, Hehe Fan, Tao Zhuo, Joo-Hwee Lim, Mohan Kankanhalli

Keywords:

Software, Computer Vision and Pattern Recognition, Artificial Intelligence, Signal Processing

Publication Date:

18 April 2022

Citation:

Cheng, Y., Sun, Y., Fan, H., Zhuo, T., Lim, J.-H., & Kankanhalli, M. (2022). Entropy guided attention network for weakly-supervised action localization. Pattern Recognition, 129, 108718. https://doi.org/10.1016/j.patcog.2022.108718

Abstract:

One major challenge of Weakly-supervised Temporal Action Localization (WTAL) is to handle diverse backgrounds in videos. To model background frames, most existing methods treat them as an additional action class. However, because background frames usually do not share common semantics, squeezing all the different background frames into a single class hinders network optimization. Moreover, the network would be confused and tends to fail when tested on videos with unseen background frames. To address this problem, we propose an Entropy Guided Attention Network (EGA-Net) to treat background frames as out-of-domain samples. Specifically, we design a two-branch module, where a domain branch detects whether a frame is an action by learning a class-agnostic attention map, and an action branch recognizes the action category of the frame by learning a class-specific attention map. By aggregating the two attention maps to model the joint domain-class distribution of frames, our EGA-Net can handle varying backgrounds. To train the class-agnostic attention map with only the video-level class labels, we propose an Entropy Guided Loss (EGL), which employs entropy as the supervision signal to distinguish action and background. Moreover, we propose a Global Similarity Loss (GSL) to enhance the action-specific attention map via action class center. Extensive experiments on THUMOS14, ActivityNet1.2 and ActivityNet1.3 datasets demonstrate the effectiveness of our EGA-Net.
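The abstract's central idea, using entropy to separate action frames from background frames, can be illustrated with a small sketch. This is not the authors' implementation of EGL or EGA-Net. It assumes that a frame whose class prediction is nearly uniform (high entropy) is likely background, and it uses `1 - normalized entropy` as a hypothetical stand-in for the class-agnostic attention value. The function names and the example scores are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(C), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def joint_scores(class_scores, domain_attention):
    """Combine a class-agnostic weight with class-specific probabilities.

    Assumption: the joint domain-class score is modelled as a simple
    product, p(action domain) * p(class | frame).
    """
    return [domain_attention * p for p in softmax(class_scores)]


# Hypothetical per-frame class scores for a 3-class problem.
frames = {
    "confident frame": [6.0, 0.5, 0.2],  # one class dominates
    "ambiguous frame": [1.0, 1.0, 1.0],  # no class stands out
}

for name, scores in frames.items():
    h = normalized_entropy(softmax(scores))
    attention = 1.0 - h  # assumption: low entropy suggests an action frame
    joint = joint_scores(scores, attention)
    print(name, round(h, 3), [round(j, 3) for j in joint])
```

Under these assumptions the ambiguous frame receives an attention value near zero, so its joint scores are suppressed for every class without needing a dedicated background class.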

License type:

Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Funding Info:

This research is supported by the Agency for Science, Technology and Research under AME Programmatic Funding.
Grant Reference No.: A18A2b0046

URI:

https://oar.a-star.edu.sg/communities-collections/articles/18312

ISSN:

0031-3203

Collections:

Institute for Infocomm Research

Files uploaded:

Manuscripts in This Item:

File	Size	Format
yicheng-patternrecogtnion-wtal-revised1-2-cameraready.pdf	3.07 MB	PDF