Huang, Y., Hechen, Z., Zhou, M., Li, Z., & Kwong, S. (2025). An Attention-Locating Algorithm for Eliminating Background Effects in Fine-Grained Visual Classification. IEEE Transactions on Circuits and Systems for Video Technology, 35(6), 5993–6006. https://doi.org/10.1109/tcsvt.2025.3535818
Abstract:
Fine-grained visual classification (FGVC) is a challenging task characterized by inter-class similarity and intra-class diversity, and it has broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) for FGVC tasks, since the data specificity of the multi-head self-attention (MSA) mechanism in ViT is beneficial for extracting discriminative feature representations. However, these works focus on integrating feature dependencies at a high level, which leaves the model easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT contains a two-stage framework to identify crucial regions within images effectively and to enhance features by strategically reusing parameters. Second, the ASM accurately locates important target regions via the natural attention scores of the MSA and, through position mapping, extracts finer low-level features to provide more comprehensive information. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms other methods, confirming the effectiveness of our proposed approach. The source code is available at https://github.com/Yueting-Huang/FAL-ViT.
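The abstract's core idea of locating target regions from the MSA's own attention scores and mapping them back to image positions can be illustrated with a minimal sketch. The code below assumes a standard ViT layout (CLS token first, a 14x14 patch grid for a 224x224 input with 16x16 patches); the function names and the top-k selection heuristic are illustrative assumptions, not the authors' actual ASM, which is available in the linked repository.

```python
import torch

def select_patches_by_attention(attn: torch.Tensor, k: int = 12) -> torch.Tensor:
    """Pick the k patch indices the CLS token attends to most.

    attn: (B, H, N, N) attention weights from a ViT MSA layer, where
          token 0 is the CLS token and tokens 1..N-1 are patch tokens.
    Returns a (B, k) tensor of patch indices (0-based, CLS excluded).
    """
    # Average over heads, then take CLS-to-patch attention as a saliency score.
    cls_to_patch = attn.mean(dim=1)[:, 0, 1:]        # (B, N-1)
    return cls_to_patch.topk(k, dim=-1).indices      # (B, k)

def patch_index_to_pixel_box(idx: int, grid: int = 14, patch: int = 16):
    """Map a flat patch index back to its pixel box in the input image,
    a simple form of the position mapping described in the abstract."""
    row, col = idx // grid, idx % grid
    return (col * patch, row * patch, (col + 1) * patch, (row + 1) * patch)
```

Under these assumptions, the selected boxes could be used to crop regions from the input (or index into low-level feature maps) for a second, finer classification stage, mirroring the two-stage design the abstract describes.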
License type:
Publisher Copyright
Funding Info:
No specific funding was received for this research.