
CAI Daili, XIE Weibo. FMSWFormer: Visual Transformer with Frequency Separation and Adaptive Multi-Scale Window Attention [J]. Journal of Jimei University (Natural Science Edition), 2023, 28(6): 568-576.

FMSWFormer: Visual Transformer with Frequency Separation and Adaptive Multi-Scale Window Attention

Journal of Jimei University (Natural Science Edition) [ISSN: 1007-7405 / CN: 35-1186/N]

Volume:
28
Issue:
No. 6, 2023
Pages:
568-576
Section:
Mathematical and Physical Sciences and Information Engineering
Publication date:
2023-11-28

Article Info

Title:
FMSWFormer:Visual Transformer with Frequency Separation and Adaptive Multi-Scale Window Attention
Author(s):
CAI Daili, XIE Weibo
College of Computer Science and Technology, Huaqiao University, Xiamen 361021, China
Keywords:
window self-attention; multi-scale feature extraction; deep learning; convolutional neural networks; image high- and low-frequency decoupling
CLC number:
-
DOI:
-
Document code:
A
Abstract:
The Vision Transformer's quadratic complexity in the number of patches and its weak local inductive bias mean that it requires large amounts of data, more sophisticated data-augmentation strategies, and additional training tricks to surpass efficient convolutional networks. To address these issues, this paper approaches the problem from the perspectives of multi-scale feature extraction and image frequency, and proposes FMSWFormer, a visual Transformer with a lightweight attention mechanism. FMSWFormer employs convolution/self-attention hybrid modules to establish communication between different frequency components, and implements local attention through window partitioning to constrain excessive computational cost. A main contribution is the incorporation of multi-scale operators into the self-attention computation; inspired by adaptive scale-aware convolution, this gives the multi-head self-attention mechanism an adaptive scale-perception capability. In this way, FMSWFormer combines the strong local inductive bias of CNNs with the dynamic long-range dependency modeling of Transformers. Extensive experiments on various benchmark recognition datasets demonstrate the effectiveness of FMSWFormer, which achieves superior performance on multiple datasets without increasing time cost. Notably, on CIFAR100, FMSWFormer outperforms SepViT by 4.2% while reducing latency by 47.8%, and with 22% fewer parameters than EfficientNetV2 it still surpasses it by 3.94%.
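The sketch below is a minimal, illustrative PyTorch rendering of the two ideas named in the abstract, not the authors' implementation: decoupling a feature map into low- and high-frequency parts, and multi-head self-attention restricted to local windows at more than one window size. All module and parameter names here (FrequencySplit, WindowAttention, MultiScaleWindowAttention, pool_size, window_sizes) are assumptions for illustration; the adaptive scale selection and the full FMSWFormer block structure described in the paper are not reproduced.

```python
# Illustrative sketch only; module names and design details are assumptions,
# not the FMSWFormer reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrequencySplit(nn.Module):
    """Decouple a feature map into low- and high-frequency components.

    Low frequency: a pooled (blurred) copy of the input, upsampled back.
    High frequency: the residual between the input and its low-frequency part.
    """

    def __init__(self, pool_size: int = 2):
        super().__init__()
        self.pool_size = pool_size

    def forward(self, x: torch.Tensor):
        # x: (B, C, H, W)
        low = F.avg_pool2d(x, self.pool_size)
        low_up = F.interpolate(low, size=x.shape[-2:], mode="nearest")
        high = x - low_up
        return low_up, high


class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping windows."""

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H and W divisible by window_size
        B, C, H, W = x.shape
        w = self.window_size
        # Partition into (B * num_windows, w*w, C) token sequences.
        x = x.view(B, C, H // w, w, W // w, w)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        out, _ = self.attn(x, x, x)
        # Reverse the window partition back to (B, C, H, W).
        out = out.view(B, H // w, W // w, w, w, C)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return out


class MultiScaleWindowAttention(nn.Module):
    """Run window attention at several window sizes and fuse the results.

    Hypothetical stand-in for the paper's adaptive multi-scale window
    attention; here the scales are simply averaged rather than adaptively
    weighted.
    """

    def __init__(self, dim: int, num_heads: int, window_sizes=(4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            WindowAttention(dim, num_heads, w) for w in window_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.stack([b(x) for b in self.branches]).mean(dim=0)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)          # (B, C, H, W)
    low, high = FrequencySplit()(feat)
    msa = MultiScaleWindowAttention(dim=64, num_heads=4)
    print(msa(high).shape)                     # torch.Size([2, 64, 32, 32])
```

Restricting attention to windows of w*w tokens keeps the cost linear in the number of windows rather than quadratic in the total number of patches, which is the computational motivation the abstract gives for window partitioning.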



Memo:
Last Update: 2024-02-29