Title: STA-TAD: Spatial-Temporal Adapter on ViT for Temporal Action Detection
Conference: CASA 2025
Tags: Human Motion, Spatial-Temporal Adapter, Temporal Action Detection, ViT

Abstract: Temporal Action Detection (TAD) aims to localize all action instances in a long untrimmed video and recognize their categories, and it plays an essential role in long-term video understanding. Recently, TAD has achieved significant performance improvements through end-to-end training that adapts pre-trained Vision Transformer (ViT) models. However, a memory bottleneck limits the use of powerful video models, which inevitably restricts TAD performance. In this paper, we present a novel Adapter-based method, a typical parameter-efficient fine-tuning (PEFT) technique, to address this issue. The key to our approach is the proposed spatial-temporal adapter (STA), a novel lightweight module. By updating only the parameters in STA during end-to-end training, the backbone can adapt to the TAD task. In addition, STA leads to better TAD representations through its sparse self-attention design, which exploits local temporal information. We evaluate our model on three benchmark datasets: THUMOS14, ActivityNet-1.3, and Charades. The proposed STA-TAD outperforms state-of-the-art (SOTA) methods in most cases.
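The abstract describes two ingredients of the STA module: a lightweight bottleneck adapter whose parameters are the only ones updated during fine-tuning, and a sparse self-attention that attends only within a local temporal window. The sketch below illustrates that combination in plain numpy; the bottleneck width, window size, and zero-initialized up-projection are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SpatialTemporalAdapter:
    """Illustrative adapter: bottleneck projections plus local (sparse)
    temporal self-attention over frame-level tokens. Hypothetical sketch,
    not the authors' implementation."""

    def __init__(self, dim, bottleneck, window, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        self.W_down = rng.normal(0, scale, (dim, bottleneck))
        # zero-init up-projection: the adapter starts as an identity map,
        # so the frozen backbone's behavior is preserved at initialization
        self.W_up = np.zeros((bottleneck, dim))
        self.W_q = rng.normal(0, scale, (bottleneck, bottleneck))
        self.W_k = rng.normal(0, scale, (bottleneck, bottleneck))
        self.W_v = rng.normal(0, scale, (bottleneck, bottleneck))
        self.window = window

    def __call__(self, x):
        # x: (T, dim) temporal sequence of tokens from the frozen ViT
        h = x @ self.W_down                              # (T, bottleneck)
        q, k, v = h @ self.W_q, h @ self.W_k, h @ self.W_v
        T, d = h.shape
        scores = (q @ k.T) / np.sqrt(d)                  # (T, T)
        # sparse self-attention: mask frames outside a local temporal window
        idx = np.arange(T)
        scores[np.abs(idx[:, None] - idx[None, :]) > self.window] = -1e9
        h = softmax(scores) @ v
        return x + h @ self.W_up                         # residual connection
```

In a PEFT setup, only the five `W_*` matrices would receive gradients while the backbone stays frozen; because `W_up` is zero-initialized, the adapter is a no-op before training and gradually learns a local-temporal refinement of the ViT features.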