Title: STA-TAD: Spatial-Temporal Adapter on ViT for Temporal Action Detection
Conference: CASA 2025
Tags: Human Motion, Spatial-Temporal Adapter, Temporal Action Detection, ViT

Abstract: Temporal Action Detection (TAD) aims to localize all action instances in a long untrimmed video and recognize their categories, and it plays an essential role in long-term video understanding. Recently, TAD has achieved significant performance improvements through end-to-end training that adapts pre-trained Vision Transformer (ViT) models. However, a memory bottleneck limits the use of powerful video models, which inevitably restricts TAD performance. In this paper, we present a novel Adapter-based method, a typical parameter-efficient fine-tuning (PEFT) technique, to address this issue. The key to our approach is the proposed spatial-temporal adapter (STA), a novel lightweight module. By updating only the parameters in STA during end-to-end training, the backbone can adapt to the TAD task. In addition, STA leads to better TAD representations through its sparse self-attention design, which exploits local temporal information. We evaluate our model on three benchmark datasets: THUMOS14, ActivityNet-1.3, and Charades. The proposed STA-TAD outperforms state-of-the-art (SOTA) methods in most cases.
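The abstract describes two ingredients of the STA module: a lightweight bottleneck adapter whose parameters are the only ones updated during fine-tuning, and a sparse self-attention that attends only within a local temporal window. The sketch below illustrates that combination in plain numpy; the bottleneck width, window size, and zero-initialized up-projection are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SpatialTemporalAdapter:
    """Illustrative adapter: bottleneck projections plus local (sparse)
    temporal self-attention over frame-level tokens. Hypothetical sketch,
    not the authors' implementation."""

    def __init__(self, dim, bottleneck, window, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        self.W_down = rng.normal(0, scale, (dim, bottleneck))
        # zero-init up-projection: the adapter starts as an identity map,
        # so the frozen backbone's behavior is preserved at initialization
        self.W_up = np.zeros((bottleneck, dim))
        self.W_q = rng.normal(0, scale, (bottleneck, bottleneck))
        self.W_k = rng.normal(0, scale, (bottleneck, bottleneck))
        self.W_v = rng.normal(0, scale, (bottleneck, bottleneck))
        self.window = window

    def __call__(self, x):
        # x: (T, dim) temporal sequence of tokens from the frozen ViT
        h = x @ self.W_down                              # (T, bottleneck)
        q, k, v = h @ self.W_q, h @ self.W_k, h @ self.W_v
        T, d = h.shape
        scores = (q @ k.T) / np.sqrt(d)                  # (T, T)
        # sparse self-attention: mask frames outside a local temporal window
        idx = np.arange(T)
        scores[np.abs(idx[:, None] - idx[None, :]) > self.window] = -1e9
        h = softmax(scores) @ v
        return x + h @ self.W_up                         # residual connection
```

In a PEFT setup, only the five `W_*` matrices would receive gradients while the backbone stays frozen; because `W_up` is zero-initialized, the adapter is a no-op before training and gradually learns a local-temporal refinement of the ViT features.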