MS-GTR: Multi-Stream Graph Transformer for Skeleton-Based Action Recognition
Tags: Action recognition, Graph, Skeleton-based, Transformer
Abstract:
Skeleton-based action recognition has made great strides with the use of graph convolutional networks (GCNs) to model correlations among body joints. However, GCNs have difficulty establishing long-term dependencies and are constrained by the natural connections of human body joints. To overcome these issues, we propose a Graph relative TRansformer (GTR) that captures temporal features through a learnable topology and invariant joint adjacency graphs. The GTR provides a high-level representation of the spatial skeleton structure that integrates naturally with the temporal sequence. Moreover, a Multi-Stream Graph Transformer (MS-GTR) is introduced to integrate multiple forms of dynamic information for end-to-end human action recognition. The MS-GTR adopts a dual-branch structure, where the GTR serves as the master branch to extract joint-level and bone-level features, while an auxiliary branch processes lightweight kinematic content. Finally, a cross-attention mechanism links the master and auxiliary branches so that their information complements each other in stages. Experimental results on the HDM05 and NTU RGB+D datasets demonstrate the potential of the proposed MS-GTR model for improving action recognition.
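To make the dual-branch coupling concrete, the following is a minimal PyTorch sketch of a cross-attention link between a master and an auxiliary feature stream. The module name, feature dimensions, and the choice of drawing queries from the master branch and keys/values from the auxiliary branch are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossBranchAttention(nn.Module):
    """Hypothetical cross-attention link between two branches (sketch only)."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, master: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
        # Queries from the master branch attend over auxiliary-branch features,
        # so the lightweight kinematic stream complements the main stream.
        out, _ = self.attn(query=master, key=auxiliary, value=auxiliary)
        # Residual connection preserves the master-branch signal.
        return self.norm(master + out)

# Example: batch of 8 sequences, 64 frames, 128-dim features per frame.
master = torch.randn(8, 64, 128)
auxiliary = torch.randn(8, 64, 128)
fused = CrossBranchAttention()(master, auxiliary)
print(fused.shape)  # torch.Size([8, 64, 128])
```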