摘要

Recently, graph convolutional networks (GCNs) play a critical role in skeleton-based human action recognition. However, most GCN-based methods still have two main limitations: (1) The semantic-level adjacency matrix of the skeleton graph is difficult to be manually defined, which restricts the perception field of GCN and limits its ability to extract the spatial-temporal features. (2) The velocity information of human body joints cannot be efficiently used and fully exploited by GCN, because GCN does not represent the correlation between the velocity vectors explicitly. To address these issues, we propose a graph-aware transformer (GAT), which can make full use of the velocity information and learn discriminative spatial-temporal motion features from the sequence of the skeleton graphs in a data-driven way. Besides, similar to the GCN-based model, our GAT also considers the prior structures of the human body including the link-aware structure and the part-aware structure. Extensive experiments on three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-Skeleton, demonstrated that the proposed GAT obtains significant improvement compared to the GCN-based baseline for skeleton action recognition.

  • 单位
    武汉大学