Video action recognition with Key-detail Motion Capturing based on motion spectrum analysis and multiscale feature fusion
Abstract
Existing action recognition methods still perform poorly when most of the video content is redundant (for example, clips without any object motion) and the human actions in the video are complex. There are two main reasons. (1) Most methods pay little attention to the key-motion information in the video, so irrelevant information is fed into the model. (2) Spatial and temporal information interact too little, which may cause detailed motion information in the video to be lost. In this paper, we propose a Key-detail Motion Capturing Network (K-MCN) to address these problems. It contains two modules. The first is the Video Key-motion Spectrum Analyzer (VKSA) module, which applies frequency spectrum analysis, filtering, and clustering to the video optical flow to extract the key-motion frames. The second is the Multiscale Motion Spatiotemporal Interaction module, which models and fuses, at multiple scales, the spatial and temporal features extracted from the key-motion frames, so that multiscale spatiotemporal information can interact and complement each other. Extensive experiments on the UCF101, HMDB51, and Something-Something V1 datasets show that our method achieves better performance than other state-of-the-art methods.
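
To make the key-motion frame extraction idea more concrete, the following is a minimal sketch, not the VKSA module itself. It rests on several assumptions that do not come from the paper: per-frame motion is approximated by absolute frame differences rather than optical flow, the spectral filter is a plain low-pass cut on the FFT of that motion signal, and the clustering step is replaced by splitting the high-motion frames into contiguous temporal groups. The function names `motion_energy`, `lowpass`, and `select_key_frames`, and the parameters `num_keys` and `keep_ratio`, are hypothetical.

```python
import numpy as np


def motion_energy(frames):
    """Per-frame motion energy from absolute frame differences.

    Assumption: frame differencing is a cheap stand-in for optical flow.
    `frames` has shape (T, H, W); the result has length T.
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.abs(np.diff(frames, axis=0))               # (T-1, H, W)
    energy = diffs.reshape(len(diffs), -1).mean(axis=1)   # (T-1,)
    return np.concatenate(([0.0], energy))                # first frame has no predecessor


def lowpass(signal, keep_ratio=0.25):
    """Zero out all but the lowest `keep_ratio` fraction of frequency bins."""
    spectrum = np.fft.rfft(signal)
    cutoff = max(1, int(len(spectrum) * keep_ratio))
    spectrum[cutoff:] = 0
    return np.fft.irfft(spectrum, n=len(signal))


def select_key_frames(frames, num_keys=4, keep_ratio=0.25):
    """Return up to `num_keys` frame indices with high smoothed motion energy."""
    smooth = lowpass(motion_energy(frames), keep_ratio)
    # Candidate frames: smoothed motion energy above its mean.
    candidates = np.flatnonzero(smooth > smooth.mean())
    if candidates.size == 0:                               # e.g. a fully static clip
        candidates = np.arange(len(smooth))
    # Assumption: contiguous temporal groups stand in for a real clustering step.
    groups = np.array_split(candidates, num_keys)
    # Keep the highest-energy frame of each non-empty group.
    return [int(g[np.argmax(smooth[g])]) for g in groups if g.size > 0]
```

On a clip whose first half is static and whose second half contains a moving object, this sketch would be expected to return indices from the second half only, which mirrors the stated goal of discarding redundant frames before feature extraction.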
