Abstract
Most existing human pose estimation methods improve accuracy by continually increasing computational cost. However, balancing efficiency and efficacy is key to a model's value in real-world applications. In this work, we present a Fast and Effective Transformer (FET) that addresses both. FET consists of three parts: a Feature Extraction Module (FEM), a Feature Interaction Module (FIM), and a Feature Decode Module (FDM). The FEM efficiently extracts low-level features from input images. Unlike CNN-based strategies, the FIM captures global dependencies through self-attention, which improves the accuracy of human pose estimation. The FDM recovers the feature resolution gradually over multiple stages to produce a higher-quality target heatmap. In addition, we introduce Feature Squeeze Attention into FET to further improve overall performance. Extensive experiments show that our method is 1.7x faster than SimpleBaseline and 7x faster than HRNet-32, while achieving results comparable to or better than most state-of-the-art methods on the COCO and MPII datasets.
