摘要
Visual scenes containing text in the TextVQA task require a simultaneous understanding of images, questions, and text in images to reason answers. However, most existing cross-modal tasks merely involve two modalities. There are thus few methods for modeling interactions across three modalities. To bridge this gap, we propose in this work cross- and intra-modal interaction modules for multiple (more than two) modalities, where scaled dot-product attention method is applied to model inter- and intra-modal relationship. In addition, we introduce guidance information to assist the attention method to learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task via stacking multiple blocks composed of our proposed interaction modules. We design a multi-level feature joint prediction approach to exploit output representations from each block in a complementary way to predict answers. The experimental results on the TextVQA dataset show that our model obtains a 5.42% improvement in accuracy more than the baseline. Extensive ablation studies are carried out for the comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
-
单位Huazhong University of Science and Technology