Multi-level, multi-modal interactions for visual question answering over text in images

Authors: Chen, Jincai; Zhang, Sheng; Zeng, Jiangfeng*; Zou, Fuhao; Li, Yuan-Fang; Liu, Tao; Lu, Ping
Source: WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2022, 25(4): 1607-1623.
DOI: 10.1007/s11280-021-00976-2

Abstract

Answering questions about visual scenes that contain text (the TextVQA task) requires a simultaneous understanding of images, questions, and the text embedded in images in order to reason about answers. However, most existing cross-modal methods involve only two modalities, so there are few approaches for modeling interactions across three modalities. To bridge this gap, we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which the scaled dot-product attention mechanism is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of our proposed interaction modules. We also design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model achieves a 5.42% improvement in accuracy over the baseline. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
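The cross-modal interaction described in the abstract is built on standard scaled dot-product attention, where queries come from one modality and keys/values from another. The following sketch illustrates that generic mechanism under assumed names and dimensions (e.g., question-word features attending to OCR-token features); it is not the authors' implementation and omits the guidance term, so refer to the linked repository for the actual MLCI modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Illustrative scaled dot-product attention between two modalities.

    Queries come from one modality (e.g., question words) and keys/values
    from another (e.g., OCR-token or visual-object features). This is a
    generic sketch, not the MLCI authors' exact module.
    """

    def __init__(self, query_dim: int, context_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, hidden_dim)
        self.k_proj = nn.Linear(context_dim, hidden_dim)
        self.v_proj = nn.Linear(context_dim, hidden_dim)
        self.scale = hidden_dim ** -0.5

    def forward(self, query_feats, context_feats, context_mask=None):
        # query_feats:   (batch, n_query, query_dim)
        # context_feats: (batch, n_context, context_dim)
        q = self.q_proj(query_feats)
        k = self.k_proj(context_feats)
        v = self.v_proj(context_feats)

        # Scaled dot-product attention scores: (batch, n_query, n_context)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        if context_mask is not None:
            # Mask out padded context positions before the softmax.
            scores = scores.masked_fill(context_mask.unsqueeze(1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        # Each query position aggregates context features by its attention weights.
        return torch.matmul(attn, v)


# Example usage with random tensors standing in for question and OCR embeddings.
if __name__ == "__main__":
    question = torch.randn(2, 14, 768)    # 14 question-word features per sample
    ocr_tokens = torch.randn(2, 50, 300)  # 50 OCR-token features per sample
    attn = CrossModalAttention(query_dim=768, context_dim=300)
    fused = attn(question, ocr_tokens)
    print(fused.shape)  # torch.Size([2, 14, 512])
```

In a multi-level design such as MLCI, blocks of this kind of cross- and intra-modal attention can be stacked, with each block's output feeding the next and also contributing to the final answer prediction.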

  • Affiliation
    Huazhong University of Science and Technology

Full text