Abstract

Modern models have shown impressive capabilities in visual reasoning tasks. However, their decision-making processes remain difficult to interpret, which leaves their reliability uncertain. In response, we present the Multimodal Selective Generation Network (MSGeN), a novel approach to enhancing interpretability and transparency in visual reasoning. MSGeN generates explanations that integrate information from multiple modalities, providing a comprehensive and intuitive account of its decisions. The model consists of five collaborating components: (1) the Multimodal Encoder, which encodes and fuses the input data; (2) the Reasoner, which generates stepwise inference states; (3) the Selector, which selects the modality for each step's explanation; (4) the Speaker, which generates natural language descriptions; and (5) the Pointer, which produces visual cues. Together, these components generate explanations that combine natural language context with visual cues. Extensive experiments demonstrate that MSGeN surpasses existing multimodal explanation generation models across various metrics, including BLEU, METEOR, ROUGE, CIDEr, SPICE, and Grounding. We also present detailed visual examples and case studies that illustrate MSGeN's ability to generate comprehensive and coherent explanations.
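The data flow among the five components can be illustrated with a minimal sketch. It is not the authors' implementation: the function `explain`, the `ExplanationStep` record, the modality labels `"text"` and `"visual"`, the bounding-box output format, and all toy stand-in components are hypothetical names chosen for illustration, and the real components are presumably learned neural modules. The sketch only shows the control flow implied by the description: encode and fuse, produce stepwise states, choose a modality per step, and then call either the Speaker or the Pointer.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical output format for a visual cue: (x, y, width, height).
BBox = Tuple[int, int, int, int]


@dataclass
class ExplanationStep:
    modality: str                 # "text" or "visual" (assumed labels)
    content: Union[str, BBox]     # sentence from the Speaker or box from the Pointer


def explain(image, question, encoder, reasoner, selector, speaker, pointer) -> List[ExplanationStep]:
    """Run one pass of the assumed pipeline and collect per-step explanations."""
    fused = encoder(image, question)          # (1) encode and fuse the inputs
    steps: List[ExplanationStep] = []
    for state in reasoner(fused):             # (2) stepwise inference states
        modality = selector(state)            # (3) pick the modality for this step
        if modality == "text":
            steps.append(ExplanationStep("text", speaker(state)))     # (4) language
        else:
            steps.append(ExplanationStep("visual", pointer(state)))   # (5) visual cue
    return steps


# Toy stand-ins so the sketch runs end to end; none of these reflect the real model.
def toy_encoder(image, question):
    return {"image": image, "question": question}


def toy_reasoner(fused):
    return [{"index": 0, "focus": "object"}, {"index": 1, "focus": "relation"}]


def toy_selector(state):
    return "visual" if state["index"] % 2 == 0 else "text"


def toy_speaker(state):
    return f"reasoning about {state['focus']}"


def toy_pointer(state):
    return (0, 0, 10, 10)


steps = explain("img", "What is left of the cup?",
                toy_encoder, toy_reasoner, toy_selector, toy_speaker, toy_pointer)
for step in steps:
    print(step.modality, step.content)
```

Passing the components as plain callables keeps the sketch independent of any particular deep learning framework; in practice each would be replaced by a trained module.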

Full Text