摘要

Sign language translation (SLT) is a challenging weakly supervised task without word-level annotations. An effective method of SLT is to leverage multimodal complementarity and to explore implicit temporal cues. In this work, we propose a graph-based multimodal sequential embedding network (MSeqGraph), in which multiple sequential modalities are densely correlated. Specifically, we build a graph structure to realize the intra-modal and inter-modal correlations. First, we design a graph embedding unit (GEU), which embeds a parallel convolution with channel-wise and temporal-wise learning into the graph convolution to learn the temporal cues in each modal sequence and cross-modal complementarity. Then, a hierarchical GEU stacker with a pooling-based skip connection is proposed. Unlike the state-of-the-art methods, to obtain a compact and informative representation of multimodal sequences, the GEU stacker gradually compresses the channel d with multi-modalities m rather than the temporal dimension t. Finally, we adopt the connectionist temporal decoding strategy to explore the entire video's temporal transition and translate the sentence. Extensive experiments on the USTC-CSL and BOSTON-104 datasets demonstrate the effectiveness of the proposed method.