Abstract
High redundancy among keyframes is a critical issue for prior summarization methods when dealing with user-created videos. To address this issue, we present a Graph Attention Network (GAT)-adjusted Bi-directional Long Short-Term Memory (Bi-LSTM) model for unsupervised video summarization. First, the GAT is adopted to transform each frame's visual features into higher-level features via a Contextual-Features-based Transformation (CFT) mechanism. Specifically, a novel Salient-Area-Size-based spatial attention model is presented to extract frame-wise visual features, based on the observation that humans tend to focus on sizable and moving objects. Second, the higher-level visual features are integrated with semantic features processed by the Bi-LSTM to refine each frame's probability of being selected as a keyframe. Extensive experiments demonstrate that our method outperforms state-of-the-art methods.
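To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the GAT-to-Bi-LSTM flow: graph attention contextualizes per-frame features, and a Bi-LSTM followed by a sigmoid head scores each frame's keyframe probability. All names (`FrameGATLayer`, `GATBiLSTMSummarizer`), the fully connected frame graph, the single attention head, and the layer sizes are illustrative assumptions, not the paper's implementation; the Salient-Area-Size spatial attention and the semantic-feature branch are omitted.

```python
# Hedged sketch only: assumes PyTorch; the fully connected frame graph,
# single attention head, and all sizes are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGATLayer(nn.Module):
    """Single-head graph attention over frame-level features (CFT-style)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x):                       # x: (T, dim), one node per frame
        h = self.proj(x)                        # (T, dim)
        T = h.size(0)
        # Pairwise attention logits over a fully connected frame graph.
        pairs = torch.cat(
            [h.unsqueeze(1).expand(T, T, -1),
             h.unsqueeze(0).expand(T, T, -1)], dim=-1)   # (T, T, 2*dim)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))   # (T, T)
        alpha = F.softmax(e, dim=-1)            # attention over neighbor frames
        return alpha @ h                        # contextualized features (T, dim)

class GATBiLSTMSummarizer(nn.Module):
    """Scores each frame's probability of being selected as a keyframe."""
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.gat = FrameGATLayer(dim)
        self.bilstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                   # feats: (T, dim) visual features
        ctx = self.gat(feats)                   # higher-level contextual features
        seq, _ = self.bilstm(ctx.unsqueeze(0))  # (1, T, 2*hidden)
        return torch.sigmoid(self.head(seq)).squeeze(-1).squeeze(0)  # (T,)

# Usage: score 120 frames of hypothetical 1024-d CNN features.
scores = GATBiLSTMSummarizer()(torch.randn(120, 1024))
```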
Affiliation: Wuhan University