Feature-Guided Spatial Attention Upsampling for Real-Time Stereo Matching Network

Abstract

In this article, we propose an end-to-end real-time stereo matching network (RTSMNet). RTSMNet consists of three modules. The global and local feature extraction (GLFE) module captures hierarchical context information and generates a coarse cost volume. The initial disparity estimation module is a compact three-dimensional convolution architecture that rapidly produces a low-resolution (LR) disparity map. The feature-guided spatial attention upsampling module takes the LR disparity map and the shared features from the GLFE module as guidance; it first estimates residual disparity values and then applies an attention mechanism to generate context-aware adaptive kernels for each upsampled pixel. The adaptive kernels assign higher attention weights to reliable areas, which can significantly reduce blurred edges and help recover thin structures. The proposed networks achieve 66 to 175 fps on a 2080Ti and 11 to 42 fps on edge computing devices, with competitive accuracy compared to state-of-the-art methods on multiple benchmarks.
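
The adaptive-kernel upsampling step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a hypothetical input `logits` that holds one unnormalized score per kernel tap for every upsampled pixel (in RTSMNet such scores would come from a learned attention branch guided by GLFE features, which is omitted here, as is the residual disparity step). The function name, the 3x3 kernel size, the scale factor of 2, and the edge padding are illustrative assumptions. Each output pixel is a softmax-weighted average of the LR disparities around its nearest LR location.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_upsample(disp_lr, logits, scale=2, k=3):
    """Upsample a low-resolution disparity map with per-pixel adaptive kernels.

    disp_lr: (H, W) LR disparity, in LR pixel units (assumption).
    logits:  (k*k, H*scale, W*scale) hypothetical unnormalized kernel scores.
    Returns: (H*scale, W*scale) disparity in upsampled pixel units.
    """
    H, W = disp_lr.shape
    pad = k // 2
    # Disparity is measured in pixels, so values grow with the resolution.
    padded = np.pad(disp_lr * scale, pad, mode="edge")
    # Normalize the k*k scores of every output pixel so they sum to 1.
    weights = softmax(logits, axis=0)
    # Nearest LR row/column for every upsampled row/column.
    ys = np.arange(H * scale) // scale
    xs = np.arange(W * scale) // scale
    out = np.zeros((H * scale, W * scale))
    tap = 0
    for dy in range(k):
        for dx in range(k):
            # LR neighbour at offset (dy - pad, dx - pad) for each output pixel.
            neigh = padded[ys[:, None] + dy, xs[None, :] + dx]
            out += weights[tap] * neigh
            tap += 1
    return out

# Usage: random scores on a constant map must reproduce the (scaled) constant.
rng = np.random.default_rng(0)
disp = np.full((4, 5), 3.0)
up = attention_upsample(disp, rng.normal(size=(9, 8, 10)))
```

Because the weights of each pixel sum to one, the output stays within the range of the neighbouring LR values; a kernel that concentrates its weight on one reliable neighbour copies that value instead of averaging across a depth edge, which is the intuition behind the sharper boundaries described above.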

Keywords

Feature extraction; Three-dimensional displays; Convolution; Stereo vision; Benchmark testing; Real-time systems; Computer architecture; computer vision; stereo matching; deep convolutional neural network; depth estimation; disparity estimation; attention mechanism; upsample method; real-time method