Abstract
Semantic segmentation is important for accurate object detection. Semantic labels are difficult to obtain for real driving scenes, but they are easy to generate in virtual datasets. This paper therefore presents an adaptive joint training strategy based on real and virtual datasets: (1) a multi-modal fusion network is built using image, depth, and semantic information; (2) a joint training strategy with data sharing between the virtual and real datasets is applied to the semantic information, and an adaptive optimizer is provided. The monocular detection network trained with this strategy shows a large improvement in effectiveness over the conventional network.
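The abstract does not specify how the adaptive optimizer balances the two domains. A minimal sketch of one plausible scheme follows, in which the real-dataset and virtual-dataset loss terms are reweighted toward whichever domain currently has the higher loss; all names and the weighting rule are illustrative assumptions, not the paper's actual method.

```python
def adaptive_weights(loss_real, loss_virtual, eps=1e-8):
    """Hypothetical rule: give the harder (higher-loss) domain more weight.

    loss_real / loss_virtual are scalar losses from the real and virtual
    (synthetic) datasets; eps avoids division by zero.
    """
    total = loss_real + loss_virtual + eps
    return loss_real / total, loss_virtual / total


def joint_loss(loss_real, loss_virtual):
    """Combine the two domain losses with the adaptive weights."""
    w_r, w_v = adaptive_weights(loss_real, loss_virtual)
    return w_r * loss_real + w_v * loss_virtual


# Example: the real-domain loss dominates, so it receives the larger weight.
w_r, w_v = adaptive_weights(2.0, 1.0)
```

In practice such a rule would be recomputed per batch so the joint training shifts emphasis as one domain converges faster than the other.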