Survey on unsupervised monocular depth estimation in dynamic scenes based on deep learning

程彬彬; 于英; 张磊; 王自全; 江志鹏


DOI: 10.11834/jrs.20233060

Received: 2023-03-02

Revised: 2023-10-23

Institute of Geospatial Information, Information Engineering University

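Abstract:

No scene in the real world is completely static. Monocular depth estimation in dynamic scenes recovers the depth of both the dynamic foreground and the static background from a single image. Compared with traditional stereo methods it is more flexible to deploy and less costly, which gives it considerable research value and broad prospects, and it plays a key role in downstream tasks such as 3D reconstruction and autonomous driving. With the rapid development of deep learning, unsupervised learning, which requires no ground-truth labels, has attracted wide research interest. Researchers in China and abroad have proposed a series of unsupervised monocular depth estimation algorithms to handle dynamic objects in scenes, laying a foundation for related research, but no study has yet analyzed these methods comprehensively. To fill this gap, this paper systematically reviews progress in deep-learning-based unsupervised monocular depth estimation in dynamic scenes. First, it summarizes the basic models of deep-learning-based unsupervised monocular depth estimation and analyzes how dynamic objects affect scene depth estimation. Second, it introduces the datasets and evaluation metrics commonly used in monocular depth estimation research and compares the performance of classic monocular depth estimation models in dynamic scenes. Then, according to how dynamic objects are handled, it summarizes and quantitatively analyzes two research directions: robust depth estimation in dynamic scenes, and dynamic object tracking and depth estimation. Finally, it discusses future directions for monocular depth estimation in dynamic scenes.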
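The robust strategy covered by this survey, in which dynamic pixels are treated as outliers so that only the static background supervises training, can be illustrated with a masked photometric loss. The following is a minimal sketch, not any surveyed method: the function name `masked_photometric_loss` is hypothetical, the error is a plain L1 difference (an assumption chosen for brevity), and the dynamic-object mask is assumed to be supplied externally, for example from optical flow or semantic segmentation.

```python
import numpy as np

def masked_photometric_loss(target, warped, dynamic_mask):
    """Mean L1 photometric error over pixels NOT flagged as dynamic.

    target:       (H, W, C) target frame.
    warped:       (H, W, C) source frame warped into the target view.
    dynamic_mask: (H, W) boolean array, True where a pixel belongs to a
                  moving object (assumed to come from an external detector).
    """
    target = np.asarray(target, dtype=np.float64)
    warped = np.asarray(warped, dtype=np.float64)
    static = ~np.asarray(dynamic_mask, dtype=bool)

    # If every pixel is flagged as dynamic there is nothing to supervise.
    if not static.any():
        return 0.0

    # Per-pixel error averaged over colour channels, then over static pixels.
    per_pixel = np.abs(target - warped).mean(axis=-1)
    return float(per_pixel[static].mean())
```

With an all-False mask this reduces to the ordinary photometric loss over the whole image; flagging a moving region removes its contribution entirely, which is the "outlier" treatment in its simplest form.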

Keywords:

dynamic scenes; monocular depth estimation; unsupervised learning; deep learning; 3D reconstruction

Survey on unsupervised monocular depth estimation in dynamic scenes based on deep learning

Abstract:

No real-world scene is completely static. Monocular depth estimation in dynamic scenes recovers the depth of both the dynamic foreground and the static background from a single image, and it is more flexible and less costly than traditional stereo methods. It has considerable research value and broad prospects, playing a key role in downstream tasks such as 3D reconstruction and autonomous driving. With the rapid development of deep learning, unsupervised learning, which needs no ground-truth labels, has attracted broad research interest. Researchers in China and abroad have proposed a series of unsupervised monocular depth estimation algorithms that handle dynamic objects in scenes, laying a foundation for related work, but these methods have not yet been analyzed comprehensively. To address this gap, this paper systematically reviews progress in deep-learning-based unsupervised monocular depth estimation in dynamic scenes. First, the basic models of deep-learning-based unsupervised monocular depth estimation are summarized, the way self-supervised constraints are applied between images is explained, and a basic framework for unsupervised monocular depth estimation from consecutive frames is presented. The impact of dynamic objects on images is explained from four aspects: epipolar lines, triangulation, fundamental matrix estimation, and reprojection error. Second, the datasets and evaluation metrics commonly used in monocular depth estimation research are introduced. The KITTI and Cityscapes datasets provide continuous outdoor image sequences and the NYU Depth V2 dataset provides indoor dynamic scene data; these are generally used for model training. The Make3D dataset provides depth data but non-consecutive images, so it is generally used to test a model's generalization ability. The algorithms are analyzed quantitatively using root mean square error (RMSE), logarithmic root mean square error (RMSE log), absolute relative error (Abs Rel), squared relative error (Sq Rel), and accuracy (Acc), and the performance of classic monocular depth estimation models in dynamic scenes is compared. Then, according to how dynamic objects are handled, two research directions are summarized and analyzed: robust depth estimation in dynamic scenes, and dynamic object tracking and depth estimation. Robust depth estimation in dynamic scenes extracts dynamic objects and treats them as outliers during model training to minimize their impact, so that training relies solely on static background information. Dynamic object tracking and depth estimation accurately separates the dynamic foreground from the static background and processes the two regions separately. Algorithms that detect and segment dynamic objects using optical flow, semantic information, and other cues while estimating their motion are explained, and the advantages and disadvantages of each type of algorithm are summarized using the common evaluation criteria. Finally, future directions for monocular depth estimation in dynamic scenes are discussed in terms of network model optimization, online learning and generalization, real-time operation on embedded devices, and domain adaptation of unsupervised learning.
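The evaluation metrics named in the abstract can be written out concretely. The following is a minimal NumPy sketch, not the survey's evaluation code: the function name `depth_metrics`, the valid depth range of 0.001 to 80, and the accuracy thresholds of 1.25, 1.25² and 1.25³ are assumptions that follow common practice rather than anything stated in the abstract.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute Abs Rel, Sq Rel, RMSE, RMSE log and threshold accuracy.

    pred, gt: arrays of the same shape holding predicted and ground-truth
    depth. Pixels whose ground truth lies outside (min_depth, max_depth)
    are ignored (assumed convention for sparse ground truth).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    valid = (gt > min_depth) & (gt < max_depth)
    pred = np.clip(pred[valid], min_depth, max_depth)
    gt = gt[valid]

    diff = pred - gt
    # Symmetric ratio used by the threshold accuracy.
    ratio = np.maximum(pred / gt, gt / pred)

    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "acc_1": float(np.mean(ratio < 1.25)),
        "acc_2": float(np.mean(ratio < 1.25 ** 2)),
        "acc_3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Lower is better for the four error metrics and higher is better for the accuracies. Unsupervised monocular methods recover depth only up to scale, so evaluations often rescale predictions before computing these numbers; that step is left out of this sketch.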

Keywords:

dynamic scenes, monocular depth estimation, unsupervised learning, deep learning, 3D reconstruction
