Depth Estimation of Monocular Road Images Based on Pyramid Scene Analysis Network
doi: 10.11999/JEIT180957
1. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
2. College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
Abstract: To address the limited accuracy of depth estimation from monocular vision images, a depth estimation method for road scenes based on a pyramid pooling network is proposed. The method extracts road-scene image features with a combination of four residual network blocks, whose stacking increases the depth of the network model, and then gradually restores the feature maps to the original image size by upsampling. To exploit the diversity of information across scales, the feature maps of each size produced during feature extraction are fused with the same-size feature maps produced during upsampling, which improves the accuracy of the depth estimates. In addition, a pyramid pooling network block performs scene parsing on the high-level features extracted by the four residual blocks; its output feature maps are finally restored to the original image size and fed into the prediction layer together with the output of the upsampling module. Experiments on the KITTI dataset show that the proposed method outperforms existing estimation methods.
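As a rough illustration only, the PyTorch sketch below assembles the pipeline the abstract describes: four residual stages from a ResNet-50 encoder, a pyramid pooling block over the highest-level features, an upsampling path that fuses same-size encoder and decoder feature maps, and a prediction layer that receives both the decoder output and the pyramid pooling branch restored to the input resolution. All layer widths, kernel sizes, and names (DepthNet, PyramidPooling, the 64-channel projections) are our assumptions, not the authors' published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50  # torchvision >= 0.13 API assumed

class PyramidPooling(nn.Module):
    """Pool the input at several grid sizes, project each pooled map to
    fewer channels, upsample back, and concatenate with the input."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1, bias=False),
                          nn.BatchNorm2d(in_ch // len(bins)),
                          nn.ReLU(inplace=True))
            for b in bins])

    def forward(self, x):
        size = x.shape[2:]
        pooled = [F.interpolate(s(x), size=size, mode='bilinear',
                                align_corners=False) for s in self.stages]
        return torch.cat([x] + pooled, dim=1)  # doubles the channel count

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        # the four residual stages used as the feature extractor
        self.enc = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        # pyramid pooling branch over the highest-level features
        self.ppm = PyramidPooling(2048)
        self.ppm_proj = nn.Conv2d(4096, 64, 3, padding=1)
        # upsampling path with same-size skip fusion
        self.reduce = nn.Conv2d(2048, 512, 3, padding=1)
        self.dec3 = nn.Conv2d(512 + 1024, 256, 3, padding=1)
        self.dec2 = nn.Conv2d(256 + 512, 128, 3, padding=1)
        self.dec1 = nn.Conv2d(128 + 256, 64, 3, padding=1)
        # prediction layer sees the decoder output and the pyramid branch
        self.pred = nn.Conv2d(64 + 64, 1, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        f, skips = self.stem(x), []
        for stage in self.enc:                 # 256/512/1024/2048 channels
            f = stage(f)
            skips.append(f)
        p = F.relu(self.ppm_proj(self.ppm(skips[-1])))
        p = F.interpolate(p, size=(h, w), mode='bilinear', align_corners=False)
        f = F.relu(self.reduce(skips[-1]))
        for dec, skip in [(self.dec3, skips[2]), (self.dec2, skips[1]),
                          (self.dec1, skips[0])]:
            f = F.interpolate(f, size=skip.shape[2:], mode='bilinear',
                              align_corners=False)
            f = F.relu(dec(torch.cat([f, skip], dim=1)))  # same-size fusion
        f = F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
        return self.pred(torch.cat([f, p], dim=1))

if __name__ == '__main__':
    x = torch.randn(1, 3, 224, 224)
    print(DepthNet()(x).shape)                 # torch.Size([1, 1, 224, 224])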
Key words:
- Monocular vision
- Depth estimation
- Neural network
- Pyramid pooling network
-
Table 1  Error and correlation between predicted and ground-truth depth images

Method             RMSE    Lg     Lg_rms  a1     a2     a3
Fine_coarse[17]    2.6440  0.272  0.167   0.488  0.948  0.972
ResNet50[18]       2.4618  0.243  0.126   0.674  0.943  0.972
ResNet_fcn50[19]   2.5284  0.247  0.134   0.636  0.950  0.979
D_U[20]            2.8246  0.305  0.127   0.634  0.916  0.945
UVD_fcn[21]        2.6507  0.264  0.145   0.566  0.945  0.970
Proposed method    2.3504  0.230  0.120   0.684  0.949  0.975
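The columns of Table 1 follow the usual monocular-depth evaluation protocol. Assuming the standard definitions (RMSE in metres; Lg as the mean absolute log10 error; Lg_rms as the RMSE of log depth; a1, a2, a3 as the fractions of pixels whose prediction/ground-truth ratio stays below 1.25, 1.25², 1.25³), a NumPy sketch of the computation is shown below; the column-to-formula mapping is our assumption, not stated in the excerpt.

import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular-depth metrics; pred and gt hold positive depths
    (e.g. metres) at valid pixels only."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    lg = np.mean(np.abs(np.log10(pred) - np.log10(gt)))          # 'Lg' (assumed log10)
    lg_rms = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))  # 'Lg_rms' (assumed ln)
    ratio = np.maximum(pred / gt, gt / pred)
    a1, a2, a3 = (np.mean(ratio < 1.25 ** k) for k in (1, 2, 3))
    return {'RMSE': rmse, 'Lg': lg, 'Lg_rms': lg_rms,
            'a1': a1, 'a2': a2, 'a3': a3}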
Table 2  Results of different scale-recovery methods

Method                                RMSE    Lg     Lg_rms  a1     a2     a3
Deconvolution-layer scale recovery    2.3716  0.237  0.125   0.673  0.946  0.973
Convolution-block scale recovery      2.4724  0.240  0.129   0.646  0.948  0.974
Upsampling-layer scale recovery       2.3504  0.230  0.120   0.684  0.949  0.975
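Table 2 contrasts three ways of recovering spatial scale in the decoder. A minimal sketch of three interchangeable candidates, under our reading of the rows (a transposed-convolution layer; a convolution block that rearranges channels into space, i.e. sub-pixel convolution; and a bilinear upsampling layer followed by a convolution, the variant the paper adopts), might look as follows; the exact layer configurations are assumptions.

import torch
import torch.nn as nn

def upscale_block(in_ch, out_ch, mode):
    # Three interchangeable ways to double a feature map's resolution.
    if mode == 'deconv':       # transposed-convolution layer (row 1, assumed)
        return nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2,
                                  padding=1)
    if mode == 'conv_block':   # sub-pixel convolution block (row 2, assumed)
        return nn.Sequential(nn.Conv2d(in_ch, out_ch * 4, 3, padding=1),
                             nn.PixelShuffle(2))
    if mode == 'upsample':     # upsampling layer + convolution (row 3, assumed)
        return nn.Sequential(nn.Upsample(scale_factor=2, mode='bilinear',
                                         align_corners=False),
                             nn.Conv2d(in_ch, out_ch, 3, padding=1))
    raise ValueError(mode)

x = torch.randn(1, 256, 32, 32)
for mode in ('deconv', 'conv_block', 'upsample'):
    print(mode, upscale_block(256, 128, mode)(x).shape)  # all (1, 128, 64, 64)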
-
References:
[1] LUO Yue, REN J, LIN Mude, et al. Single view stereo matching[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 155–163.
[2] SILBERMAN N, HOIEM D, KOHLI P, et al. Indoor segmentation and support inference from RGBD images[C]. The 12th European Conference on Computer Vision, Florence, Italy, 2012: 746–760.
[3] REN Xiaofeng, BO Liefeng, and FOX D. RGB-(D) scene labeling: Features and algorithms[C]. 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, USA, 2012: 2759–2766.
[4] SHOTTON J, SHARP T, KIPMAN A, et al. Real-time human pose recognition in parts from single depth images[J]. Communications of the ACM, 2013, 56(1): 116–124. doi: 10.1145/2398356.
[5] ALP GÜLER R, NEVEROVA N, and KOKKINOS I. DensePose: Dense human pose estimation in the wild[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 7297–7306.
[6] LUO Wenjie, SCHWING A G, and URTASUN R. Efficient deep learning for stereo matching[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 5695–5703.
[7] FLINT A, MURRAY D, and REID I. Manhattan scene understanding using monocular, stereo, and 3D features[C]. 2011 International Conference on Computer Vision, Barcelona, Spain, 2011: 2228–2235.
[8] KUNDU A, LI Yin, DELLAERT F, et al. Joint semantic segmentation and 3D reconstruction from monocular video[C]. The 13th European Conference on Computer Vision, Zurich, Switzerland, 2014: 703–718.
[9] YAMAGUCHI K, MCALLESTER D, and URTASUN R. Efficient joint segmentation, occlusion labeling, stereo and flow estimation[C]. The 13th European Conference on Computer Vision, Zurich, Switzerland, 2014: 756–771.
[10] BAIG M H and TORRESANI L. Coupled depth learning[C]. 2016 IEEE Winter Conference on Applications of Computer Vision, Lake Placid, USA, 2016: 1–10.
[11] EIGEN D and FERGUS R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture[C]. 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 2650–2658.
[12] SCHARSTEIN D and SZELISKI R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms[J]. International Journal of Computer Vision, 2002, 47(1/3): 7–42. doi: 10.1023/A:1014573219977.
[13] UPTON K. A modern approach[J]. Manufacturing Engineer, 1995, 74(3): 111–113. doi: 10.1049/me:19950308.
[14] FLYNN J, NEULANDER I, PHILBIN J, et al. DeepStereo: Learning to predict new views from the world's imagery[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 5515–5524.
[15] SAXENA A, CHUNG S H, and NG A Y. 3-D depth reconstruction from a single still image[J]. International Journal of Computer Vision, 2008, 76(1): 53–69.
[16] KARSCH K, LIU Ce, and KANG S B. Depth transfer: Depth extraction from video using non-parametric sampling[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(11): 2144–2158. doi: 10.1109/TPAMI.2014.2316835.
[17] EIGEN D, PUHRSCH C, and FERGUS R. Depth map prediction from a single image using a multi-scale deep network[C]. The 27th International Conference on Neural Information Processing Systems, Montréal, Canada, 2014: 2366–2374.
[18] LAINA I, RUPPRECHT C, BELAGIANNIS V, et al. Deeper depth prediction with fully convolutional residual networks[C]. The 4th International Conference on 3D Vision, Stanford, USA, 2016: 239–248.
[19] FU Huan, GONG Mingming, WANG Chaohui, et al. Deep ordinal regression network for monocular depth estimation[C]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 2002–2011.
[20] DIMITRIEVSKI M, GOOSSENS B, VEELAERT P, et al. High resolution depth reconstruction from monocular images and sparse point clouds using deep convolutional neural network[J]. Proceedings of SPIE, 2017, 10410: 104100H.
[21] MANCINI M, COSTANTE G, VALIGI P, et al. Toward domain independence for learning-based monocular depth estimation[J]. IEEE Robotics and Automation Letters, 2017, 2(3): 1778–1785. doi: 10.1109/LRA.2017.2657002.
[22] GARG R, VIJAY KUMAR B G, CARNEIRO G, et al. Unsupervised CNN for single view depth estimation: Geometry to the rescue[C]. The 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 2016: 740–756.
[23] KUZNIETSOV Y, STUCKLER J, and LEIBE B. Semi-supervised deep learning for monocular depth map prediction[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 6647–6655.
[24] GODARD C, MAC AODHA O, and BROSTOW G J. Unsupervised monocular depth estimation with left-right consistency[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 6602–6611.
[25] ZORAN D, ISOLA P, KRISHNAN D, et al. Learning ordinal relationships for mid-level vision[C]. 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 2015: 388–396.
[26] CHEN Weifeng, FU Zhao, YANG Dawei, et al. Single-image depth perception in the wild[C]. The 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 2016: 730–738.
[27] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition[C]. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 770–778.
[28] ZHAO Hengshuang, SHI Jianping, QI Xiaojuan, et al. Pyramid scene parsing network[C]. 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 6230–6239.
[29] ZHOU Bolei, KHOSLA A, LAPEDRIZA A, et al. Object detectors emerge in deep scene CNNs[J]. arXiv preprint arXiv:1412.6856, 2014.
[30] SZEGEDY C, LIU Wei, JIA Yangqing, et al. Going deeper with convolutions[C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015: 1–9.
[31] UHRIG J, SCHNEIDER N, SCHNEIDER L, et al. Sparsity invariant CNNs[C]. 2017 International Conference on 3D Vision, Qingdao, China, 2017: 11–20.
[32] KINGMA D P and BA J. Adam: A method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014.