PhD thesis of Zehua Fu
Dense stereo matching has long been a fundamental problem in computer vision due to its wide range of applications involving three-dimensional scene reconstruction, such as robotics, entertainment, and autonomous systems. It recovers depth information by finding corresponding matches from the left image to the right image. Given two images of the same scene taken from different viewpoints by cameras with a horizontal displacement, the task of stereo matching is to find the corresponding pixels between the left and right images. The distance between corresponding points is called the disparity, and the set of all disparities over the image is called the disparity map. Recently, many end-to-end convolutional neural networks (CNNs) have been designed for stereo matching. Although the disparity maps estimated by such networks outperform all traditional stereo matching methods, they still struggle with depth discontinuities and outliers, which easily lead to wrong matches. To improve dense disparity estimation, several methods have been proposed to rate the correctness of matches. These methods are called confidence measures.
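The matching principle described above can be sketched on a toy 1-D example (this is an illustration of the basic idea, not the learned methods studied in the thesis): for each pixel on the left scanline, search a disparity range and pick the shift that best matches the right scanline.

```python
import numpy as np

# Toy 1-D illustration of stereo matching: for each pixel x on the left
# scanline, try every disparity d in a range and keep the one that
# minimises the intensity difference with right[x - d].
left  = np.array([0., 10., 20., 30., 40., 50.])   # left scanline
right = np.array([20., 30., 40., 50., 60., 70.])  # same scene, shifted by 2 px

max_disp = 3
disparity = np.zeros(len(left), dtype=int)
for x in range(len(left)):
    costs = [abs(left[x] - right[x - d]) if x - d >= 0 else np.inf
             for d in range(max_disp + 1)]
    disparity[x] = int(np.argmin(costs))
# disparity[2:] is [2, 2, 2, 2]: the true 2-pixel shift is recovered
# wherever a valid correspondence exists.
```

With a calibrated rig, depth then follows from disparity as focal_length × baseline / disparity, which is why an accurate disparity map directly yields scene depth.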
State-of-the-art deep learning-based confidence measures use disparity maps as input. The limitation of such methods is that wrong matches near edges and in texture-less regions remain difficult to detect. In our first contribution, we explored the confidence prediction abilities of different types of data in stereo matching and proposed a novel CNN method that takes multi-modal data, including initial disparity maps and reference RGB images, as input. We then studied fusion strategies for the multi-modal data. Evaluated on both the KITTI 2012 and KITTI 2015 datasets, our multi-modal approach achieved the best performance at the time.
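One common way to combine such modalities, and a natural baseline for the fusion strategies studied here, is early fusion: stacking the 1-channel disparity map and the 3-channel reference image along the channel axis before the first convolution. A minimal sketch of the data layout (shapes and channel counts are illustrative, not the thesis architecture):

```python
import numpy as np

# Hypothetical early-fusion input for a confidence CNN: concatenate the
# 1-channel initial disparity map with the 3-channel reference RGB image
# along the channel axis, giving one 4-channel tensor per pixel grid.
h, w = 4, 5
disparity_map = np.random.rand(1, h, w)  # initial disparity (1 channel)
rgb_image     = np.random.rand(3, h, w)  # reference left image (3 channels)

early_fused = np.concatenate([disparity_map, rgb_image], axis=0)
# early_fused.shape is (4, h, w): one joint input volume for the network
```

A late-fusion alternative instead extracts features from each modality with a separate branch and concatenates the resulting feature maps deeper in the network; the choice between these strategies is exactly the kind of design question the first contribution investigates.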
In our second contribution, we first discussed how receptive fields influence the prediction of disparity errors. We then improved our multi-modal architecture by enlarging the effective receptive field with an adaptive dilated convolution block, so that more contextual information is learned for disparity error detection. Experimental results showed that the improved method achieves higher confidence-prediction performance with the same computational cost as the original multi-modal architecture.
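The computational argument behind dilated convolutions can be made concrete with the standard receptive-field recurrence (a generic property of dilated convolutions, not a formula specific to the thesis): each layer with kernel size k and dilation d adds (k − 1)·d to the receptive field, so raising the dilation rates widens the context seen by the network without adding parameters or operations.

```python
# Receptive field of a stack of 1-strided convolutions:
# each layer with kernel size k and dilation d adds (k - 1) * d.
def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Three 3x3 layers, same parameter count and compute in both cases:
plain   = receptive_field(3, [1, 1, 1])  # ordinary convolutions -> 7
dilated = receptive_field(3, [1, 2, 4])  # dilated convolutions  -> 15
```

With identical cost, the dilated stack more than doubles the receptive field (15 vs. 7 pixels here), which is why enlarging dilation rates is an attractive way to gather more context for error detection.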
Combined with our multi-modal confidence network, we proposed a novel recurrent refinement module that refines disparity maps stage by stage, which is our third contribution. To the best of our knowledge, this is the first work to refine disparity maps with a recurrent neural network (RNN). The proposed module can be easily plugged into different stereo matching CNNs for end-to-end training. Experimental results showed that our confidence-based refinement module yields significant improvements on both the KITTI 2012 and KITTI 2015 stereo benchmarks.
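The idea of confidence-gated, stage-by-stage refinement can be sketched as follows. This is a hand-written stand-in for the learned recurrent update, under the assumption that each stage keeps high-confidence disparities and pulls low-confidence ones toward a correction signal (here a simple local average); the actual module learns this update end-to-end.

```python
import numpy as np

# Hypothetical confidence-gated refinement: at each stage, blend the
# current disparity with a locally smoothed estimate, weighting by the
# per-pixel confidence so reliable pixels stay fixed.
def refine(disp, conf, stages=3):
    disp = disp.copy()
    for _ in range(stages):
        smoothed = np.convolve(disp, np.ones(3) / 3, mode="same")
        disp = conf * disp + (1 - conf) * smoothed
    return disp

disp = np.array([5., 5., 20., 5., 5.])   # 20.0 is an outlier match
conf = np.array([1., 1., 0., 1., 1.])    # zero confidence at the outlier
refined = refine(disp, conf)
# the outlier is pulled toward its neighbours; confident pixels are untouched
```

Running the stages recurrently, rather than as one feed-forward correction, lets each pass operate on an already partially repaired disparity map, which is the intuition behind the stage-by-stage design.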
Supervisor: Mohsen Ardabilian