Vision-based navigation using Deep Reinforcement Learning

Jonáš Kulhánek Delft University of Technology
Erik Derner Czech Technical University in Prague
Tim de Bruin Delft University of Technology
Robert Babuška Delft University of Technology


Deep reinforcement learning (RL) has been successfully applied to a variety of game-like environments. However, applying deep RL to visual navigation in realistic environments remains challenging. We propose a novel learning architecture capable of navigating an agent, e.g. a mobile robot, to a target given by an image. To achieve this, we have extended the batched A2C algorithm with auxiliary tasks designed to improve visual navigation performance.
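
As a rough illustration of the batched A2C component only, the sketch below computes discounted n-step returns for several environments stepped in parallel. It is not taken from the paper's code: the function name `nstep_returns`, the array layout (steps by environments), and the default `gamma` are assumptions made for this example.

```python
import numpy as np

def nstep_returns(rewards, dones, bootstrap_values, gamma=0.99):
    """Discounted n-step returns for a batch of parallel environments.

    rewards, dones: arrays of shape (T, B), T rollout steps, B environments.
    bootstrap_values: shape (B,), critic estimate for the state that
        follows the last rollout step in each environment.
    All names and the default gamma are illustrative assumptions.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = np.asarray(bootstrap_values, dtype=np.float64).copy()
    for t in reversed(range(rewards.shape[0])):
        # A terminal step cuts the bootstrap: nothing after it is added.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

if __name__ == "__main__":
    # Two environments, two steps; environment 0 terminates at the last step.
    r = [[0.0, 0.0], [1.0, 0.0]]
    d = [[0, 0], [1, 0]]
    print(nstep_returns(r, d, bootstrap_values=[5.0, 2.0], gamma=0.5))
```

The advantage used by the actor is then the return minus the critic's value estimate for the same step.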

We propose three additional auxiliary tasks: predicting the segmentation of the observation image, predicting the segmentation of the target image, and predicting the depth map. These tasks make it possible to pre-train a large part of the network with supervised learning and substantially reduce the number of training steps. Training performance is further improved by gradually increasing the environment complexity over time. We also propose an efficient neural network structure that can learn multiple targets in multiple environments. Our method navigates in continuous state spaces and, on the AI2-THOR environment simulator, outperforms state-of-the-art goal-oriented visual navigation methods from the literature.
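
One common way to attach auxiliary tasks to an actor-critic objective is to add weighted auxiliary loss terms to the RL loss. The sketch below shows that pattern for the three tasks named above. It is an assumption-laden illustration rather than the authors' formulation: the mean-squared-error loss, the coefficient values, the task keys, and the function names are all hypothetical, and the two UNREAL auxiliary tasks are omitted.

```python
import numpy as np

def a2c_loss(log_probs, advantages, values, returns, entropy,
             value_coef=0.5, entropy_coef=0.01):
    """Standard A2C loss: policy gradient + value regression - entropy bonus.

    The coefficient defaults are common choices, not values from the paper.
    """
    policy_loss = -np.mean(np.asarray(log_probs) * np.asarray(advantages))
    value_loss = np.mean((np.asarray(returns) - np.asarray(values)) ** 2)
    return float(policy_loss + value_coef * value_loss
                 - entropy_coef * np.mean(entropy))

def mse(pred, target):
    """Mean squared error between a predicted and a ground-truth image."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def total_loss(rl_loss, aux_preds, aux_targets, aux_weights):
    """RL loss plus a weighted sum of auxiliary prediction losses."""
    loss = rl_loss
    for name, weight in aux_weights.items():
        loss += weight * mse(aux_preds[name], aux_targets[name])
    return loss

if __name__ == "__main__":
    shape = (4, 4)  # tiny stand-in for an image
    preds = {k: np.zeros(shape) for k in ("depth", "obs_seg", "target_seg")}
    targets = {k: np.ones(shape) for k in preds}
    weights = {"depth": 0.1, "obs_seg": 0.1, "target_seg": 0.1}  # hypothetical
    rl = a2c_loss([-1.0, -1.0], [1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
    print(total_loss(rl, preds, targets, weights))
```

Because the auxiliary targets (depth and segmentation) come from the simulator rather than from rewards, the same auxiliary terms can also be minimized on their own, which is what makes supervised pre-training of the shared layers possible.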

SUNCG dataset results

We compared our method (A2CAT-VN) with the batched A2C extended with the original two UNREAL auxiliary tasks on the SUNCG dataset [1]. With the additional auxiliary tasks for visual navigation enabled, A2CAT-VN converged much faster: it reached an average episode length of 200 in roughly 3 · 10^6 frames, whereas without the additional tasks training took roughly 8 · 10^6 frames to reach the same level. The following plot shows the average episode length during training.
Figure: SUNCG training comparison
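
To put the two approximate frame counts quoted above in relative terms, the short calculation below computes the ratio and the fraction of frames saved. The function names are made up for this example, and the inputs are the rounded figures from the text, so the result is only indicative.

```python
def frame_ratio(baseline_frames, our_frames):
    """How many times more frames the baseline needed."""
    return baseline_frames / our_frames

def frames_saved(baseline_frames, our_frames):
    """Fraction of the baseline's frames that was not needed."""
    return 1.0 - our_frames / baseline_frames

if __name__ == "__main__":
    with_aux = 3e6      # approx. frames to reach episode length 200, extra tasks on
    without_aux = 8e6   # approx. frames to reach the same level, extra tasks off
    print(f"{frame_ratio(without_aux, with_aux):.2f}x fewer frames, "
          f"{frames_saved(without_aux, with_aux):.1%} saved")
    # prints "2.67x fewer frames, 62.5% saved"
```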

AI2-THOR environment results

We trained our algorithm on four environments from the AI2-THOR simulator [2], each with multiple targets, and compared it with the method of Zhu et al. [3]. The environments we chose for this experiment were larger and more difficult to navigate than those used in [3], but came from the same AI2-THOR simulator. Training with our algorithm took roughly one day, while training the network with the algorithm described in [3] took three days. The following plot compares our model (A2CAT-VN) with the method of Zhu et al. [3] (ICRA2017) and shows the average episode length during training.
Figure: AI2-THOR training comparison


[1] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1746-1754, 2017.
[2] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017.
[3] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3357-3364, 2017.


Please use the following citation:
  title={Vision-based navigation using deep reinforcement learning},
  author={Kulh{\'a}nek, Jon{\'a}{\v{s}} and Derner, Erik and De Bruin, Tim and Babu{\v{s}}ka, Robert},
  booktitle={2019 European Conference on Mobile Robots (ECMR)},