Human detection and distance estimation with monocular camera using YOLOv3 neural network
Making machines perceive environment better or at least as well as humans would be beneficial in lots of domains. Different sensors aid in this, most widely used of which is monocular camera. Object detection is a major part of environment perception and its accuracy has greatly improved in the last few years thanks to advanced machine learning methods called convolutional neural networks (CNN) that are trained on many labelled images. Monocular camera image contains two dimensional information, but contains no depth information of the scene. On the other hand, depth information of objects is important in a lot of areas related to autonomous driving, e.g. working next to an automated machine, pedestrian crossing a road in front of an autonomous vehicle, etc. This thesis presents an approach to detect humans and to predict their distance from RGB camera for off-road autonomous driving. This is done by improving YOLO (You Only Look Once) v3, a state-of-the-art object detection CNN. Outside of this thesis, an off-road scene depicting a snowy forest with humans in different body poses was simulated using AirSim and Unreal Engine. Data for training YOLOv3 neural network was extracted from there using custom scripts. Also, network was modified to not only predict humans and their bounding boxes, but also their distance from camera. RMSE of 2.99m for objects with distances up to 50m was achieved, while maintaining similar detection accuracy to the original network. Comparable methods using two neural networks and a LASSO model gave 4.26m (in an alternative dataset) and 4.79m (with dataset used is this work) RMSE respectively, showing a huge improvement over the baselines. Future work includes experiments with real-world data to see if the proposed approach generalizes to other environments.
The following license files are associated with this item: