"A car is just software on wheels"
-NVIDIA co-founder and CEO Jen-Hsun Huang


Training

For a car to be smart enough to drive safely on the road, it must have an onboard GPU running a neural network that was trained for thousands of hours on millions of frames of video. Every company researching autonomous vehicles uses machine learning to create the most comprehensive 360-degree, 3D view from the car's perspective.



The animation above is from NVIDIA's training program. They created a system called DriveNet, which was trained on a set of 1.2 million images called 'ImageNet'. They warped the images and applied false color corrections to make the training set 10 times as large. They fed these images to a convolutional neural network to produce 1,000 different classes (e.g., person, two people, street sign, dog, fire hydrant, cloud, etc.). Using their accelerated GPUs, they were able to do this classification in only one month; other hardware would have taken a couple of years to train this neural network. Now, this same network can go beyond classification and actually detect these objects. NVIDIA calls this system CityScape, and they will be sharing it. The video above shows a work-in-progress stage: first the network detected bounding boxes, then it learned to identify individual pixels and connect them with objects, color coding the exact pixels of each object.
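To get a feel for the "10 times as large" trick, here is a minimal sketch of that kind of dataset augmentation. It is not NVIDIA's actual pipeline; the function names, flip/shift choices, and parameters are made up for illustration, and it only assumes images stored as NumPy arrays.

```python
# Illustrative sketch of dataset augmentation: each labeled source image is
# warped (here, just flipped) and color-shifted to produce extra training
# samples. NOT NVIDIA's actual pipeline; all names and values are made up.
import numpy as np

def random_color_shift(image, max_shift=30):
    """Add a random per-channel offset, simulating a 'false' color correction."""
    shift = np.random.randint(-max_shift, max_shift + 1, size=(1, 1, 3))
    return np.clip(image.astype(np.int16) + shift, 0, 255).astype(np.uint8)

def random_horizontal_flip(image, p=0.5):
    """Flip the image left-right with probability p (a simple geometric warp)."""
    return image[:, ::-1, :] if np.random.rand() < p else image

def augment(image, copies=10):
    """Turn one labeled image into `copies` perturbed variants."""
    return [random_color_shift(random_horizontal_flip(image)) for _ in range(copies)]

# Example: a fake 32x32 RGB image expanded into 10 variants.
original = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
variants = augment(original, copies=10)
print(len(variants), variants[0].shape)
```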





Additionally, once the network is ready to categorize everything on the road, the cars still need millions of miles of test driving. Tesla, for example, already claims that its cars have logged over 780 million miles.

Radar



Radar (RAdio Detection And Ranging) uses radio waves. They travel far, are invisible to humans, and are easy to detect even when they are faint. High-end cars already have radar to track nearby objects and control the car's speed for you. Mercedes' Distronic Plus accident-prevention system includes units on the rear bumper that trigger an alert when they detect something in the car's blind spot. Google's self-driving car has four radars, two on the back bumper and two on the front, and their main purpose is to maintain at least a 2-4 second distance between the car and all other objects.
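As a rough illustration of that 2-4 second rule, here is a minimal sketch that turns a radar range reading and the car's own speed into a time gap. The specific numbers and function names are made up; a real system would do far more filtering and tracking.

```python
# Rough illustration of the 2-4 second following-gap rule described above.
# The radar range and vehicle speed are made-up example values.

def following_gap_seconds(range_m: float, ego_speed_mps: float) -> float:
    """Time gap to the object ahead: distance divided by our own speed."""
    return float("inf") if ego_speed_mps <= 0 else range_m / ego_speed_mps

def gap_is_safe(range_m: float, ego_speed_mps: float, min_gap_s: float = 2.0) -> bool:
    """True if we are at least `min_gap_s` seconds behind the detected object."""
    return following_gap_seconds(range_m, ego_speed_mps) >= min_gap_s

# Example: a radar return 50 m ahead while traveling at 25 m/s (~90 km/h)
# gives a 2.0 s gap, right at the lower bound.
print(following_gap_seconds(50.0, 25.0))   # 2.0
print(gap_is_safe(50.0, 25.0))             # True
```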

LiDAR



LiDAR (Light Detection and Ranging) works like radar, but with pulses of light rather than radio waves. It calculates how far an object is from the moving vehicle based on the time it takes for a laser beam to hit the object and come back. Cars use lidar for obstacle detection and avoidance so they can navigate safely through their environment. The lidar sensor sits at the top center of the car: Google employs Velodyne's rooftop unit, which uses 64 lasers spinning at upwards of 900 rpm to generate a point cloud that gives the car a 360-degree view. Point cloud outputs from the lidar sensor provide the data the software needs to determine where potential obstacles exist in the environment.
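A minimal sketch of how those spinning-laser measurements become a point cloud: each return is an (azimuth angle, beam elevation, range) triple that converts to an x, y, z point. The sample values below are made up; a real Velodyne driver emits many thousands of such returns per rotation.

```python
# Minimal sketch: converting spinning-lidar returns (azimuth, elevation,
# measured range) into Cartesian x, y, z points around the sensor.
import math

def lidar_return_to_point(azimuth_deg, elevation_deg, range_m):
    """Spherical (sensor-centered) coordinates -> Cartesian point."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = range_m * math.cos(el) * math.cos(az)
    y = range_m * math.cos(el) * math.sin(az)
    z = range_m * math.sin(el)
    return (x, y, z)

# A few fake returns from one rotation: (azimuth, elevation, range in meters).
returns = [(0.0, -10.0, 12.5), (90.0, 0.0, 30.2), (180.0, 5.0, 8.7)]
point_cloud = [lidar_return_to_point(*r) for r in returns]
for p in point_cloud:
    print(tuple(round(c, 2) for c in p))
```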

Ultrasonic



An ultrasonic wave is a sound wave with a very high frequency (above 20,000 Hz), inaudible to humans. Ultrasonic detectors provide more accurate mapping of the surroundings at very short range. If you have a car that helps you park and back up, it is already using ultrasonic sensors. They handle a range of about 1.2 to 4.5 meters. In self-driving cars, the ultrasonic sensors are mounted at the wheels, as shown to the left.
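A quick sketch of how an echo time becomes a distance, assuming sound travels at roughly 343 m/s in air; the echo times below are made-up examples chosen to land inside the 1.2-4.5 meter range mentioned above.

```python
# Quick sketch: converting an ultrasonic echo time into a distance. Sound
# travels out to the obstacle and back, so the one-way distance is half of
# (speed of sound * round-trip time). Echo times are made-up examples.
SPEED_OF_SOUND_MPS = 343.0  # in air at roughly 20 degrees C

def echo_time_to_distance(echo_time_s: float) -> float:
    return SPEED_OF_SOUND_MPS * echo_time_s / 2.0

# A ~7 ms echo corresponds to roughly 1.2 m; ~26 ms to roughly 4.5 m,
# matching the short working range described above.
for t in (0.007, 0.026):
    print(round(echo_time_to_distance(t), 2), "m")
```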

Computer Vision

How do we get from camera input to classification?





Computer Vision requires using a convolutional neural network. The images below all link to great articles explaining how this works.




STEP 1: Convolutional neural networks are a layered process. The two images to the left both attempt to show how the system gets from the input image to the output classification. Let's break down this diagram and, as shown, start with a square image that is 32px wide and 32px tall.


STEP 2: Rather than focusing on one pixel at a time, a convolutional net looks at the image in patches. Most examples show a 5px x 5px feature analysis. Let's say our first feature layer is looking for edges. It scans the 32x32 image 25 pixels at a time (5px x 5px), recording whether or not each patch contains an edge. Iterating through the image, it builds a new array of information: if the feature believes a patch contains an edge, it pushes a 1 to the next layer, otherwise a -1. This creates a new layer that is a smaller array of information than the previous one. In the case of self-driving cars, some neural networks have been trained with 96 different feature layers (edge detection being just one of the 96).


Note: the image shows a 7px x 7px image being reduced to a 5px x 5px array of information
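Here is a minimal NumPy sketch of the sliding-window idea in STEP 2, using the sizes from the note above: one small 3x3 "edge" feature scanned over a 7x7 image produces a 5x5 map of match scores. The toy image and feature are made up; real networks learn their features and use many of them.

```python
# Minimal sketch of STEP 2: slide one small feature (filter) across the image
# and record how strongly each patch matches it. A 7x7 image convolved with a
# 3x3 feature yields a 5x5 map of scores between -1 and 1.
import numpy as np

def convolve_valid(image, feature):
    """Slide `feature` over `image`, one pixel at a time, no padding."""
    fh, fw = feature.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + fh, j:j + fw]
            # Average elementwise product: near +1 = strong match, near -1 = mismatch.
            out[i, j] = np.mean(patch * feature)
    return out

# A 7x7 toy image (left half dark, right half bright) and a vertical-edge feature.
image = np.where(np.arange(7)[None, :] < 3, -1.0, 1.0) * np.ones((7, 7))
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
print(convolve_valid(image, vertical_edge))  # highest scores where the edge sits
```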


STEP 3: Now we have an array for each feature that is basically a map indicating where in the image that feature probably appears (on a scale of -1 to 1). If the feature we're looking for is a diagonal line, the new array will have a 1 everywhere a diagonal line definitely passes through those pixels.


Ref: 'How Convolutional Neural Networks work' YouTube Video


STEP 4: POOLING. For every convolved (filtered) layer, we pool it. This simply means taking the largest value in a pre-defined pixel square and pushing it into a new pooled layer that holds significantly fewer numbers.
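A minimal NumPy sketch of STEP 4, assuming non-overlapping 2x2 pooling windows (a common choice; other window sizes work the same way):

```python
# Minimal sketch of STEP 4: max pooling. Each non-overlapping 2x2 window of
# the convolved layer is replaced by its largest value, shrinking the map
# while keeping the strongest feature responses.
import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Group pixels into 2x2 blocks, then take the max of each block.
    blocks = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

# A 4x4 map of match scores pools down to 2x2.
scores = np.array([
    [0.9, -0.1,  0.2,  0.1],
    [0.3,  0.5, -0.4,  0.8],
    [-0.2, 0.0,  1.0,  0.6],
    [0.1,  0.4,  0.7, -0.3],
])
print(max_pool_2x2(scores))
# [[0.9 0.8]
#  [0.4 1. ]]
```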



STEP 5: Final Vote. After enough iterations of convolving and pooling, the remaining pooled values are flattened into one list, and each value casts a weighted vote for what the image is; the class with the most support wins.
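A minimal sketch of that final vote, assuming the pooled values have already been flattened into one vector and each class has a learned weight per value. The classes, weights, and pooled values below are made up purely to show the mechanics.

```python
# Minimal sketch of STEP 5: the 'final vote'. Each class holds one learned
# weight per pooled value; the class with the highest weighted sum wins.
# All numbers and class names here are made up.
import numpy as np

pooled = np.array([0.9, 0.8, 0.4, 1.0])          # flattened pooled layer
classes = ["pedestrian", "car", "street sign"]
weights = np.array([                              # one weight row per class
    [0.2, 0.9, 0.1, 0.3],
    [0.8, 0.1, 0.7, 0.9],
    [0.1, 0.2, 0.3, 0.1],
])

votes = weights @ pooled                              # one score per class
probabilities = np.exp(votes) / np.exp(votes).sum()  # softmax for readability
print(classes[int(np.argmax(votes))], probabilities.round(3))  # "car"
```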

Sensor Fusion



Sensor fusion is the process of combining all the sensory data coming from the sensors fitted on the car. The data is analysed in stages to identify the car's location and its surroundings. For a car to be able to drive itself, it requires sensors fitted on the car, a powerful computer, and a GPU. The GPU is where the real-time computing takes place, which is why the PX2 platform built by NVIDIA is the top choice for car companies building self-driving cars.




How the car analyses data and makes decisions:

The car receives inputs in the form of weather data, image data, and sensory data from sensors such as the ultrasonic sensors, radar, lidar, and GPS. The data from each sensor is analysed individually to detect obstacles and create a depth map.
The two wide-field-of-view cameras fitted at the front and the back of the car capture image data. The captured images are passed through a network called DriveNet, which NVIDIA trained on ImageNet to recognize a range of vehicles such as cars, buses, etc. (ImageNet is a database of over 15 million hand-labelled images that learning algorithms are trained on.) This enables the car to recognize obstacles by name.
The lidar sensor creates a depth map, a dynamic 3D model of the environment built from point cloud data.

Data Fusion:

Based on the inputs from the depth maps, a 2D occupancy grid is created that encodes which cells contain depth measurements. From this grid, the sensor fusion algorithm extracts obstacles from the occupied cells and refines their positions to determine the uncertainty of each detection.
The car assumes that it is moving on a plane, and any object that is not on that plane is identified as a potential obstacle. Each grid cell is checked to see if it is occupied: every cell is initialized with weight zero and is updated whenever new obstacle detections are available by adding the respective weight to the grid cell (a minimal sketch of this update follows below). The update is constantly performed in a separate compute thread, concurrent with depth map generation and obstacle detection, to keep the occupancy map current.
The occupancy map is color coded using the data received from the sensors: it marks which areas are identified as obstacles, which objects are moving cars on the road, and how far the car is from each of these obstacles.
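Here is a minimal sketch of the weighted occupancy-grid update described above, assuming fused obstacle detections arrive as (x, y, weight) positions on the car's ground plane. The cell size, grid extent, threshold, and detection values are made-up parameters for illustration, not figures from NVIDIA's system.

```python
# Minimal sketch of the occupancy-grid update: every cell starts at weight
# zero, and each new obstacle detection adds its weight to the cell its
# (x, y) position falls in. All parameters and values are made up.
import numpy as np

CELL_SIZE_M = 0.5        # each cell covers 0.5 m x 0.5 m of the ground plane
GRID_CELLS = 100         # 50 m x 50 m grid centered on the car
OCCUPIED_THRESHOLD = 1.0 # total weight above which a cell counts as an obstacle

grid = np.zeros((GRID_CELLS, GRID_CELLS))  # all cells initialized to weight zero

def update_grid(grid, detections):
    """Add each detection's weight to the cell its (x, y) position falls in."""
    for x_m, y_m, weight in detections:
        row = int(y_m / CELL_SIZE_M) + GRID_CELLS // 2
        col = int(x_m / CELL_SIZE_M) + GRID_CELLS // 2
        if 0 <= row < GRID_CELLS and 0 <= col < GRID_CELLS:
            grid[row, col] += weight

# One batch of fused detections: (x, y, confidence weight). Made-up values.
update_grid(grid, [(4.0, 10.0, 0.6), (4.2, 10.1, 0.7), (-8.0, 3.0, 0.4)])
occupied = np.argwhere(grid > OCCUPIED_THRESHOLD)
print(occupied)  # cells whose accumulated weight marks them as obstacles
```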

Checking the Geographical Location from GPS Data:

At this point we have a model of the world around the car, and the car is travelling through this space. The highly accurate model of the world is created offline using data collected by various cars. This map data, together with GPS, is available for the car to identify its geographical location and direction of travel and to find out which lane it is driving in, so that it can then run more algorithms for motion planning and navigation. This happens using the Localization Algorithm.
The combined output from the Sensor Fusion and Deep Learning algorithms gives the car information about its location and surroundings, which can be used by the 'path planning algorithm' to make decisions about how to navigate.

Resources

PX2 NVIDIA INFO
NVIDIA End to End
Sensor Fusion PDF
Tesla Demo Video

More to come . . .