Machine vision has long been the holy grail for unlocking a number of real-world use cases for AI – think of home automation and security, autonomous vehicles, crop-picking robots, retail analytics, delivery drones, or real-time threat detection. Until recently, however, AI models for computer vision have been constrained to expensive, sophisticated hardware that often contains neural accelerators, or these models had to be processed in the cloud on GPU- or TPU-enabled servers. Through Xnor’s groundbreaking research with the Allen Institute for AI on YOLO, Xnor-Net, YOLO9000, Tiny YOLO and other AI models, we’ve been able to move machine learning from expensive hardware and the cloud to simple, resource-constrained devices that operate completely on-device and autonomously. This means you can run sophisticated deep learning models on embedded devices without neural processors and without a data connection to the cloud. For example, on a 1.4 GHz dual-core ARM chip with no GPU, we can run object detection with CPU utilization of only 55%, a memory footprint of only 20MB, and power consumption of less than 4.7W.

Object Detection

Let’s dig into one specific model that we’ve built – object detection. Object detection is a type of AI model that identifies categories of objects present in images or videos – think people, automobiles, animals, packages, signs, lights, etc. – and then localizes them by drawing a bounding box around each one. Using a convolutional neural network (CNN), the model simultaneously draws multiple bounding boxes and predicts classification probabilities for those boxes.
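
To make the output format concrete, a detector’s result for a single frame typically looks something like the sketch below – a minimal Python illustration with made-up class names, scores, and coordinates, not Xnor’s actual API:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a class label, a confidence score, and a
    bounding box in (x_min, y_min, x_max, y_max) pixel coordinates."""
    label: str
    confidence: float
    box: tuple

# Illustrative output for one frame (values are made up)
detections = [
    Detection("person", 0.92, (34, 50, 180, 420)),
    Detection("package", 0.81, (200, 310, 295, 400)),
]
for d in detections:
    print(f"{d.label}: {d.confidence:.2f} at {d.box}")
```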

Traditionally, these models have been resource intensive because of the model architecture – the number of layers (convolution, pooling, and fully connected) – and because most CNNs use 32-bit floating-point operations.

Xnor’s approach is different; we’ve summarized it below.

Xnorization (How It Works)

Our models are optimized to run more efficiently and up to 10x faster through a process we call Xnorization. This process contains five essential steps. First, we binarize the AI model. Second, we design a compact model. Third, we prune the model. Fourth, we optimize the loss function. Fifth, we optimize the model for the specific piece of hardware.

Let’s explore each of these in further detail.

Model Binarization

To reduce the compute required to run our object detection models, the first step is to retrain them as a binary neural network called Xnor-Net. In Binary-Weight-Networks, the filters are approximated with binary values, which makes convolutional operations up to 58x faster and saves up to 32x in memory. Furthermore, these binary networks are simple, accurate, and efficient. In fact, the classification accuracy of a Binary-Weight-Network version of AlexNet is only 2.9% lower than the full-precision AlexNet (in top-1 measure).
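
As a rough illustration of the underlying idea (a sketch, not Xnor’s production code): a Binary-Weight-Network approximates each real-valued filter W with a single scaling factor α times a binary filter B, where B is the sign of W and α is the mean absolute value of W.

```python
import numpy as np

def binarize_filter(w):
    """Approximate a real-valued filter w as alpha * b, where b is a
    {-1, +1} filter and alpha is the mean absolute value of w."""
    alpha = np.abs(w).mean()                 # per-filter scaling factor
    b = np.where(w >= 0, 1.0, -1.0)          # binary filter in {-1, +1}
    return alpha, b

# Toy example: a single 3x3 filter
w = np.random.randn(3, 3).astype(np.float32)
alpha, b = binarize_filter(w)
print("mean approximation error:", np.abs(w - alpha * b).mean())
```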

To do this, both the filters and the inputs to convolutional layers are binarized, and the convolutions are approximated using primarily binary operations. Finally, those operations are parallelized on CPUs and optimized to reduce model size. This lets us replace floating-point operations with binary operations, making the model hyper-efficient. Once completed, we have state-of-the-art accuracy for models that:

  • Are 10x faster
  • Can be 20-200x more power efficient
  • Need 8-15x less memory than traditional deep learning models
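
To make “primarily binary operations” concrete, here is a toy sketch (ours, not Xnor’s optimized kernels) of why binarized convolutions are so cheap: once activations and weights are both packed into bit masks, the dot product of two ±1 vectors reduces to an XOR plus a population count.

```python
def pack_bits(vec):
    """Pack a {-1, +1} vector into an integer bit mask (bit i = 1 iff vec[i] == +1)."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(x_bits, w_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n:
    dot = n - 2 * popcount(x XOR w)."""
    return n - 2 * bin(x_bits ^ w_bits).count("1")

x = [+1, -1, +1, +1]
w = [+1, +1, -1, +1]
print(binary_dot(pack_bits(x), pack_bits(w), len(x)))   # 0, same as sum(xi * wi)
```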

Compact Model Design

The second critical piece is to design models that are compact. Without compact model design, the compute required for the model remains high. Our Xnorized models use a compact design to reduce the number of required operations and the model size. We design the model with as few layers and parameters as possible. The exact design depends on the hardware, but we take the same fundamental approach for each model.
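
As a back-of-the-envelope illustration of why this matters (the layer sizes below are made up, not an actual Xnor architecture), the weight count of a convolutional layer grows with the product of its input and output channels, so trimming channel widths and layer counts shrinks the model quickly:

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a square convolution (bias terms ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size

print(conv_weights(256, 256, 3))   # 589,824 weights for a "wide" 3x3 layer
print(conv_weights(64, 64, 3))     # 36,864 weights -- 16x smaller
```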

Sparse Model Design

Third, a variety of techniques are used to prune the model’s operations and parameters. This reduces the model size and minimizes the operations necessary to provide accurate results. Most of the parameters are set to zero, leaving only a small number of non-zero parameters. We can then skip all computations involving the zero parameters and store only the indexes and values of the non-zero ones.
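
A minimal sketch of that idea, using simple magnitude pruning and a coordinate-style index/value encoding (the real pruning pipeline is more involved):

```python
import numpy as np

def prune_and_compress(w, sparsity=0.9):
    """Zero out the smallest-magnitude parameters, then keep only the
    indexes and values of the non-zero survivors."""
    flat = w.ravel()
    k = int(sparsity * flat.size)               # how many weights to drop
    cutoff = np.sort(np.abs(flat))[k]           # magnitude threshold
    mask = np.abs(flat) >= cutoff
    indexes = np.nonzero(mask)[0]               # positions of non-zero weights
    values = flat[mask]                         # their values
    return indexes, values

w = np.random.randn(10_000).astype(np.float32)
indexes, values = prune_and_compress(w, sparsity=0.9)
print(f"kept {values.size} of {w.size} weights")
```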

Optimized Loss Functions

Fourth, we’ve built groundbreaking new techniques for retraining models on labels predicted by another model. Techniques like Label Refinery greatly increase accuracy by optimizing loss functions over a distribution of all possible categories. With Label Refinery, we rely on another neural network model to produce the training labels. These labels have three properties: they are soft, informative, and dynamic.

Soft labels can describe multiple objects in an image and indicate what percentage of the image is represented by each object category. Informative labels provide a range of categories with associated confidences, so, for example, if something is mislabeled as a cat, you can see that the second-highest category is dog. Dynamic labels ensure that each random crop is paired with the correct label, because the labeling model is run dynamically on each crop sampled from the image.
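
A rough PyTorch-style sketch of the training loop this implies – the crop helper and model handling here are simplifying assumptions, not Xnor’s actual Label Refinery implementation:

```python
import torch
import torch.nn.functional as F

def random_crop(images, size=224):
    """One random crop per batch (illustrative; real pipelines use richer augmentation)."""
    _, _, h, w = images.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return images[:, :, top:top + size, left:left + size]

def label_refinery_step(student, teacher, images, optimizer):
    """One step in the spirit of Label Refinery: a frozen teacher network labels
    each random crop with a full probability distribution (soft, informative,
    dynamic), and the student is trained to match it with a KL-divergence loss."""
    crops = random_crop(images)
    with torch.no_grad():
        soft_labels = F.softmax(teacher(crops), dim=1)
    log_probs = F.log_softmax(student(crops), dim=1)
    loss = F.kl_div(log_probs, soft_labels, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```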

You can learn more about this technique here.

Hardware Optimization

Lastly, because we’re building models for all sorts of embedded devices, we need to optimize each model for different hardware platforms to provide the highest efficiency across a broad range of Intel and Arm CPUs, graphics processing units (GPUs), and FPGA devices. For example, we’ve partnered with Toradex and Ambarella to build person detection models that can be viewed here and here.
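
Conceptually, this step amounts to selecting the most efficient kernel implementation for the platform at hand; here is a toy sketch (the kernel names are purely illustrative, not Xnor’s actual tooling):

```python
import platform

def pick_conv_kernel():
    """Choose a convolution kernel variant for the host CPU, falling back
    to a portable implementation when no tuned path exists."""
    machine = platform.machine().lower()
    if machine.startswith(("arm", "aarch64")):
        return "neon_binary_conv"       # hand-tuned NEON path for Arm CPUs
    if machine in ("x86_64", "amd64"):
        return "avx2_binary_conv"       # vectorized path for Intel/AMD CPUs
    return "generic_binary_conv"        # portable fallback

print(pick_conv_kernel())
```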

Results

By Xnorizing our models, we’re able to achieve cutting-edge results. We have miniaturized models that are < 1MB in size and can run on the smallest devices. The models require fewer operations and deliver faster inference, higher frames per second, and low latency because they run on device. And they consume fewer joules per inference, which translates to lower power consumption.