More refined labels at training yield higher accuracy for on-device models

Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, Ali Farhadi

Advances in deep learning have opened the possibility for any camera to operate as a smart sensor capable of seeing and understanding a range of visual content. Deep neural networks now power the technology underlying most of our online interactions: social media, shopping, watching TV and movies, and managing our photos. What led to the rise of deep learning? One basic reason is the parallel advancement in cloud computing and the accessibility of GPUs to train and run such large neural networks.

What happens when we move AI onto edge devices?

This accessibility of GPUs is why the early adoption of AI was mainly for problems that could be solved in the cloud. But many real-world applications would benefit from AI running on the device itself, for example recognizing food in smart fridges at home, guiding drones in the sky, and improving the outcomes of search and rescue missions.

Smaller neural networks

The first approach that enabled deep neural networks to even run at the edge was to implement efficient network architectures that require radically less memory and compute. The three main approaches are:

Low-Precision Models

Part of what makes deep learning so computationally intense is its convolutional operations. A standard full-precision model relies on 32-bit floating-point convolutions, so for standard deep learning models with billions of operations, even a single inference requires a massive amount of compute. Through our work on XNOR-NET, we have shown that one way to reduce model size is to drive convolutions all the way down from 32-bit to 1-bit. This approach has demonstrated the ability to radically reduce the memory and compute of deep learning models. Furthermore, given the binary nature of the operations, the models can be easily embedded into hardware that relies on binary operations.
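To make the 1-bit idea concrete, here is a minimal sketch in plain Python. It is not the XNOR-NET implementation: it only illustrates the general trick of replacing a floating-point dot product with a sign vector, a single scale factor (assumed here to be the mean absolute value), and an XNOR followed by a bit count. The function names are hypothetical.

```python
def binarize(values):
    """Split a vector into signs (+1/-1) and one scale (mean absolute value)."""
    scale = sum(abs(v) for v in values) / len(values)
    signs = [1 if v >= 0 else -1 for v in values]
    return signs, scale


def pack_bits(signs):
    """Pack a +1/-1 vector into an integer bitmask (bit i set means +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # 1 wherever the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n                # agreements minus disagreements


# Illustrative values only.
weights = [0.4, -0.2, 0.9, -0.7]
inputs = [1.0, -0.5, 0.3, 0.8]

w_signs, w_scale = binarize(weights)
x_signs, x_scale = binarize(inputs)

approx = w_scale * x_scale * binary_dot(
    pack_bits(w_signs), pack_bits(x_signs), len(weights)
)
exact = sum(w * x for w, x in zip(weights, inputs))
print("binary approximation:", approx, "full precision:", exact)
```

On a vector this short the approximation is rough; the point is that the inner loop becomes bitwise operations on packed words rather than 32-bit multiplications.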

Sparse Models

Another way to reduce model size and computation is to use sparse parameters, where most parameters are set to zero and only a few remain non-zero. This means we can skip all computations involving the zero parameters and store only the indexes and values of the non-zero ones. We have developed LCNN (look-up based convolutional neural networks), which exploits the sparse structure of the model to improve the efficiency of the network.
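The storage and compute saving from sparsity can be sketched in a few lines of Python. This is a generic index-and-value representation, not LCNN's look-up dictionary; the helper names and the example numbers are made up for illustration.

```python
def to_sparse(weights, threshold=0.0):
    """Keep only (index, value) pairs whose magnitude exceeds the threshold."""
    return [(i, w) for i, w in enumerate(weights) if abs(w) > threshold]


def sparse_dot(sparse_weights, inputs):
    """Dot product that touches only the non-zero weights."""
    return sum(w * inputs[i] for i, w in sparse_weights)


# Illustrative values: 8 weights, only 2 of them non-zero.
dense_weights = [0.0, 0.0, 1.5, 0.0, -2.0, 0.0, 0.0, 0.0]
inputs = [3.0, 1.0, 2.0, 5.0, 0.5, 4.0, 7.0, 9.0]

sparse_weights = to_sparse(dense_weights)
dense_result = sum(w * x for w, x in zip(dense_weights, inputs))

print("stored entries:", len(sparse_weights), "of", len(dense_weights))
print("same result:", sparse_dot(sparse_weights, inputs) == dense_result)
```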

Compact Network Design

MobileNets, pioneered by Google, are an example of a streamlined architecture that relies on depth-wise separable convolutions. These models make it easy to trade off accuracy against latency, depending on the constraints of the model’s application.
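A rough way to see why depth-wise separable convolutions are cheaper is to count multiplications. The sketch below assumes stride 1 and "same" padding, so the output has the same spatial size as the input; the layer sizes are arbitrary examples, not figures from any particular MobileNet.

```python
def standard_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a k x k standard convolution on an h x w feature map."""
    return k * k * c_in * c_out * h * w


def separable_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in * h * w      # one k x k filter per input channel
    pointwise = c_in * c_out * h * w      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Example layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 feature map.
k, c_in, c_out, h, w = 3, 64, 128, 56, 56
ratio = separable_conv_mults(k, c_in, c_out, h, w) / standard_conv_mults(k, c_in, c_out, h, w)
print("separable / standard cost:", ratio)
```

The ratio works out to 1/c_out + 1/k², so for a 3 x 3 kernel the separable version needs roughly one eighth to one ninth of the multiplications.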

How can we improve accuracy for edge models?

Optimizing network architectures was a critical first step in getting deep learning models to run on edge devices. Now that the problem of how to run AI on the edge has been solved, we have been focused on new ways to improve performance. One approach that has been largely overlooked until now is data labeling, so we asked the question:

Are the current data labels for standard mainstream AI tasks good enough for training the models running in resource-constrained environments?

Training Data for Object Classification

To train a deep learning model to classify objects, we need to provide a set of images that have been “tagged” with those objects. Until now, the field of computer vision has relied heavily on standard image sets such as ImageNet. Because these labels work to a large extent, researchers have not questioned how well they support generalization from a training dataset to a test set.

Challenge 1: Incomplete label for one image

Sample image labeled ‘Persian cat’ in ImageNet’s training set

This image has the data label “Persian cat,” which means that we are training the model to learn that everything in this image should be classified as a Persian cat. As humans, it’s trivial for us to see that this image actually contains a Persian cat playing with a ball, where “ball” is an object category in this dataset but is not labeled in this image. We understand that the cat is a separate object from the ball, that the cat is significantly bigger than the ball, and that the cat is living and the ball is non-living. Despite all of this complexity in the image, with only one training label we are simply telling the network that this image = Persian cat. Hence, this labeling is incomplete.

Challenge 2: Inconsistencies from random cropping — cropping the wrong object

To prevent overfitting (where models perform poorly because of an inability to generalize to novel images), various training techniques have been introduced to keep models from memorizing the actual training images. One commonly used approach is to take a randomly sampled crop from different areas of the image, known as crop-level augmentation, where a crop can cover as little as 8% of the entire image. While this approach improves the model’s ability to generalize from training to test datasets, it introduces inconsistencies between the image label and the crop when there is more than one object in the image. For example, in the image below, a random crop may contain only pixels from the ball but still be labeled “Persian cat”.

example of random cropping augmentation
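The mismatch between crop and label can be sketched in Python. The sampler below is an assumption modeled on common practice: crop area between 8% and 100% of the image, aspect ratio between 3/4 and 4/3. The bounding box for the ball is a made-up example, not an ImageNet annotation.

```python
import math
import random


def sample_crop(width, height, rng, min_area=0.08, max_area=1.0, max_tries=10):
    """Sample a random crop (x, y, w, h) covering min_area..max_area of the image."""
    for _ in range(max_tries):
        area = rng.uniform(min_area, max_area) * width * height
        # Aspect ratio sampled log-uniformly in [3/4, 4/3] (an assumed range).
        ratio = math.exp(rng.uniform(math.log(3 / 4), math.log(4 / 3)))
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    return 0, 0, width, height  # fall back to the whole image


def overlap_fraction(crop, box):
    """Fraction of the crop's area that is covered by the box; both are (x, y, w, h)."""
    cx, cy, cw, ch = crop
    bx, by, bw, bh = box
    ix = max(0, min(cx + cw, bx + bw) - max(cx, bx))
    iy = max(0, min(cy + ch, by + bh) - max(cy, by))
    return ix * iy / (cw * ch)


rng = random.Random(0)
ball_box = (500, 380, 80, 80)   # hypothetical location of the ball
crop = sample_crop(640, 480, rng)
print("crop:", crop, "fraction of crop showing the ball:", overlap_fraction(crop, ball_box))
```

Whatever the overlap turns out to be, standard training still assigns the crop the image-level label, which is the inconsistency described above.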

Challenge 3: Inconsistencies from random cropping — similarities between crops

Another challenge of training models on image crops is that objects with different image-level labels can look indistinguishable to the model. For example, a patch of bread could look very similar to a patch of butternut squash. To maintain high levels of accuracy, it’s important for the model to be able to distinguish between these two food categories.

Example of random crop from bread (left) and butternut squash (right)

Challenge 4: Similar or very different mis-categorizations deliver the same penalty to the model

As humans, we have all looked at a little dog that could be a cat, or thought a big dog looks like a bear. Currently, the way that most models are designed is that any misclassification or mistake is penalized equally. The way current labels are determined does not provide any insight into related categories. For example, the model receives the same level of penalization if it mis-categorizes a cat as either a dog or as a mirror on a car.

Examples of the limitations from taxonomy dependency
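The equal-penalty effect follows directly from how cross-entropy behaves with a one-hot label: the loss depends only on the probability given to the labeled class. The short Python sketch below shows this, and contrasts it with a soft label. All probabilities are invented for illustration.

```python
import math


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Class order: cat, dog, car mirror. The image shows a cat.
one_hot_cat = [1.0, 0.0, 0.0]
predicts_dog = [0.2, 0.7, 0.1]      # wrong, but a plausible confusion
predicts_mirror = [0.2, 0.1, 0.7]   # wrong, and an implausible confusion

loss_dog = cross_entropy(one_hot_cat, predicts_dog)
loss_mirror = cross_entropy(one_hot_cat, predicts_mirror)
print("one-hot label:", loss_dog, loss_mirror)   # identical penalties

# A soft label that keeps some probability on the related class (hypothetical values).
soft_cat = [0.8, 0.19, 0.01]
soft_loss_dog = cross_entropy(soft_cat, predicts_dog)
soft_loss_mirror = cross_entropy(soft_cat, predicts_mirror)
print("soft label:", soft_loss_dog, soft_loss_mirror)
```

With the one-hot label both mistakes cost the same; with the soft label, confusing the cat with a car mirror is penalized more than confusing it with a dog.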

Large, high-capacity models are able to learn despite these inconsistencies in the training data. But when models are lower precision, sparser, or more compact, they cannot resolve these inconsistencies, and the accuracy of the model suffers.

How can we create training labels that deliver higher accuracies?

Given the challenges in data labeling identified above, we proposed a new iterative procedure to dynamically update ground truth labels using a visual model trained on the entire dataset. This new approach is called Label Refinery and relies on a neural network model to produce labels with the following properties that are consistent with the image content:

  • Soft
  • Informative
  • Dynamic

Soft labels are able to categorize multiple objects in an image and can determine what percentage of the image is represented by each object category. For the cat and ball example above, we are able to classify the image as 80% cat and 20% ball.

Informative labels provide a range of categories with the relevant confidence, so that if something is mislabeled as a cat, you can know that the second highest category is dog.

Dynamic labels ensure that each random crop is labeled with the object it actually contains, by running the model dynamically as you sample crops over the image.
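The three properties above can be sketched together in Python. This is a simplified illustration, not the released Label Refinery code: it assumes the refinery model's softmax output on each sampled crop is used as that crop's label, and that the model being trained minimizes the KL divergence to it. The logits are invented numbers and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


def refined_loss(refinery_logits, student_logits):
    """Loss for one crop: the label is the refinery model's output on that same crop."""
    soft_label = softmax(refinery_logits)   # soft and informative
    return kl_divergence(soft_label, softmax(student_logits))


# Class order: Persian cat, ball, dog. Logits are for a crop that mostly shows the ball.
refinery_logits = [0.5, 3.0, -1.0]   # computed per crop, hence dynamic
student_logits = [2.0, 0.5, 0.0]     # the model being trained still says "cat"

print("label for this crop:", softmax(refinery_logits))
print("loss:", refined_loss(refinery_logits, student_logits))
```

Because the label is recomputed for every crop, a crop that shows only the ball is no longer forced to carry the image-level label “Persian cat”.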

How does label refinery impact accuracy?

We evaluated Label Refinery on the standard ImageNet ILSVRC2012 classification challenge using a variety of model architectures.

The figure below starts from the original ImageNet labels and shows how the model refines the labels over time. The graphs include the line of “perfect generalization,” where a model generalizes perfectly from training data to test data. Our results show that with progressively more automatic label refining, model performance moves closer and closer to perfect generalization. This trend is reflected in the accuracy table below.

Figure taken from Label Refinery paper

Table taken from Label Refinery paper

Why is Label Refinery critical for boosting accuracy at the edge?

The results from Label Refinery show some interesting findings. We found that large models with a small generalization gap (the difference between training accuracy and test accuracy) are better able to handle situations where data labels are imprecise, so running Label Refinery has less impact on these models. The biggest boost in performance comes on models that have been compressed to fit small memory and compute budgets, or on models with a large generalization gap. This is why Label Refinery is critical for boosting accuracy on edge models running in resource-constrained environments.

With this approach, combined with Xnorized network architectures, we can now create models that are small, power efficient, low latency, and have a high degree of accuracy that can power smart devices as small as a doorbell and as mobile as a drone.

Source code is available on GitHub for Label Refinery —

Paper on arXiv

Source code is available on GitHub for XNOR-NET —

Learn how to deploy Edge AI models at our developer workshop

At the Edge AI Summit in San Francisco on December 11, we will show how we’re using the Label Refinery and multiple other algorithm and model optimizations to create real-time AI solutions on hardware as small as a Raspberry Pi Zero.