I’ve loved cars since I was a little boy. From classic cars to custom hot rods, I loved them all, but I was especially fascinated by the futuristic vehicles featured on TV. Depending on which generation you identify with, you might remember KITT from Knight Rider, the Batmobile, or the nameless DeLorean from Back to the Future. Not only were these cars fast, they could think, talk and sometimes even see.

AI has given us the first generation of autonomous cars — and it’s pretty impressive. But a host of next-generation AI-enhanced features go even further in providing convenience and ensuring passenger safety.

Auto-evolution: AI at the edge for cars

Xnor is focused on bringing computer vision to edge devices, so our technology is particularly valuable for automobiles and commercial vehicles. Every AI capability we offer – whether it involves person, object or face recognition – delivers a degree of speed and accuracy that, until recently, was only possible using a high-end processor augmented by a neural accelerator. We take that same level of performance, improve upon it, and make it available on an edge device, such as a 1 GHz ARM processor or a simple onboard computer.

Check out this demo of our computer vision technology:

Object detection capabilities

Crime prevention

For car sharing companies or taxis, the system can enforce security regulations by recognizing when passengers hold weapons or other objects that present a safety hazard.

Loss prevention

Using object detection, the system can remind a passenger to retrieve the phone or purse they left on the seat. Transportation and logistics companies could receive an alert if a package was not delivered at the end of a route.

Face recognition capabilities

Here are a few of the capabilities that can be incorporated into a line of vehicles using Xnor’s face recognition or action detection models.

Secure access

Using face recognition, a driver can be authenticated even before they enter a vehicle. The door could automatically open for people recognized by the car, making hands-free entry possible. Our technology would even allow the car to differentiate between children and adults. Commercial vehicles could use that information to control access to certain areas by authorizing drivers.

Because all of this is done on-device, the data doesn’t need to be transmitted to the cloud, making the feature significantly more secure and practical.


Once a driver or passenger is authenticated, the car could adjust settings to align with personal preferences, such as the position of the seat and steering column, interior temperature and infotainment system settings.

Driver awareness

ML-powered driver monitoring can tell when a driver is looking at a phone, instead of the road ahead. And if the driver becomes drowsy and their eyelids start to close, the system will know that too.

Emergency response

In the event of a crash or another emergency, the system can generate a passenger list, and notify someone if the driver does not respond to an audible alarm.

Passenger safety

Action detection models can be trained to detect specific gestures like fastening a seatbelt to ensure that everyone is buckled in.

Person and pet detection models can identify if a pet is left inside a car (a potentially dangerous situation on a hot day) or if an infant or small child is left behind, and then sound an alarm to notify the driver.

AI at the edge drives automotive innovation

Without recent advances in deep learning for computer vision, many of these features would be too difficult or expensive to implement.

Xnor’s AI technology is unique in that it delivers state-of-the-art performance on a commodity processor, using only the bare minimum of energy and memory.

Even with a simple onboard computer, Xnor models execute at up to 10x faster than conventional solutions – while using up to 15x less memory and 30x less energy.

Taken together, all these capabilities make it both practical and profitable for automobile manufacturers to incorporate high-performance computer vision into a variety of applications for the commercial and consumer vehicle markets.

At Xnor, we’re fascinated by the creative and powerful ways our customers are working to incorporate machine learning into their line of cars and commercial vehicles. It’s not as cool as owning one of the super-smart, fast-talking exotic cars that my TV heroes used to drive, but it comes pretty close.

Read more about how you can incorporate the latest in computer vision into your line of vehicles.

Search for the term “the future of retailing” and you’ll see plenty of stories about physical retailers being marginalized by their dot-com counterparts. Some would say that physical stores are fading from the retail landscape. Quaint, but doomed. To understand why, consider the shopping experiences offered by each channel.

Online vs. Offline

For example, while checking the number of followers in their Instagram account, your future customer sees an image of their favorite celeb wearing shoes that they simply must have. Other distractions intervene, but after seeing several banner ads they finally click, swipe or tap their way to an online store. Thanks to cookies and ad tracking, the site already knows a great deal about the customer, from their purchase history down to their shoe size. The customer browses for products, reads reviews and compares items. With each click, the store knows a little bit more.

As the customer moves through the site, the convenience, selection and price advantage of shopping online becomes obvious. When they make a purchase, the customer can be rewarded for their loyalty with a coupon code, and the inventory system knows which item to reorder.

On the other hand, a retail store doesn’t know who you are the moment you walk in the door. They don’t know if you’ve bought from them – or from any of their competitors – before. They have no idea what color you like, or what shoe size you wear. Traditional retailers rely heavily on in-store displays or staff to guide customers through the store.

Now replay that scenario – but with one difference. This time it’s a physical store equipped with the latest generation in AI. Small cameras placed throughout the store use computer vision to provide an advanced level of retail analytics, possibly even better than what is available to online stores, while also creating a better experience for shoppers.

The Customer Journey in an AI-enabled Store

In this new scenario, a face recognition algorithm identifies customers and their demographics as they walk through the front door. Maybe this individual is a regular shopper and a member of your loyalty program. Based on their purchase history, you can send them a notification while they are in your store about new offerings that may be enticing to them.

As they move through the aisles, multiple cameras recognize that customer as the same person and track them throughout the store. Do the endcap displays attract their attention? Where do they stop and spend time? Does the location of a preferred product impact what else they buy nearby? Once your customers are at the check-out counter, payment can be as simple as a quick scan of their face.

On a larger scale, this data can be used to develop in-depth, real-time heatmaps without having to lift a finger. The information can also be bolstered with other AI capabilities such as emotion detection and action recognition in order to build highly detailed customer insights. Your customers and their paths through the store are now actionable data for your business, opening up a vast number of opportunities.
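The heatmap idea above is simple to prototype. As a minimal sketch (not Xnor’s production analytics pipeline), tracked (x, y) floor positions reported by the cameras can be binned into a coarse grid of dwell counts:

```python
from collections import Counter

def build_heatmap(positions, cell_size=50):
    """Bin tracked (x, y) floor positions into grid cells.

    positions: iterable of (x, y) coordinates (e.g. centimeters).
    Returns a Counter mapping (col, row) grid cells to visit counts.
    """
    return Counter((x // cell_size, y // cell_size) for x, y in positions)

# Hypothetical positions reported by in-store cameras:
track = [(10, 10), (20, 30), (120, 40), (130, 45)]
heatmap = build_heatmap(track)
# Two visits near the entrance cell, two by the endcap cell.
```

The same counts, accumulated over a day, answer the dwell-time and endcap questions above without any manual tallying.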

Security and Store Operations

The analytics you collect on the floor will impact your customers and their experiences, but there’s a slew of potential opportunities behind the scenes that can streamline operations for your business.

Surveillance and access control are important in-store functions for avoiding crime and unauthorized activity. Using Xnor’s AI capabilities, security can be enhanced with features like weapon or dangerous action detection. Secure areas can be better controlled with computer vision solutions like face recognition and person detection to make sure only the right people have access to restricted areas.

Another particularly valuable function is inventory management. Knowing when items are out of stock on the shelves helps to restock more efficiently. Creating efficient, real-time solutions for monitoring items also helps to keep vendors up-to-date on their products within your store as well as how they are performing. This can also be tied to traffic patterns so you can understand how often people are interacting with different products.

Gaining a competitive advantage

Many see the future of retail as being fully automated, but that shift won’t happen overnight. Retailers are beginning to introduce these capabilities piece by piece in order to stay ahead without having to completely overhaul operations. By incorporating AI solutions developed by Xnor, your store will avoid the headaches of conventional AI solutions. Xnor models can run on commodity devices, so you don’t need to upgrade your cameras or pay for expensive cloud-computing services (which are less secure). Running on-device also reduces latency and power consumption so your solutions will pick up that power-walker even on a battery-powered camera that you can place anywhere.

With Xnor’s computer vision models, physical stores can have the retail analytics they need to compete with their online counterparts – and help a loyal customer to find the perfect pair of shoes.

Visit Xnor to learn how the next generation in AI can help your retail store compete.

Mention Smart Appliance, and most people think of using a smartphone to turn on house lights as they pull in the driveway, arm security systems, control thermostats, or check if Amazon left a package on the front porch. Initially, that level of functionality was impressive. But so far, the value associated with Smart Appliances has been centered around heightened security and managing your home from a remote location.

It’s time Smart Appliances got an upgrade.

Smart Appliances V1

The first iterations of Smart Appliances were hampered by technical limitations. In some cases, the only smart things about the earliest versions were touch screen interfaces, Bluetooth connectivity and the option to use a mobile device to control the appliance. Advanced features like food detection, if they were present at all, were constrained by the limitations inherent in AI technology at that time.

One of those limitations was the processing power needed to run an AI application. AI apps that could recognize and identify specific varieties of food required a robust processor with a neural or GPU accelerator, as well as an ample power source. Incorporating a power-hungry processor into the design of an energy-efficient appliance wasn’t practical. Such an app also required a persistent, high-bandwidth connection to the cloud, and the resulting latency could delay system response to user input and create a poor customer experience.

Aside from the onerous compute requirements, food detection models were still in their infancy. They were often inconsistent, and it was difficult to train them to identify new items.

The new generation of food identification technology promises to break through those barriers. With highly efficient algorithms, AI apps can be run on a small embedded device inside the appliance, without a persistent, high-bandwidth, internet connection.

Here are a few ways AI on the Edge can make a Smart refrigerator a little smarter:

  • Add items to a shopping list when they need to be replenished
  • Suggest a recipe based on the items you already have in your refrigerator
  • Make grocery shopping trips faster and better informed
  • Make recommendations for how best to store certain produce
  • Provide cooking tips for certain foods
  • Detect when there’s a spill inside

With this kind of upgrade, homeowners can use the new generation of Smart Appliances to reduce their monthly grocery bill, reduce waste, and save time at the grocery store.

Compact, efficient algorithms are the brains behind smart appliances

With Xnor’s efficient, on-device computer vision models, smart appliances are now becoming a reality. Xnor’s food identification models offer appliance manufacturers some specific advantages over conventional AI solutions:

Improved performance

The new generation of food identification technology brings AI to edge devices, so there’s no need for internet connectivity. When Smart Appliances aren’t tethered to an internet connection, they are more responsive. Plus, there’s no risk of downtime due to a network or service outage. That translates into a better experience for consumers.

Improved accuracy

Even an item as ubiquitous as a Granny Smith apple comes in a variety of shades, sizes, and shapes. Our highly efficient training models deliver substantially higher accuracy, making it possible to visually identify food items in less than ideal lighting conditions, even if they are partially obscured.

Reduced energy use

Keeping energy consumption to a minimum is a top priority for appliance manufacturers. Xnor’s food detection models have been shown to be up to 30x more energy efficient than conventional AI technology.

Lower costs

Without the need for fast, power-hungry processors, the cost of introducing these features comes way down. Combined with low energy use and internet-free, on-device computing, it’s now possible to incorporate advanced food detection capabilities into a range of products at multiple price points.


There’s a multitude of tasks involved in preparing a meal. By going beyond preserving and cooking food, refrigerators will begin to behave less like an appliance and more like a virtual sous-chef. As a company that’s invested a significant amount of research in this area, we’d like to say, “Bon appétit!”

Visit us to learn how the next generation in food detection technology can boost the performance of your Smart Appliance.


2010 was a milestone year for face recognition. That’s when Facebook introduced a photo tagging feature with the ability to identify individuals in a photograph by matching faces to the pictures stored in a user’s profile. The feature was popular but frequently inaccurate. Getting the best results required the people in the photograph to look directly into the lens. Accuracy was also dependent on the quality of the user’s Facebook profile picture and other photos they were tagged in. Blurs caused by camera motion, reflective surfaces and light levels all had a negative impact on performance. But it was a start.

Flash forward nine years. Face recognition has been adopted by several industries, most notably in the areas of law enforcement and home / commercial security. Biometric measures such as retinal scans and voice analysis are also useful in security applications, but face identification is still the preferred method.

Other biometric measures require users to physically interact with a device or to voluntarily position themselves next to a sensor. Think of pressing your palm against a reader, speaking directly into a microphone, or staring, unblinking, into a lens while a computer scans your retina. Measurements like these are impractical when it comes to identifying one individual in a large group of people moving through an airport.

Despite the inherent advantages of face recognition, the technology is still in its infancy. Here are four areas where the standard approach has failed to live up to its potential.

The limitations of standard face recognition technology

1) Low accuracy

Camera angles have a strong influence on how successfully a face can be detected and identified. Most existing models need to compare multiple angles, including profiles and full-frontal views, to achieve the best results. Facial hair, makeup, scarves, and hats can cause trouble. For the best results, a subject must hold still, remove their eyeglasses and look into the lens, or a number of photos have to be taken from different angles. This makes training for face recognition extremely difficult.

2) Compute requirements

Whether it’s analyzing images at inference time or training a new model, traditional recognition algorithms need to run on a robust processor with a neural or GPU accelerator – and they need a persistent, high-bandwidth connection to the cloud. In fact, during training, most face recognition algorithms require multiple photos from thousands of people. Once the parent model is trained, it still has to be pushed to the cloud or run on expensive hardware to work for your specific face. This causes latency and security issues and delivers a poor user experience.

3) Inflexible deployment options

Standard technology requires developers to accommodate the need for fast processors and access to cloud-hosted servers. That rules out deploying face apps in remote areas and on cheap devices. This limits the applications for face identification and forces developers using computer vision apps to make compromises on user experience, responsiveness, accuracy, and data security.

4) High cost

Unsurprisingly, incorporating face recognition capabilities into an existing app often requires a hardware upgrade.

Self-contained deep learning models

At Xnor, we realized that eliminating these restrictions required a completely new approach, so we started at the beginning: the learning models. Our computer vision technology is trained to operate in a range of environmental conditions. The resulting models can accurately analyze faces in live video streams at more than 30 FPS on GPU-enabled hardware and at 4 FPS on resource-constrained hardware, such as a CPU-only device, regardless of changing lighting conditions, movement or camera angles.

In real life, people don’t stare directly at a lens, holding still while an algorithm does its work. People are in motion. Expressions can change several times in the time it takes you to read this paragraph. Faces can be partially obscured by eyeglasses, a scarf, a hat, makeup or even earrings. Our deep learning models ensure accuracy regardless of the subject’s skin tone or fashion sensibilities.

Even better, the training for the individual face can happen completely on-device, with as few as three images. This means you don’t need to take hundreds or thousands of photos of a face or use a large number of frames from a video.  This makes our solution completely edge-enabled. There’s no need to rely on a cloud solution or risk downtime with network and service outages, and most importantly, it makes face identification possible for cheap hardware.
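Xnor hasn’t published the internals of its face models, but the matching step in most face identification pipelines works the same way: an embedding of the query face is compared against the embeddings enrolled on-device. A minimal sketch, using made-up low-dimensional embeddings rather than the output of a real face model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def identify(query, enrolled, threshold=0.6):
    """Return the enrolled identity that best matches the query
    embedding, or None if no match clears the threshold."""
    best_name, best_score = None, threshold
    for name, embedding in enrolled.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical embeddings produced on-device during enrollment:
enrolled = {"driver": [0.9, 0.1, 0.4], "passenger": [0.1, 0.8, 0.2]}
match = identify([0.88, 0.15, 0.38], enrolled)  # matches "driver"
```

Because both the enrolled embeddings and the comparison live on the device, no face data ever needs to leave it.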

Speed and reliability

Xnor’s apps can detect and identify individual faces in real-time, on-device (at up to 5 frames per second), utilizing a commodity camera, or on embedded hardware running on a processor as small as 1 GHz. In fact, we’re currently running face recognition on an Ambarella S5L commodity chip. Without the need for an internet connection, the real applications for these ML algorithms are enormous. It’s now possible to use advanced face identification features in remote locations, or in situations where maximizing uptime is essential.


Our face recognition algorithms and training models can be run completely on-device, using a low-end processor. Personal information is stored on the device, not transmitted to the cloud for processing, where it can become vulnerable to security breaches. Taken together, these capabilities allow developers to build face identification apps that not only offer increased performance, they go farther in protecting sensitive data.

A new approach yields new capabilities

In addition to enhancing performance, Xnor’s technology allows developers to integrate new capabilities into their applications, such as the ability to determine the subject’s age or gender, which direction they are looking, and whether the subject is happy, angry, scared, sad or surprised. This new technology will create new opportunities for developers to use face recognition in more powerful ways, in more scenarios, and, most importantly, on more devices.

Visit us to learn how to incorporate the next generation of face recognition into a broad range of applications.

Machine vision has long been the holy grail for unlocking a number of real-world use cases for AI – think of home automation and security, autonomous vehicles, crop-picking robots, retail analytics, delivery drones or real-time threat detection. However, until recently, AI models for computer vision have been constrained to expensive, sophisticated hardware that often contains neural accelerators, or the models had to be processed in the cloud on GPU- or TPU-enabled servers.

Through Xnor’s groundbreaking research, in coordination with the Allen Institute for AI, on YOLO, Xnor-Net, YOLO9000, Tiny YOLO and other AI models, we’ve been able to move machine learning from expensive hardware and the cloud to simple, resource-constrained devices that can operate completely on-device and autonomously. This means you can run sophisticated deep learning models on embedded devices without the need for neural processors and without a data connection to the cloud. For example, on a 1.4 GHz dual-core ARM chip with no GPU, we can run object detection with CPU utilization of only 55%, a memory footprint of only 20 MB, and power consumption of less than 4.7 W.

Object Detection

Let’s dig into one specific model that we’ve built – object detection. Object detection is a type of AI model that identifies categories of objects that are present in images or videos – think people, automobiles, animals, packages, signs, lights, etc. – and then localizes their presence by drawing a bounding box around them. Utilizing a CNN (convolutional neural network), the model is able to simultaneously draw multiple bounding boxes and then predict classification probabilities for those boxes based on a trained model.
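Because the network proposes many overlapping boxes for the same object, detectors typically finish with a post-processing step called non-maximum suppression, which keeps only the highest-scoring box in each cluster of overlaps. This is a standard step in detection pipelines generally (not something specific to Xnor’s models), and can be sketched as:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    detections: list of (box, score) tuples.
    Keeps the highest-scoring box from each cluster of overlapping boxes.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k) < iou_threshold for k, _ in kept):
            kept.append((box, score))
    return kept
```

In production pipelines this is usually run per object class, so a person box never suppresses a nearby package box.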

Traditionally, these models have been resource-intensive because of the model architecture – the number of layers (convolution, pooling and fully connected) – and the fact that most CNNs use 32-bit precision floating-point operations.

Xnor’s approach is different and we’ve summarized this approach below.

Xnorization (How It Works)

Our models are optimized to run more efficiently and up to 10x faster through a process we call Xnorization. This process contains five essential steps. First, we binarize the AI model. Second, we design a compact model. Third, we prune the model. Fourth, we optimize the loss function. Fifth, we optimize the model for the specific piece of hardware.

Let’s explore each of these in further detail.

Model Binarization

To reduce the compute required to run our object detection models, the first step is to retrain these models into a binary neural network called Xnor-Net. In Binary-Weight-Networks, the filters are approximated with binary values. This produces results that are 58x faster for convolutional operations and a memory savings of up to 32x. Furthermore, these binary networks are simple, accurate, and efficient. In fact, the classification accuracy with a Binary-Weight-Network version of AlexNet is only 2.9% less than the full-precision AlexNet (in top-1 measure).

In full XNOR-Networks, we go a step further: both the filters and the inputs to convolutional layers are binarized, and the convolutions are approximated using primarily binary operations. Finally, the operations are parallelized on CPUs and optimized to reduce model size. This lets us replace expensive floating-point operations with cheap binary operations, making inference hyper-efficient. Once completed, we have state-of-the-art accuracy for models that:

  • Are 10x faster
  • Can be 20-200x more power efficient
  • Need 8-15x less memory than traditional deep learning models
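The weight binarization step can be illustrated with the approximation used in Binary-Weight-Networks: each real-valued filter W is replaced by α · sign(W), where α is the mean absolute value of W. A toy sketch of that idea (real implementations pack the signs into bits and use XNOR/popcount instructions rather than Python lists):

```python
def binarize(weights):
    """Approximate real-valued weights as alpha * sign(w),
    where alpha is the mean absolute value of the weights."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def binary_dot(alpha, signs, x):
    """Dot product using only sign flips plus one final scaling,
    standing in for the cheap bitwise ops used on real hardware."""
    return alpha * sum(s * xi for s, xi in zip(signs, x))

weights = [0.5, -1.0, 0.25, -0.25]
alpha, signs = binarize(weights)  # alpha = 0.5, signs = [1, -1, 1, -1]
```

Each weight now needs one bit instead of 32, which is where the memory savings above come from.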

Compact Model Design

The second critical piece is to design models that are compact. Without compact model design, the compute required for the model remains high. Our Xnorized models utilize a compact design to reduce the number of required operations and model size. We design as few layers and parameters into the model as possible. The model design is dependent on the hardware, but we take the same fundamental approach for each model.
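To make the savings from compact design concrete, consider one widely used technique (popularized by MobileNet, and shown here as an illustration rather than Xnor’s exact recipe): replacing a standard convolution with a depthwise-separable one. Comparing parameter counts:

```python
def standard_conv_params(c_in, c_out, k):
    """Parameters in a standard k x k convolution layer (ignoring bias)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k filters (one per input channel)
    plus a 1 x 1 pointwise convolution that mixes channels."""
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 128, 3
print(standard_conv_params(c_in, c_out, k))   # 73728
print(separable_conv_params(c_in, c_out, k))  # 8768 -- roughly 8x smaller
```

Stacking layers like this throughout the network is how the total operation count and model size stay low enough for a 1 GHz CPU.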

Sparse Model Design

Third, a variety of techniques are used to prune the model’s operations and parameters. This reduces the model size and minimizes the operations necessary to provide accurate results. During pruning, most of the parameters are driven to zero, leaving only a small number of non-zero parameters. We can then skip all computations involving the zero parameters and store only the indexes and values of the non-zero ones.
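The payoff of pruning can be sketched in a few lines: keep only the indexes and values of the non-zero parameters, and skip every multiply-accumulate involving a zero. This is a simplified picture of sparse inference, not Xnor’s actual kernel:

```python
def to_sparse(params, eps=1e-8):
    """Keep only (index, value) pairs for non-zero parameters."""
    return [(i, v) for i, v in enumerate(params) if abs(v) > eps]

def sparse_dot(sparse_params, x):
    """Dot product that skips all zero-valued parameters."""
    return sum(v * x[i] for i, v in sparse_params)

pruned = [0.0, 0.5, 0.0, 0.0, -0.25, 0.0]
compressed = to_sparse(pruned)  # [(1, 0.5), (4, -0.25)]
```

Here six stored values shrink to two, and a six-multiply dot product shrinks to two multiplies.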

Optimized Loss Functions

Fourth, we’ve built groundbreaking new techniques for retraining models on their own predicted model. Techniques like Label Refinery greatly increase accuracy by optimizing loss functions for a distribution of all possible categories. With Label Refinery, we actually rely on another neural network model to produce labels. These labels contain the following properties: 1) Soft; 2) Informative; and 3) Dynamic.

Soft labels are able to categorize multiple objects in an image and can determine what percentage of the image is represented by what object category. Informative labels provide a range of categories with the relevant confidence, so, for example, if something is mislabeled as a cat, you can know that the second highest category is dog. Dynamic labels allow you to ensure that the random crop is labeling the correct object in the image by running the model dynamically as you sample over the image.

You can learn more about this technique here.
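At its core, training against soft labels means computing the cross-entropy against the teacher network’s full probability distribution rather than a single one-hot label. A minimal sketch of that loss (simplified from the full Label Refinery procedure):

```python
import math

def soft_cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy between the teacher's soft label distribution
    and the student's predicted distribution (lower is better)."""
    return -sum(t * math.log(s + eps)
                for t, s in zip(teacher_probs, student_probs))

# Teacher says 70% cat, 20% dog, 10% fox; an unsure student pays more:
teacher = [0.7, 0.2, 0.1]
confident = [0.68, 0.22, 0.10]
unsure = [0.34, 0.33, 0.33]
assert soft_cross_entropy(teacher, confident) < soft_cross_entropy(teacher, unsure)
```

Because the target carries the whole distribution, the student also learns the "second-best" categories described above, not just the top label.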

Hardware Optimization

Lastly, because we’re building models for all sorts of embedded devices, we need to optimize the model for different hardware platforms to provide the highest efficiency across a broad range of Intel and Arm CPUs, graphics processing units (GPUs), and FPGA devices. For example, we’ve partnered with Toradex and Ambarella to build person detection models that can be viewed here and here.


By Xnorizing our models, we’re able to achieve cutting-edge results. We have miniaturized models that are < 1 MB in size and can run on the smallest devices. The models have fewer operations, faster inference, higher frame rates, and lower latency because they run on-device. And because each inference consumes fewer joules, power consumption is lower.

Much of the convenience and security that Smart Homes promised has yet to become a reality. To understand why, consider that the technology behind a Smart Home historically required significant CPU power combined with a GPU or an accelerator chip to provide capabilities like object detection and face identification. To keep prices affordable, today’s solutions leave these advanced features out.

Now the newest generation of AI tech will allow software engineers to get past those barriers. We refer to it as AI at the Edge. Not only does it drive costs down, it enables a whole new suite of enhanced object detection and face identification capabilities, making it possible to deliver a wide range of new products and services for Smart Homes.

Imagine a smarter home with computer vision AI.

A day in the life of a Smart Home

Consider the impact this could have in the day in the life of a future Smart Home dweller. We’ll call her Amy.

7:15 am

As Amy pulls out of her driveway, she’s confident that her security system will keep her home secure while she’s at work. When her husband leaves a little later, there’s just one other member of the family still at home: the family dog. Mr. Wiggles would do anything for his family, but as a ten-pound chihuahua, he isn’t much of a help in protecting their home.

The home’s security system recognizes Mr. Wiggles as a pet, so he doesn’t accidentally set off the motion detectors as he roams from room to room. Multiple cameras track him as he wanders about the yard, but there’s no danger of Mr. Wiggles triggering a false alarm.

Later that afternoon, when someone approaches the front porch, the home uses facial recognition to determine if the individual is an authorized or unidentified person and monitors their movement. If they are lingering, the system can send Amy a notification or even engage an alarm system.

If they leave a package on the front porch, the system will recognize that there was an item left and notify Amy that there is a package waiting for her on her porch.

3:25 pm

When Amy’s son comes home from middle school, a smart doorbell recognizes his face. To open the door, all he has to do is tell the smart doorbell to unlock it. Amy receives a notification that her son has arrived home and entered the house safely. Her son makes a beeline to the refrigerator and grabs a snack. An AI-enabled camera recognizes that the last hot pocket is gone and adds it to the virtual grocery list.

6:12 pm

As Amy pulls in the driveway, a camera recognizes the car and the license plate and opens the garage door. Amy’s arms are full of packages, but there’s no need for a key to get in. She simply tells the smart doorbell to open the door. The system confirms her identity via facial and voice recognition, deactivates the alarm, opens the door, turns on the lights, and adjusts the thermostat to her desired indoor temperature.

The AI that delivers on the promise of a Smart Home

Consider the Smart Home features highlighted in this story:

  • Being able to tell the difference between a family member, a stranger, and Mr. Wiggles
  • Locking or unlocking doors based on recognizing specific people
  • Sending an alert when an unidentified person is spending time around the house
  • Following objects across multiple cameras to track a subject moving from room to room
  • Identifying hundreds of inanimate objects including various types of food, vehicles and packages

All the capabilities featured in this story would have been difficult if not impossible to achieve without a new approach to AI.

Xnor’s combination of optimized pre-trained learning models and tuned algorithms give solution providers the power to deliver the functionality that makes Smart Homes smart. Visit us to learn more.

Andrew, one of Xnor.ai’s engineers, showing a demo of Image Segmentation running off a webcam video feed, using 60 MB of memory and just the CPU – no GPU necessary.

With so many meetings involving participants from multiple locations, it’s no surprise that video conferencing has quickly become an essential collaboration tool. Best-in-class solutions allow users to share screens, access other desktops, chat, exchange files, and communicate via digital whiteboards. When done right, these capabilities add up to more than the long-distance equivalent of a face-to-face meeting. They provide a platform for a participatory experience that can break down corporate silos and boost productivity.

However, traditional video conferencing is plagued with a long list of vexing issues. A cluttered office or background distractions can draw a viewer’s attention away from the speaker. Poor image quality can detract from the content being presented. Frustrated with these technical and experiential imperfections, participants often use the time to catch up on their email and lose focus on the meeting.

Introducing AI-powered image segmentation

Image segmentation improves video by identifying the boundaries of people and objects, and isolating those pixels to enhance the focus or brightness separately from the rest of the image. It’s a technique that’s been around for years, but until now, two factors have delayed wide adoption.
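As a rough illustration of the pixel isolation described above (a toy sketch, not Xnor’s implementation), here is how a per-pixel person mask can be used to dim everything except the segmented subject in a tiny grayscale frame:

```python
def dim_background(frame, mask, factor=0.3):
    """Keep pixels where mask is 1 (the segmented person);
    dim all background pixels by `factor`.

    frame: 2D list of grayscale intensities; mask: 2D list of 0/1.
    """
    return [
        [px if keep else px * factor for px, keep in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]

frame = [[200, 200], [200, 200]]
mask = [[1, 0], [0, 1]]  # person occupies the diagonal in this toy frame
result = dim_background(frame, mask)  # [[200, 60.0], [60.0, 200]]
```

Real pipelines apply the same idea per color channel on full camera frames, swapping in blur or replacement instead of dimming.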

First, traditional image segmentation involves billions of floating-point operations, which demand significant computing power: a fast processor augmented with a GPU or a neural accelerator chip. Second, a lack of good training data and models makes it time-consuming to achieve smooth output. And when you do have enough data, training a model successfully requires running on expensive cloud resources. Often, only a large company can afford to invest the time and resources necessary to build image segmentation into their products. Xnor’s segmentation technology overcomes these blockers to give video conference providers the precise control needed to deliver a world-class video conferencing experience. Here’s what makes our image segmentation technology so revolutionary:

Flexible deployment options

Xnor can perform real-time image segmentation on embedded devices running on a 1 GHz Arm processor. For complex AI tasks, Xnor can also take advantage of GPUs, accelerators and neural processors running on servers or in the cloud.

A revolutionary learning model

Xnor image segmentation partitions video frames into distinct regions, each containing an instance of an object. The object may be a person, vehicle, animal, or any one of hundreds of other objects. The attributes for each type of object are derived using an image-based training model. Xnor’s technology uses optimized pre-trained models and tuned algorithms to achieve substantially higher performance and accuracy than other models. Our core neural network model is the fastest and most accurate in the industry. Together, these deep learning models and revolutionary algorithms enable AI tasks to execute in real-time on streaming video, on form factors as small as mobile handsets.

Low processor requirements

Traditional object detection and segmentation require an application to perform billions of floating-point operations. Xnor’s AI processing technology can execute up to 9x faster than other computer vision solutions by utilizing performance breakthroughs our researchers have discovered, such as YOLO object detection and XNOR-Net image classification. That kind of performance delivers an enhanced user experience on a wide variety of devices, including webcams, mobile phones, and even dedicated conferencing hardware running commodity processors.

AI image segmentation introduces new video conferencing capabilities

Xnor’s technology provides video conference providers with a new set of tools to enhance video conferencing, including:

Scene Optimization

Improve video quality by dynamically adjusting the exposure, brightness, contrast, and sharpness of different portions of the image.
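The core operation is simple once a segmentation mask is available. As an illustrative sketch (not Xnor’s implementation), here is how a boolean subject mask can drive per-region brightness adjustment with NumPy:

```python
import numpy as np

def enhance_subject(frame, mask, gain=1.3):
    """Brighten only the masked (subject) pixels; the background is untouched.

    frame: HxWx3 float array in [0, 1]; mask: HxW boolean array.
    """
    out = frame.copy()
    out[mask] = np.clip(out[mask] * gain, 0.0, 1.0)
    return out

# Toy 2x2 frame where the left column is the "subject"
frame = np.full((2, 2, 3), 0.5)
mask = np.array([[True, False], [True, False]])
result = enhance_subject(frame, mask)
print(result[0, 0, 0], result[0, 1, 0])  # subject pixel brightened, background unchanged
```

The same masked indexing works for exposure, contrast, or sharpness: compute the adjustment over the whole frame, then apply it only where the mask selects.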

Background Blur and Replacement

A successful video conference has to hold the viewer’s attention, but distractions can make that difficult. You may want to encourage users to focus on the speaker, or perhaps a speaker has recorded a presentation in their office – and the whiteboard behind them contains sensitive information.

With Xnor’s real-time image segmentation you can dynamically isolate people and objects in a live video, then superimpose them anywhere in 2D, VR, or even augmented reality environments.
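Background blur and replacement both reduce to the same masked compositing step. A minimal NumPy sketch, assuming a boolean subject mask has already been produced by a segmentation model:

```python
import numpy as np

def replace_background(frame, mask, background):
    """Composite the masked subject onto a new background image."""
    return np.where(mask[..., None], frame, background)

def blur_background(frame, mask, k=5):
    """Box-blur the whole frame, then restore the sharp subject pixels."""
    pad = k // 2
    padded = np.pad(frame, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros_like(frame)
    for dy in range(k):          # naive k x k box filter, fine for a sketch
        for dx in range(k):
            blurred += padded[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
    blurred /= k * k
    return np.where(mask[..., None], frame, blurred)

subject = np.ones((4, 4, 3))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True            # the "person" occupies the center of the frame
composite = replace_background(subject, mask, np.zeros((4, 4, 3)))
```

A production pipeline would use a proper Gaussian blur and soft mask edges, but the structure is the same: one mask, one `where`.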

See it for yourself

See how easy it can be to transform ordinary video into an experience that will engage your viewers from the first frame to the last. Visit us to learn more.

Imagine being able to create more focus in your video conference, or transport users to a different world in a mobile app experience. Image segmentation, a computer vision machine learning task, makes this a reality by creating pixel-accurate image masks of detected objects. Computer vision is progressing at such a rapid rate that these tasks can now run on mobile handsets, and even Raspberry Pi-like devices with simple ARM processors. What’s most exciting is that developers can start creating these new experiences today. Let’s take a moment to think about what’s possible:

Social, Retail & Gaming Scenarios

Some of the most exciting new opportunities are in social, retail, and AR/VR. For social, gaming and photography apps — imagine superimposing users into completely different landscapes and scenery, or immersing them into a game. In retail, what if you could transport the user into a virtual fitting room or let them interact with products in a virtual showroom?

Image Segmentation for mobile & AR experiences

Productivity & Videoconferencing

Image segmentation can also enhance online meetings by eliminating background distractions. This is done by blurring out or completely changing the background in the video stream. This allows users to preserve privacy, make the environment appear more professional, or even make a conference call more productive by placing people together into a virtual conference room.

How It Works

Image segmentation partitions images and video frames into distinct regions, each containing the pixels of one instance of an object. These attributes are derived by training models with images to identify different types of objects, like people, vehicles, and animals. The result is a binary mask: a black-and-white image showing where the segmentation algorithm finds a match.

Segmentation mask isolating the dancer in the frame
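A binary mask like the one above can be derived from a network’s per-pixel class scores by taking the winning class at each pixel. A toy sketch (the class index and score values here are hypothetical, not a real model’s output):

```python
import numpy as np

PERSON = 1  # hypothetical class index in a segmentation model's output

def binary_mask(scores, target=PERSON):
    """Reduce per-pixel class scores (HxWxC) to a black-and-white mask.

    A pixel is white (True) wherever the winning class equals `target`.
    """
    return scores.argmax(axis=-1) == target

# 2x2 frame scored over 3 classes; the left column wins for "person"
scores = np.array([
    [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]],
    [[0.2, 0.6, 0.2], [0.5, 0.3, 0.2]],
])
print(binary_mask(scores).astype(int))
```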

Improving Training Data & Performance Optimization

With Xnor’s real-time image segmentation you can dynamically isolate people in live video and superimpose them anywhere in 2D, VR, or augmented reality. Capable of running solely on the CPU of devices or servers, Xnor’s segmentation algorithm can also take advantage of GPUs, accelerators and neural processors.

This article by our CTO, Mohammad Rastegari, shows just one of the ways we are improving deep learning accuracy and performance on devices. Advances like these also power our image segmentation offering, executing efficiently enough to run on mobile handsets and streaming camera video. Internal benchmarks indicate our approach performs up to 9x faster than standard solutions.

Until now, segmentation has been difficult to accomplish due to high processing requirements and the lack of accurate deep learning models. Good training data is scarce, so only the largest companies could invest the time and resources necessary to create deep learning models that identify and segment people and objects with high accuracy.

Additionally, traditional object detection and segmentation tasks perform billions of compute-intensive, floating-point operations, which require powerful processors enhanced with GPUs or AI accelerator chips.

Xnor solves these problems by providing optimized pre-trained models and tuned algorithms that deliver higher performance and accuracy than other state-of-the-art models. By precisely training deep learning models and reducing the complexity of the algorithms, our AI scientists enable segmentation tasks to execute in real-time on streaming video, on form factors as small as mobile handsets.

Want to learn more?

Visit us at www.xnor.ai or click here to learn more about image segmentation.

Today we’re featuring Toradex, one of our hardware partners who will be exhibiting at Arm TechCon this week in San Jose, California. If you’re at the conference, come visit booth #1134 to see a joint demonstration of Xnor running on Toradex’s efficient Arm system-on-modules. On a single board we have been able to get Xnor object-detection models running real-time on three cameras.

ai at the edge | xnor

Apalis iMX8 — Toradex’s Computer on Module with NXP i.MX 8 SoC

Toradex is the preferred computing solution provider for low- to medium-volume projects in the embedded industry, enabling customers with fast time to market and low total cost of ownership. With over 3,000 customers, including World Cup Rally car racing teams, Toradex products are designed to run 24/7 in critical applications and withstand harsh environments with extreme temperature ranges, high vibration, and high humidity.

AI at the Edge


AI-enhanced building security smart enough to recognize authorized personnel, differentiate between people, animals, and machinery — and track movement across cameras.

Our collaboration unlocks usage scenarios that run edge AI tasks on resource-constrained, low-power hardware, tasks that were previously available only in the cloud or on specialized hardware:

Commercial Security: Identify authorized and unauthorized people based on face identification. Track package delivery; differentiate between intruders, pets, and trusted individuals.

Retail Analytics: Analyze foot-traffic flow, generate heat maps, and gauge customer sentiment.

Manufacturing: Inspect output for consistency and quality control; send alerts when anomalous actions or behavior occur on production floor.

Reducing Latency, Power Consumption & Downtime

Together, Xnor and Toradex are enabling AI on edge devices where technology and the real world intersect, providing a fast, resilient, and autonomous solution that works without interruption. Many of our customer conversations center on the downsides of relying too heavily on the cloud for AI tasks.

The rationale we often hear is that dependence on cloud connectivity increases latency and power consumption, reduces overall performance, and raises the risk of downtime from network and cloud outages. None of that is acceptable in mission-critical monitoring solutions where personal safety and property are at stake. Because Xnor runs on-device on edge hardware like Toradex’s, tasks keep running through outages, even advanced tasks like face identification running on very small models.

Eliminating Hidden Cloud Costs


Additionally, executing real-time AI tasks on-device eliminates the cost of using an external cloud solution. The cost of utilizing cloud AI quickly adds up — even 30 minutes of daily cloud vision services can cost over $350/month and requires 1.4 terabytes per year of network throughput per camera.
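The exact numbers depend on frame rate, bitrate, and provider pricing, but a back-of-envelope estimator shows how quickly the costs accumulate. All inputs below are illustrative assumptions, not measured figures:

```python
def cloud_vision_cost(minutes_per_day, fps, price_per_1k_images, days=30):
    """Monthly per-camera cost of a metered cloud vision API (assumed pricing)."""
    images = minutes_per_day * 60 * fps * days
    return images / 1000 * price_per_1k_images

def yearly_bandwidth_tb(minutes_per_day, mbps):
    """Network throughput per camera per year, in terabytes, at a given bitrate."""
    bytes_per_day = minutes_per_day * 60 * mbps * 1e6 / 8
    return bytes_per_day * 365 / 1e12

# 30 min/day sampled at 1 frame every 2 seconds, at an assumed $1.50 per 1,000 calls
print(round(cloud_vision_cost(30, 0.5, 1.50), 2))    # even modest sampling adds up
# an assumed 17 Mbps camera stream lands near the 1.4 TB/year figure above
print(round(yearly_bandwidth_tb(30, 17), 2))
```

Scale the frame rate toward continuous analysis and the monthly bill grows linearly with it.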

Come visit the booth to see this in action. If you’re not at the conference, stay tuned for updates where we’ll share more. Thanks!


Come see Xnor on Toradex in action at Arm TechCon 2018!

Visit xnor.ai to learn more.

More refined labels at training yield higher accuracy for on-device models

Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, Ali Farhadi

Advancements in deep learning opened the possibility for any camera to operate as a smart sensor capable of seeing and understanding a range of visual content. Deep neural networks now power the technology underlying most of our online interactions: social media, shopping, watching TV and movies, and managing our photos. What led to the rise of deep learning? One basic reason is the parallel advancement in cloud computing and the accessibility of GPUs to train and run such large neural networks.

What happens when we move AI onto edge devices?

This access to GPUs is why the early adoption of AI was mainly for problems that could be solved in the cloud. But there are many real-world applications that would benefit from the promise of AI: for example, smart devices in the home that can recognize food in fridges, map drone flight paths in the sky, and improve the outcomes of search and rescue missions.

Smaller neural networks

The first approach that enabled deep neural networks to run at the edge at all was to implement efficient network architectures that require radically less memory and compute. The three main approaches are:

Low-Precision Models

Part of what makes deep learning so computationally intense is its convolutional operations. A standard full-precision model relies on 32-bit floating-point convolutional operations, which, for standard deep learning models with billions of operations, means a massive amount of compute is required to make even a single inference. Through our work on XNOR-NET, we have shown that one approach to reducing the model size is to drive convolutions all the way down from 32-bit to 1-bit. This approach has demonstrated the ability to radically reduce the memory and compute of deep learning models. Furthermore, given the binary nature of the operations, the models can be easily embedded into hardware that relies on binary operations.
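To see why 1-bit convolutions are so cheap, consider the dot products inside a convolution. When weights and activations are constrained to {-1, +1}, a dot product reduces to an XNOR followed by a popcount. The sketch below illustrates the arithmetic identity; it is a simplification of the XNOR-Net idea, not its actual kernels:

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors via XNOR + popcount.

    Each n-bit integer encodes +1 as bit 1 and -1 as bit 0.
    """
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # 1 wherever the signs agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # agreements minus disagreements

# Cross-check against the explicit +/-1 arithmetic
a, b = [+1, -1, +1, +1], [+1, +1, -1, +1]
to_bits = lambda v: sum(1 << i for i, x in enumerate(v) if x > 0)
print(binary_dot(to_bits(a), to_bits(b), len(a)))  # equals sum(x * y)
```

One machine word now carries 32 or 64 multiply-accumulates, which is where the memory and compute savings come from.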

Sparse Models

Another way to reduce model size and computation is to use sparse parameters, where most parameters are set to zero and only a few remain non-zero. This means we can skip all the computations involving zero parameters and store only the indexes and values of the non-zero ones. We have developed LCNN: Look-up based convolutional neural networks, which benefit from the sparse structure of the model to improve the efficiency of the network.
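The payoff of sparsity can be sketched in a few lines: with only the indexes and values of the non-zero parameters stored, a dot product touches just those entries. This is illustrative code, not LCNN’s actual lookup scheme:

```python
def sparse_dot(indices, values, dense):
    """Dot product that touches only the stored non-zero parameters."""
    return sum(v * dense[i] for i, v in zip(indices, values))

# The weight vector [0, 0, 3, 0, -2] stored sparsely as indexes + values
indices, values = [2, 4], [3.0, -2.0]
x = [1.0, 1.0, 2.0, 1.0, 5.0]
print(sparse_dot(indices, values, x))  # 3*2 + (-2)*5 = -4.0
```

With 95% of weights at zero, both the storage and the multiply count shrink by roughly 20x.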

Compact Network Design

MobileNets, pioneered by Google, are an example of a streamlined architecture that relies on depth-wise separable convolutions. These models make it easy to trade off between accuracy and latency dependent on the constraints of the model’s application.
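The savings from depthwise separable convolutions are easy to quantify: a standard k x k convolution uses k*k*c_in*c_out weights, while the depthwise-plus-pointwise factorization uses k*k*c_in + c_in*c_out. A quick check:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Depthwise (k x k per input channel) plus pointwise (1 x 1) convolution."""
    return k * k * c_in + c_in * c_out

# A 3x3 convolution mapping 128 channels to 128 channels
std = conv_params(3, 128, 128)
sep = separable_params(3, 128, 128)
print(std, sep, round(std / sep, 1))  # roughly an 8x parameter reduction
```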

How can we improve accuracy for edge models?

Optimizing network architectures was a critical first step in getting deep learning models to run on edge devices. Now that the problem of how to run AI on the edge has been solved, we have been focused on new ways to improve performance. One approach that has been largely overlooked until now is data labeling, so we asked the question:

Are the current data labels for standard mainstream AI tasks good enough for training the models running in resource-constrained environments?

Training Data for Object Classification

To train a deep learning model to classify objects, we need to provide a set of images that have been “tagged” with those objects. Until now, the field of computer vision has relied heavily on standard image sets such as ImageNet. Because they work to a large extent, researchers have not questioned how effective the labels are at providing perfect generalization from training datasets to test sets.

Challenge 1: Incomplete label for one image

Sample image labeled ‘Persian cat’ in ImageNet’s training set

This image has the data label “Persian cat,” which means that we are training the model to learn that everything in this image should be classified as a Persian cat. As humans, it’s trivial for us to see that this image actually contains a Persian cat playing with a ball, where “ball” is an object category in this dataset but is not labeled in this image. We understand that the cat is a separate object from the ball, that the cat is significantly bigger than the ball, and that the cat is living while the ball is non-living. Despite all of this complexity in the image, with only one training label, we are simply telling the network that this image = Persian cat. Hence, this labeling is incomplete.

Challenge 2: Inconsistencies from random cropping — cropping the wrong object

To prevent overfitting (where models perform poorly because of an inability to generalize to novel images), various training techniques have been introduced to prevent models from memorizing actual images. One commonly used approach is to take a randomly sampled crop from different areas of the image, known as crop-level augmentation, which can represent as little as 8% of the entire image. While this improves the model’s ability to generalize from training to test datasets, it introduces inconsistencies between the image label and the crop when there is more than one object in the image. For example, in the image below, a random crop may contain only pixels from the ball, yet be labeled “Persian cat”.

example of random cropping augmentation
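A small simulation makes the inconsistency concrete. Using hypothetical bounding boxes for the cat and the ball, a fraction of random crops contain only the ball yet would inherit the image-level label “Persian cat”:

```python
import random

random.seed(0)  # deterministic for the illustration

def random_crop(w, h, min_area_frac=0.08):
    """Sample a crop box covering at least `min_area_frac` of the image area."""
    area = random.uniform(min_area_frac, 1.0) * w * h
    cw = min(w, max(1, int(area ** 0.5)))
    ch = min(h, max(1, int(area / cw)))
    return (random.randint(0, w - cw), random.randint(0, h - ch), cw, ch)

def overlaps(box, obj):
    """True if two (x, y, w, h) boxes intersect."""
    x, y, w, h = box
    ox, oy, ow, oh = obj
    return x < ox + ow and ox < x + w and y < oy + oh and oy < y + h

# Hypothetical 224x224 'Persian cat' image: cat top-left, ball bottom-right
cat, ball = (0, 0, 60, 60), (150, 150, 40, 40)
crops = [random_crop(224, 224) for _ in range(1000)]
mislabeled = sum(overlaps(c, ball) and not overlaps(c, cat) for c in crops)
print(f"{mislabeled} of 1000 crops contain only the ball yet keep the cat label")
```

Every one of those crops teaches the network that a ball is a Persian cat.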

Challenge 3: Inconsistencies from random cropping — similarities between crops

Another challenge of training models on image crops is that there are opportunities for objects with different image-level labels to look indistinguishable to the model, for example a patch of bread could look very similar to a patch of butternut squash. To maintain high-levels of accuracy, it’s important for the model to be able to distinguish between these two food categories.

Example of random crop from bread (left) and butternut squash (right)

Challenge 4: Similar or very different mis-categorizations deliver the same penalty to the model

As humans, we have all looked at a little dog that could be a cat, or thought a big dog looks like a bear. Currently, the way that most models are designed is that any misclassification or mistake is penalized equally. The way current labels are determined does not provide any insight into related categories. For example, the model receives the same level of penalization if it mis-categorizes a cat as either a dog or as a mirror on a car.

Examples of the limitations from taxonomy dependency
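The equal-penalty problem falls directly out of standard cross-entropy with one-hot labels: the loss depends only on the probability assigned to the true class, not on where the remaining probability mass went. A quick demonstration:

```python
import math

def cross_entropy(probs, true_idx):
    """Standard cross-entropy loss with a hard (one-hot) label."""
    return -math.log(probs[true_idx])

# classes: 0 = cat, 1 = dog, 2 = car mirror; the true label is "cat"
mistake_dog    = [0.3, 0.7, 0.0]  # model confuses the cat with a dog
mistake_mirror = [0.3, 0.0, 0.7]  # model confuses the cat with a car mirror
print(cross_entropy(mistake_dog, 0) == cross_entropy(mistake_mirror, 0))  # True
```

Both mistakes cost exactly -log(0.3), so the model gets no signal that a dog is a far more reasonable guess than a car mirror.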

Large, high-complexity models are able to learn despite these inconsistencies in training data. But when models are lower precision, sparser, or more compact, these inconsistencies in the training data cannot be resolved, and the accuracy of the model suffers.

How can we create training labels that deliver higher accuracies?

Given the challenges in data labeling identified above, we proposed a new iterative procedure to dynamically update ground truth labels using a visual model trained on the entire dataset. This new approach is called Label Refinery and relies on a neural network model to produce labels with the following properties that are consistent with the image content:

  • Soft
  • Informative
  • Dynamic

Soft labels are able to categorize multiple objects in an image and can determine what percentage of the image is represented by each object category. For the cat and ball example above, we are able to classify the image as 80% cat and 20% ball.

Informative labels provide a range of categories with the relevant confidence, so that if something is mislabeled as a cat, you can know that the second highest category is dog.

Dynamic labels — this approach to labeling allows you to ensure that the random crop is labeling the correct object in the image by running the model dynamically as you sample over the image.
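A toy geometric stand-in for the teacher network illustrates the soft and dynamic properties at once. The real Label Refinery uses a trained model’s softmax outputs, not bounding boxes; the boxes here are hypothetical:

```python
def refine_label(crop, objects):
    """Relabel a crop by the fraction of it each object covers (toy 'teacher').

    crop: (x, y, w, h); objects: {name: (x, y, w, h)} with hypothetical boxes.
    """
    x, y, w, h = crop
    label = {}
    for name, (ox, oy, ow, oh) in objects.items():
        ix = max(0, min(x + w, ox + ow) - max(x, ox))  # horizontal overlap
        iy = max(0, min(y + h, oy + oh) - max(y, oy))  # vertical overlap
        label[name] = ix * iy / (w * h)
    return label

objects = {"Persian cat": (0, 0, 80, 100), "ball": (80, 60, 20, 40)}
print(refine_label((0, 0, 100, 100), objects))  # soft label for the full image
print(refine_label((80, 60, 20, 40), objects))  # dynamic label for a ball-only crop
```

Because the label is recomputed per crop, a crop containing only the ball is labeled “ball”, fixing the inconsistency from Challenge 2.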

How does label refinery impact accuracy?

We evaluated the Label Refinery approach on the standard ImageNet ILSVRC2012 classification challenge using a variety of model architectures.

The figure below shows the first label generated from ImageNet and goes on to show how the model refines the labels over time. The graphs show the line of “perfect generalization,” where a model perfectly generalizes from training data to test data. Our results show that with progressively more automatic label refining, model performance moves closer and closer to perfect generalization. This trend is reflected in the accuracy table below.

Figure taken from Label Refinery paper

Table taken from Label Refinery paper

Why is Label Refinery critical for boosting accuracy at the edge?

The results of implementing Label Refinery show interesting findings. We found that large models with a small generalization gap (the difference between training and test accuracy) are better able to handle imprecise data labels, so running Label Refinery has less impact on them. The biggest boost in performance comes on models that have been compressed down to optimize for small memory and compute, or models with a large generalization gap. This is why Label Refinery is critical for boosting accuracy on edge models running in resource-constrained environments.

With this approach, combined with Xnorized network architectures, we can now create models that are small, power efficient, low latency, and have a high degree of accuracy that can power smart devices as small as a doorbell and as mobile as a drone.

Source code for Label Refinery is available on GitHub.

Paper on arXiv.

Source code for XNOR-NET is available on GitHub.

Learn how to deploy Edge AI models at Xnor.ai’s developer workshop

At the Edge AI Summit in San Francisco on December 11, we will show how we’re using the Label Refinery and multiple other algorithm and model optimizations to create real-time AI solutions on hardware as small as a Raspberry Pi Zero.