Goal
Efficient wake word recognition on microcontrollers with Cortex-M55 and Helium technology for use in consumer and automotive products that include more and more AI features for voice applications.
Challenge
Companies want to create their own branded voice experiences and strengthen their relationship with their customers. When seeking to implement a custom, branded wake word, there is always a compromise between the desired accuracy and the resources required to achieve it. On a constrained device, this is a particular challenge because the memory and processing resources are limited by the available hardware. Also, when designing for battery-powered devices, energy efficiency is a primary concern for always-listening wake word implementations. Customers expect performance equal to or better than established smartphone and smart speaker experiences. The challenge is to achieve the best possible accuracy within the constraints of the platform.
Solution
Sensory’s TrulyHandsfree: Always-Listening Embedded Speech Recognition
Sensory’s wake word and phrase spotting technology is known for fast response, low power consumption, and excellent performance from a distance or in noisy environments. This technology is an integral component for fully featured voice control of devices in the home, car, and anywhere voice user interfaces could be deployed.
The combination of Sensory’s optimized software with the performance of the Arm Cortex-M55 processor is a compelling solution.
Benefits
- Best in class performance (false reject, false accept) for an optimal user experience.
- Supports dozens of languages and enables global voice coverage.
- Flexible model sizes, from 1MB to as small as 40KB, which can be customized for DSP, microcontroller, or applications processor-based products.
Learn more about the products we love at Sensory:
https://www.sensory.com/featured-products/.
Design challenges
Microcontrollers are typically designed for resource-constrained applications. As more and more features and functions are squeezed into products that utilize AI on the endpoint, Sensory’s engineers are constantly challenged to do more with less. For TrulyHandsfree, this translates into maintaining optimal wake word accuracy with a reduction in MHz. To provide a successful voice experience and unlock a whole new range of applications where ML/AI at the endpoint is pushed even further, our team needed to explore the possibilities with a microcontroller capable of handling DSP-type workloads in an efficient way. We needed to investigate and quantify the benefits that could be directly applied to keyword spotting products such as true wireless earbuds, wearable health trackers, smart speakers, and video doorbells, all of which need to stay competitive by packing in more and more AI features.
Design implementation
Arm Cortex-M55 offered a solution that would enable Sensory to bring AI to more devices and people in the most efficient way. Because we had been working on Cortex-M4 and have a long history with Arm’s toolchain, our engineering team jumped on the opportunity to port our existing TrulyHandsfree software to Cortex-M55.
The Cortex-M55 processor brings more performance, simplified software development, and an extensive ecosystem, which enabled Sensory to carry out its investigations efficiently and to quantify the benefits of Cortex-M55 for keyword spotting applications.
Before we jump straight into the Cortex-M55 benefits, it is probably best to revisit a few high-level topics on how to measure wake word accuracy and how model size and MHz come into play.
In general terms, wake word accuracy is represented as an operating point on the performance graph. The operating point is meant to provide a balance between false reject (FR) percentage and false accepts (FA) per day. You can see this on the chart below.
Many different factors go into defining the actual operating point, but in general a balanced approach is desired.
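As a rough illustration of how an operating point might be chosen, the sketch below fixes a false accept budget and picks the detection threshold with the lowest false reject rate inside that budget. The `OperatingPoint` type, the `pick_operating_point` helper, the threshold scale, and every number in the sweep are hypothetical; they are not TrulyHandsfree APIs or measured results, and this is only one possible way to express a balanced choice.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    threshold: float    # detection score threshold (hypothetical scale)
    fr_percent: float   # false rejects: % of spoken wake words that are missed
    fa_per_day: float   # false accepts per 24 hours of audio


def pick_operating_point(sweep, max_fa_per_day):
    """Return the point with the lowest FR% whose FA/day fits the budget."""
    candidates = [p for p in sweep if p.fa_per_day <= max_fa_per_day]
    if not candidates:
        raise ValueError("no threshold meets the false accept budget")
    return min(candidates, key=lambda p: p.fr_percent)


# Illustrative sweep only; these values are assumptions, not measurements.
sweep = [
    OperatingPoint(threshold=0.30, fr_percent=2.0, fa_per_day=12.0),
    OperatingPoint(threshold=0.50, fr_percent=4.0, fa_per_day=5.0),
    OperatingPoint(threshold=0.70, fr_percent=7.0, fa_per_day=2.5),
    OperatingPoint(threshold=0.90, fr_percent=15.0, fa_per_day=0.5),
]

chosen = pick_operating_point(sweep, max_fa_per_day=3.0)
print(chosen)
```

Raising the threshold trades false accepts for false rejects, so tightening the false accept budget in this sketch pushes the chosen point toward a higher false reject rate.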
The recurring theme of doing more with less for AI at the endpoint presents the greatest challenge and often means that efficiency or accuracy has to be sacrificed. For example, if developers are limited in processing power and require fewer MHz for wake word detection, then the typical solution would be a smaller wake word model, which results in lower frequency but also lower accuracy.
This is not ideal, but it is sometimes a design trade-off that has to be made. The table below shows an estimate of how model size, MHz, and False Reject rate relate on a Cortex-M4, which we have been using in our current solution. For comparison purposes, the wake word engine is exercised in a variety of background noise profiles, and each False Reject rate is referenced to a fixed False Accept rate of three events per day.
With each reduction in model size there is a significant reduction in MHz, which is also accompanied by a loss of accuracy. That loss of accuracy creates the potential for a higher occurrence of False Rejects (FR), False Accepts (FA), or both.
However, things are now different with Cortex-M55. Arm’s new Cortex-M55 includes Arm Helium technology, a new vector instruction set extension that provides a significant uplift when applied to Sensory’s TrulyHandsfree. When compared to Cortex-M4, which is currently used in TrulyHandsfree, Cortex-M55 provides equivalent wake word accuracy, but with an average reduction in MHz of 73%.
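To put the 73% figure in perspective, the following sketch converts an average MHz reduction into the remaining clock budget. Only the 73% average comes from the analysis above; the 100 MHz baseline and the function name are assumptions chosen for easy arithmetic, not measured Cortex-M4 values.

```python
def mhz_after_reduction(baseline_mhz, reduction_percent):
    """Return the MHz still required after a percentage reduction."""
    if not 0 <= reduction_percent <= 100:
        raise ValueError("reduction_percent must be between 0 and 100")
    return baseline_mhz * (1 - reduction_percent / 100)


AVERAGE_REDUCTION_PERCENT = 73   # average reduction quoted in the text
baseline_mhz = 100.0             # hypothetical Cortex-M4 budget, for illustration

m55_mhz = mhz_after_reduction(baseline_mhz, AVERAGE_REDUCTION_PERCENT)
ratio = baseline_mhz / m55_mhz   # how many times fewer MHz are needed

print(f"Remaining budget: {m55_mhz:.1f} MHz ({ratio:.1f}x fewer MHz)")
```

Under these assumptions, a workload that needed 100 MHz would need roughly 27 MHz, which is about 3.7 times fewer MHz for the same model and accuracy.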
This is an initial analysis; Sensory expects that even further optimization may be achieved with Cortex-M55.
Working with Sensory’s TrulyHandsfree and Arm’s Cortex-M55, developers no longer face the trade-off of decreased accuracy for lower frequency.
When paired together, the solution enables developers to free up resources and simultaneously maintain model size for industry-leading wake word accuracy.
With such a dramatic reduction in MHz, some developers may even choose a more accurate model and still leverage a lower frequency. From a software perspective, developers can take advantage of the simplified experience of working with a single toolchain and a familiar software development ecosystem. Cortex-M55 is an efficient solution for workloads that would often call for a DSP, but which can now run on a single processor with a familiar toolchain.
Sensory and Cortex-M55 products provide a premium user experience for wake word spotting and, with a lower frequency, they open up the resources for powerful new features like Sound Identification or gesture control.
Arm and Sensory empower developers to do more with less. Saving compute resources leads directly to greater energy efficiency and enables new use cases. For many constrained use cases, energy usage is of critical importance. The inclusion of Helium vector processing technology in the Cortex-M55 processor enables significant improvements in terms of useful work per Joule when running DSP and ML workloads. Simulations carried out by Arm show an improvement of over 3x in average energy efficiency compared to Cortex-M4, measured over a range of common DSP kernels.