For decades, the “Holy Grail” of speech recognition has been the ability to process natural language entirely on-edge, without the privacy risks or latency of the cloud. However, the industry has long been stuck in a compromise: either use massive, power-hungry processors for high accuracy, or settle for “Command & Control” triggers that fail the moment a user deviates from a script.
Today, Sensory is breaking that compromise. By marrying our high-accuracy Speech-to-Text (STT) models with LiteRT Micro (formerly TensorFlow Lite Micro) and an “NPU-First” architectural philosophy, we are delivering a new class of “Tiny STT” that fits into the smallest footprints imaginable.
In the world of embedded AI, the Neural Processing Unit (NPU) is often treated as a “nice-to-have” accelerator. Traditional STT engines frequently “ping-pong” data between the CPU and NPU because some of their neural network operators aren’t supported by the NPU hardware. Each handoff adds overhead, increases latency, and drains battery life.
Sensory’s approach is fundamentally different. Our STT models achieve 100% NPU operator mapping. This means:
We’ve optimized our STT engine into two distinct, high-performance profiles that redefine what “small” means for speech recognition:
By adopting LiteRT as our essential runtime layer, Sensory provides developers with a standardized, future-proof integration path. This allows our STT technology to deploy seamlessly across the world’s most popular embedded platforms:
| Silicon Partner | Supported Platforms |
|---|---|
| Arm® | Cortex-M Series (M4, M7, M55) and Ethos™-U NPUs |
| Cadence® | Tensilica® HiFi 4, HiFi 5, and HiFi iQ DSPs |
| Espressif | ESP32 Series |
| NXP | i.MX RT Crossover MCUs |
| Development Boards | Arduino Nano 33 BLE Sense, Sony Spresense |
Q: What is LiteRT, and why is it used for Speech-to-Text?
A: LiteRT Micro (also called LiteRT for Microcontrollers) is the new name for TensorFlow Lite Micro. It is a runtime designed specifically for executing machine learning models on microcontrollers and other resource-constrained devices. By using LiteRT Micro, Sensory ensures that our STT models can run on devices with minimal memory, without needing an operating system or dynamic memory allocation.
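To make "no dynamic memory allocation" concrete: microcontroller runtimes of this kind typically work out of a fixed buffer (often called a tensor arena) that the application reserves up front. The sketch below is a generic bump allocator over such a buffer. It is not the LiteRT Micro API or Sensory's implementation; `StaticArena`, `g_arena`, and the 4096-byte `kArenaSize` are hypothetical names and values chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative fixed-size bump allocator over a caller-provided buffer.
// Nothing here calls malloc/new; all memory comes from the static buffer.
class StaticArena {
public:
    StaticArena(std::uint8_t* buffer, std::size_t size)
        : buffer_(buffer), size_(size), used_(0) {}

    // Returns a block aligned to `alignment`, or nullptr if it does not fit.
    void* allocate(std::size_t bytes, std::size_t alignment = 16) {
        // Alignment must be a non-zero power of two.
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_);
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (base + used_ + mask) & ~mask;
        const std::size_t offset = static_cast<std::size_t>(aligned - base);
        if (offset > size_ || bytes > size_ - offset) {
            return nullptr;  // arena exhausted: fail instead of growing
        }
        used_ = offset + bytes;
        return buffer_ + offset;
    }

    std::size_t used() const { return used_; }
    std::size_t capacity() const { return size_; }

private:
    std::uint8_t* buffer_;
    std::size_t size_;
    std::size_t used_;
};

// Hypothetical arena size; a real model's requirement would be measured.
constexpr std::size_t kArenaSize = 4096;
alignas(16) static std::uint8_t g_arena[kArenaSize];
```

Because the buffer's size is fixed at build time, the worst-case memory footprint is known before the device ever runs, which is the property that makes an OS-free deployment practical.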
Q: How does Sensory achieve 100% NPU operator mapping?
A: Most neural networks use a variety of mathematical operations (operators), each of which needs a matching kernel on the target hardware. If a hardware NPU doesn’t support a specific operator, the system “falls back” to the CPU to finish the calculation. Sensory meticulously designs its STT architectures to use only the specific operators supported by edge NPUs like the Arm Ethos-U, ensuring the CPU never has to intervene during active transcription.
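The fallback behavior described above can be sketched as a simple coverage check: walk a model's operators in execution order, flag any that the NPU cannot run, and count how many times execution would cross between NPU and CPU. This is a minimal illustration, not Sensory's tooling. `CoverageReport` and `check_npu_coverage` are hypothetical names, and the supported-operator set passed in is an assumption supplied by the caller rather than the real capability list of any NPU.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Outcome of comparing a model's operator list with an NPU's supported set.
struct CoverageReport {
    std::vector<std::string> cpu_fallback_ops;  // operators the NPU cannot run
    std::size_t handoffs = 0;                   // NPU<->CPU transitions in order
    bool fully_mapped() const { return cpu_fallback_ops.empty(); }
};

// `model_ops` lists operators in execution order; `npu_supported` is the
// (assumed) set of operator names the target NPU can execute.
CoverageReport check_npu_coverage(const std::vector<std::string>& model_ops,
                                  const std::set<std::string>& npu_supported) {
    CoverageReport report;
    bool have_prev = false;
    bool prev_on_npu = false;
    for (const std::string& op : model_ops) {
        const bool on_npu = npu_supported.count(op) > 0;
        if (!on_npu) {
            report.cpu_fallback_ops.push_back(op);
        }
        // A handoff happens whenever consecutive operators run on
        // different processors.
        if (have_prev && on_npu != prev_on_npu) {
            ++report.handoffs;
        }
        prev_on_npu = on_npu;
        have_prev = true;
    }
    // Sanity check: there cannot be more fallbacks than operators.
    assert(report.cpu_fallback_ops.size() <= model_ops.size());
    return report;
}
```

In this sketch, a single unsupported operator in the middle of a graph produces two handoffs (NPU to CPU and back), while a model whose operators are all in the supported set reports zero handoffs and `fully_mapped()` returns true.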
Q: Can these models run on standard Arduino or ESP32 boards?
A: Yes. Because the engine is compatible with LiteRT for Microcontrollers, it can be deployed as a standard C++ library. It has been tested on popular platforms, including the Arduino Nano 33 BLE Sense and Espressif ESP32.
Q: What is the benefit of “Domain Adaptation” in the 2.7 MB model?
A: Domain adaptation allows a small model to achieve the accuracy of a much larger one by focusing its “intelligence” on a specific set of vocabulary or environmental conditions (like car cabin usage and noise). This makes it possible to have highly reliable voice control on hardware that traditionally could only handle simple keyword spotting.
Q: How does on-device STT improve user privacy?
A: Because the model runs entirely on the local hardware, no voice data or audio recordings are ever transmitted to a cloud server. This eliminates the risk of interception in transit and ensures that the device can operate in “comms-denied” environments without losing functionality.