The industry is fixated on trillion-parameter models and headline-grabbing benchmarks. Those models are doing important work, but there’s a quieter kind of heavy lifting that makes them usable in the real world: tiny, ruthlessly efficient models running on devices at the edge.
If you care about latency, bandwidth, battery life, and real user experience, you can’t just throw everything at the cloud. You need a hybrid stack: on-device wake words, speaker verification, and speech-to-text feeding compact, intent-level data into the big models instead of raw audio streams. Sensory’s business is to handle that “other” hard part so you can stay focused on the giant models in your data center that actually respond to what the user asked for.
Start with the basics: 16-bit, 16 kHz mono audio is the common fidelity required for speech recognition, which works out to 256 kbit (32 kB) per second of raw data. Uncompressed, that is nearly 2 MB per minute per user, and even with compression a continuous voice stream adds up quickly, on top of the radio and server power required to carry and process it. At scale, transporting and ingesting that much data is not just a cost issue; it’s a reliability and coverage issue.
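The arithmetic behind that figure is easy to check. The short Python sketch below is only an illustration (the function name `pcm_data_rate` is made up for this example); it multiplies sample rate, bit depth, and channel count to get the raw PCM data rate for 16-bit, 16 kHz mono audio.

```python
def pcm_data_rate(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> dict:
    """Raw (uncompressed) PCM data rate in a few convenient units."""
    bits_per_second = sample_rate_hz * bits_per_sample * channels
    bytes_per_second = bits_per_second // 8
    return {
        "kilobits_per_second": bits_per_second / 1000,
        "kilobytes_per_second": bytes_per_second / 1000,
        "megabytes_per_minute": bytes_per_second * 60 / 1_000_000,
    }


# 16-bit, 16 kHz mono: the speech-recognition format discussed above.
rate = pcm_data_rate(sample_rate_hz=16_000, bits_per_sample=16)
for unit, value in rate.items():
    print(f"{unit}: {value}")
```

For this format the result is 256 kilobits per second, which is 32 kilobytes per second, or about 1.92 MB per minute before any compression.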
Now contrast that with sending text. A typical voice request might be 5–15 words. That’s often just a few hundred bytes of UTF-8 text, easily 100–1000× smaller than the corresponding audio, depending on codec and duration. In marginal coverage (one bar of LTE, satellite, congested Wi-Fi), those tiny packets of text still get through when continuous audio streaming simply doesn’t.
On-device ASR and NLU effectively act as a bandwidth and reliability amplifier for your large model backend.
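To make the size gap concrete, here is a rough Python sketch comparing one spoken request as audio and as text. Everything specific in it is an assumption chosen for illustration: the example sentence, the 3-second utterance length, and the 24 kbit/s compressed bitrate are not figures from Sensory or from any particular codec.

```python
SAMPLE_RATE_HZ = 16_000      # 16 kHz mono
BYTES_PER_SAMPLE = 2         # 16-bit samples
ASSUMED_CODEC_BPS = 24_000   # illustrative compressed bitrate (assumption)
ASSUMED_DURATION_S = 3.0     # illustrative utterance length (assumption)

# Hypothetical voice request, as on-device ASR might transcribe it.
request = "turn the living room lights down to thirty percent"

text_bytes = len(request.encode("utf-8"))
raw_audio_bytes = int(ASSUMED_DURATION_S * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
compressed_audio_bytes = int(ASSUMED_DURATION_S * ASSUMED_CODEC_BPS / 8)

raw_ratio = raw_audio_bytes / text_bytes
compressed_ratio = compressed_audio_bytes / text_bytes

print(f"text:             {text_bytes} bytes")
print(f"raw audio:        {raw_audio_bytes} bytes ({raw_ratio:.0f}x the text)")
print(f"compressed audio: {compressed_audio_bytes} bytes ({compressed_ratio:.0f}x the text)")
```

Under these assumptions the text is 50 bytes, the raw audio is 96,000 bytes, and the compressed audio is 9,000 bytes, so the text payload is roughly 180 to 1,900 times smaller. Real payloads carry protocol overhead and real codecs vary, so treat the ratios as order-of-magnitude estimates.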
The story isn’t “cloud versus on-device.” It’s “use the right model in the right place and sync them together.”
A pure-cloud architecture asks your backend models to do everything: wake word and intent detection, diarization, speaker verification, ASR, NLU, and the management of state, context, and logic. That’s flexible, but it’s also fragile, expensive, and hard to make responsive enough to feel “instant” or to interact with devices in real time.
Advances in wake word technology have enabled a common, power-sipping hybrid approach: the device listens for a wake phrase at very low power and wakes the edge application processor, for on-device speech processing and for transferring data to the cloud, only when it is needed. The hybrid approach splits responsibilities as follows:
In practice, this yields a tiered architecture. Tiny models run continuously, gating access to heavier processing. Intermediate representations (text, intent, embeddings) go upstream only when necessary. Large models in the cloud stay focused on the genuinely complex tasks: reasoning, composition, long-context understanding, and transaction enablement.
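The gating idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Sensory's implementation or API: `Frame`, `tiny_wake_word_model`, `on_device_asr`, and `run_pipeline` are hypothetical names, and each frame's `label` simply fakes what a real model would detect in 20 ms of audio.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Frame:
    """Stand-in for 20 ms of audio; `label` fakes what a model would hear."""
    label: str
    size_bytes: int = 640  # 20 ms of 16-bit, 16 kHz mono audio


def tiny_wake_word_model(frame: Frame) -> bool:
    """Hypothetical always-on detector: cheap enough to run on every frame."""
    return frame.label == "wake"


def on_device_asr(frames: List[Frame]) -> str:
    """Hypothetical heavier model: runs only on frames captured after a wake."""
    return " ".join(frame.label for frame in frames)


def run_pipeline(frames: Iterable[Frame], end_marker: str = "silence") -> List[str]:
    """Return the text that would go upstream; raw audio never leaves the device."""
    upstream: List[str] = []
    command: Optional[List[Frame]] = None  # None means the heavy tier is asleep
    for frame in frames:
        if command is None:
            if tiny_wake_word_model(frame):
                command = []  # wake the heavier tier and start capturing
            continue
        if frame.label == end_marker:
            if command:
                upstream.append(on_device_asr(command))
            command = None  # go back to low-power listening
        else:
            command.append(frame)
    return upstream


stream = [Frame(label) for label in
          ["noise", "noise", "wake", "play", "jazz", "silence", "noise"]]
sent = run_pipeline(stream)
audio_bytes_heard = sum(frame.size_bytes for frame in stream)
text_bytes_sent = sum(len(text.encode("utf-8")) for text in sent)
print(sent, audio_bytes_heard, text_bytes_sent)
```

In this toy run the device hears 4,480 bytes of audio but sends only the 9-byte string "play jazz" upstream, and the heavier ASR step runs on two frames out of seven. A real system would add speaker verification and intent extraction between the wake word and the upload.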
Done right, you reduce:
Sensory has been solving the edge side of this problem for years: wake words, embedded ASR, speaker verification, and sound identification on phones, cars, wearables, and dedicated IoT silicon. Its wake word technology is deployed across major OEMs and SoCs in several billion devices, and the company has already gone through the pain of squeezing models into constrained cores while keeping accuracy high and power low.
A few reasons Sensory is the right partner if you’re building the cloud LLM piece:
You don’t want your LLM team rewriting wake word engines or debugging quantization artifacts on random embedded DSPs. That’s a different kind of deep work, requiring specialized knowledge and experience that Sensory already has.
If you’re a cloud provider or platform owner, your differentiator is the experience and intelligence you can build on top of the text and intents coming into your trillion-parameter models.
“Stick to the heavy lifting” should mean focusing your best people on those big models and the orchestration around them. Let Sensory stick to the “micro” heavy lifting: the tiny, domain-specific, ruthlessly optimized on-device models that make your stack feel fast, private, and reliable to end users.
In a world obsessed with parameter counts, the systems that win will be the ones that optimize user experience across the entire chain of devices: cloud models where scale matters, and on-device models where every milliwatt and every kilobit counts.