Modern robots are no longer just executing commands. They are becoming interactive, conversational systems that listen, recognize, respond, and adapt in ways that feel more natural to the people using them.
For a recent webinar, “Beyond Beeps and Boops: Giving Voice to Modern Robots,” Sensory brought together panelists with deep experience across speech, audio, and robotics.
The conversation covered everything from wake words and speaker identification to multimodal sensing, cloud vs. edge tradeoffs, and the future of embodied AI.
A major theme in the discussion was that robots have to work in messy, noisy environments, not just in ideal demos. Gal-Oz described how his team is building a robot that lives in a senior’s apartment and communicates entirely by voice, noting that “the communication with the robot by the senior is through voice only, no touching.” To protect privacy, his team does much of the processing on the robot itself, so the system does not transfer voice or user data over the internet. He also explained why their wake word design had to account for a demanding audio environment: “there is loud TV, all the time.”
That reality makes interaction design as important as model quality. As Beckmann put it, “If we want an interactive robot, audio is important,” and the system has to ensure the right signals get passed through the rest of the stack. The panel agreed that natural behavior is not about removing all structure, but about making the structure invisible to the user.
Wake words came up repeatedly as a practical necessity, but not a perfect one. Mozer explained that Sensory has been trying to reduce the need for rigid wake-word behavior while still preserving user control and natural interaction. The audience poll showed a strong preference for wake words with custom naming, which Mozer found interesting because customers often default to brand-specific choices while consumers want something more personal.
Gal-Oz noted the limits of flexibility in real deployments, saying that while users may want to choose a name like “Mike,” that can fail because of how common the word is in everyday speech. Beckmann added that in normal human interaction, “there’s always a wake word, at least at the very beginning,” and that a system may also need other wake triggers, such as sound detections or safety events like a fall.
The panel spent a lot of time on the edge vs. cloud decision. Mozer argued that some functions, especially speech-to-text, can be very strong on-device with a relatively compact model, while large language models still tend to perform better in the cloud. He said that once a speech-to-text model gets above roughly “20 or 30 megabytes,” it can be “pretty state-of-the-art” on-device.
Gal-Oz explained that his robot runs with limited onboard compute, so privacy and performance force a hybrid approach: local processing for certain tasks, cloud processing when the conversation becomes more complex. The panel also discussed caching repeated text-to-speech phrases on-device to reduce latency and cost.
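The phrase caching the panel mentioned can be illustrated with a minimal Python sketch. It is not the implementation of any panelist's system: the `PhraseCache` class, the `fake_cloud_tts` stand-in, and the `max_entries` limit are all hypothetical names chosen for illustration. The idea is only that a phrase the robot says repeatedly is synthesized once and then replayed from local storage, so repeat requests skip the slow or billed synthesis call.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class PhraseCache:
    """Keeps synthesized audio for repeated phrases on-device (hypothetical sketch)."""

    def __init__(self, synthesize: Callable[[str], bytes], max_entries: int = 256):
        self._synthesize = synthesize      # assumed slow or costly TTS call
        self._max_entries = max_entries    # assumed cap for limited onboard storage
        self._audio: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivially different spellings share an entry.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def speak(self, text: str) -> bytes:
        key = self._key(text)
        if key in self._audio:
            self._audio.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self._audio[key]
        self.misses += 1
        audio = self._synthesize(text)
        self._audio[key] = audio
        if len(self._audio) > self._max_entries:
            self._audio.popitem(last=False)  # evict the least recently used phrase
        return audio


def fake_cloud_tts(text: str) -> bytes:
    """Stand-in for a real TTS service; returns placeholder bytes, not audio."""
    return ("AUDIO:" + text).encode("utf-8")


cache = PhraseCache(fake_cloud_tts, max_entries=2)
first = cache.speak("Good morning")
second = cache.speak("good   morning")  # served from the cache, no synthesis call
```

A least-recently-used eviction policy is one reasonable choice for a device with limited storage; a real product might instead pre-generate a fixed set of common phrases.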
Another standout topic was personality. Gal-Oz said that “personality is critical” and that the system should tune responses to the traits of the person using it. Some users want humor, others want something drier, and some want a friend while others want an assistant. That idea connected with Mozer’s observation that products like Pi resonated because they let users choose voice and personality.
Pieraccini pushed the conversation further by pointing out the emotional side of product design. He recalled how users bonded deeply with Jibo and described it as “the Tamagotchi effect,” where people form real attachment to machines that show human-like behavior. That, he argued, is part of why robotics is moving from being merely functional to being emotionally legible.
The panel closed by looking ahead to the next stage of robotics. Beckmann said she likes the idea of helper robots that live in specific spaces, such as a refrigerator, a kitchen, or a room, rather than a humanoid walking around the home. Pieraccini agreed, saying that helper robots are more likely to become common than humanoids, which still face major issues around safety, charging, cost, and acceptance.
The takeaway from the webinar was clear: the future of robots is not just about smarter models. It is about building systems that can understand context, stay safe, respect privacy, and interact in ways people naturally trust. As Pieraccini put it, “we don’t need to adapt to the technology, but it’s going to adapt to us.”
Learn more from this panel by watching the full webinar recording, or explore Sensory’s solutions for embodied robots.