Dr. Claudio Fantinuoli
December 1, 2025

The Expressiveness of Voices in Machine Interpreting

A few days ago, I was invited to speak at the Franco-German broadcaster ARTE, where one of the topics on the table was the expressiveness of AI-generated voices. It is a timely subject. Voices generated by machines are approaching a point of near-indistinguishability from human speech. Some critics refuse to believe this, insisting that synthetic audio still sounds “robotic”. But with the newest generation of models, distinguishing a real voice from a generated one is increasingly a matter of chance. If we take the threat of deepfakes seriously, and we should, then we implicitly accept that synthetic voices now reach a level of realism that our perceptual systems cannot reliably tell apart. Either these voices are highly convincing, or deepfakes would not be considered the societal risk they clearly are.

Part of the confusion stems from the fact that most observers of AI interpreting systems still evaluate them with assumptions from six months or a year ago. They listen to older demos, or to low-quality consumer tools, and conclude that machine voices remain stiff and mechanical. This is naïve. The pace of improvement is extremely rapid, and every few months the boundary of what is possible shifts. What was unthinkable only recently (expressive, real-time, low-latency¹ speech) has already become mainstream in frontier models.

The question, then, is not whether machine voices can sound natural, but how expressive they need to be, depending on the application.

In AI dubbing, for example, voice expressiveness is not just important: it is the entire product. A voice must span a wide emotional palette, adapt to diverse narrative contexts, and be customizable enough to reflect subtle changes in tone or mood (see, for example, the beautiful idea of the Emotional Compass by the Italian startup Voiseed). The acoustic quality also needs to match the setting, whether for a high-end media production or a casual social-media clip. In these workflows, it bears remembering, a synthetic voice is judged by its ability to support the artistic and communicative goals of the production, not by how closely it imitates human internal mechanisms.
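
To make the idea of continuous emotional control concrete, here is a minimal Python sketch that interpolates between prosody presets instead of switching between fixed styles. Everything in it (presets, numbers, function names) is invented for illustration; it does not describe how Voiseed's Emotional Compass actually works.

```python
# A minimal, self-contained sketch of continuous emotional control:
# steering a synthetic voice along emotional dimensions rather than
# picking from a fixed list of styles. All presets and numbers are
# invented for illustration.

# Toy prosody presets: (pitch shift in semitones, speaking-rate factor,
# energy factor) for a few anchor deliveries.
PRESETS = {
    "neutral": (0.0, 1.00, 1.00),
    "joyful": (2.5, 1.10, 1.25),
    "somber": (-1.5, 0.90, 0.80),
}

def blend(emotion_a: str, emotion_b: str, t: float) -> tuple[float, ...]:
    """Linearly interpolate between two prosody presets.

    t = 0.0 yields emotion_a, t = 1.0 yields emotion_b; values in
    between produce intermediate deliveries, which is what makes the
    control feel like a compass rather than a switch.
    """
    a, b = PRESETS[emotion_a], PRESETS[emotion_b]
    return tuple((1 - t) * x + t * y for x, y in zip(a, b))

# Halfway between neutral and joyful: a mildly upbeat delivery.
pitch, rate, energy = blend("neutral", "joyful", 0.5)
print(f"pitch {pitch:+.2f} st, rate x{rate:.2f}, energy x{energy:.2f}")
```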

AI interpreting operates in a different communicative space. Voice quality matters: listeners do not want to endure hours of flat, monotonous audio at a conference. And here, too, progress has been exceptional. Today’s speech-to-speech systems can generate real-time voices with natural prosody and smooth intonation that would have been impossible half a year ago. That this level of expressiveness can be achieved while maintaining simultaneous-mode latency is a genuine technological milestone, as demonstrated in my prototype below.

Prototype of a simultaneous interpreting system with a high-definition voice.
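
For readers who wonder how simultaneity is achieved at all, the sketch below shows the generic shape of a cascaded streaming pipeline: small audio chunks flow incrementally through ASR, machine translation, and TTS instead of waiting for complete sentences. The three stage functions are dummies standing in for real streaming models; this is not the architecture of my prototype.

```python
# A generic sketch of how cascaded simultaneous pipelines keep latency
# low: audio is processed in small chunks that flow incrementally
# through ASR, MT, and TTS, rather than sentence by sentence.
import time
from collections.abc import Iterator

def audio_chunks() -> Iterator[bytes]:
    """Placeholder microphone source yielding short PCM chunks."""
    for _ in range(5):
        yield b"\x00" * 1024  # dummy audio, stands in for real capture

def asr_step(chunk: bytes, state: list[str]) -> str:
    """Incremental ASR: return the newly stabilized words (dummy)."""
    state.append("word")
    return state[-1]

def mt_step(words: str) -> str:
    """Incremental MT: translate the stabilized source prefix (dummy)."""
    return words.upper()

def tts_step(text: str) -> bytes:
    """Streaming TTS: synthesize the new target-language text (dummy)."""
    return text.encode()

asr_state: list[str] = []
for chunk in audio_chunks():
    t0 = time.perf_counter()
    new_words = asr_step(chunk, asr_state)  # 1. transcribe incrementally
    translated = mt_step(new_words)         # 2. translate the stable prefix
    audio_out = tts_step(translated)        # 3. start speaking immediately
    lag_ms = (time.perf_counter() - t0) * 1000
    print(f"emitted {len(audio_out)} bytes; added pipeline lag {lag_ms:.2f} ms")
```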

Once again, I would like to dwell on what makes AI so effective. Machine systems do not need to reproduce the underlying human physiology (and cognition) of expressive speech to reach similar outcomes. Critics rightly note that AI approaches problems differently, a point echoed in a recent reflection paper by the European Language Council. But this difference is not a limitation, as they mistakenly frame it; it is a source of capability. As with translation and interpreting in general, so in the specific case of voice generation: the goal is not to replicate the human path to expressiveness but to achieve expressive output, and machines have found their own route.

More importantly, we need to understand that interpreting is not a uniform activity, and expressiveness is not universally required. In conference environments, a pleasant, natural voice benefits the listener. But in many other settings — medical consultations, administrative interactions, asylum procedures — the priorities shift. The goal there is strictly functional: facilitating a clear, efficient exchange of information. Full stop.

In these situations, a highly “human-like” voice may not even be desirable. It risks projecting a social presence that the technology does not intend to embody. Users may prefer a neutral, unobtrusive voice, one that signals transparency rather than simulated personality. Moreover, in these environments other factors become central: the ability to run models locally and to ensure that no data leaves the device or the controlled infrastructure, as the Swedish company Mable AI does for medical interpreting with HIPAA compliance.
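
As an illustration of what privacy by architecture can look like in code, here is a hypothetical deployment configuration that keeps all model weights on local disk and refuses remote inference by construction. The field names and paths are invented; they do not describe Mable AI's product or any other real system.

```python
# A hypothetical deployment configuration illustrating "privacy by
# architecture": all model weights live on local disk and remote
# inference is refused by construction, so no audio or text can leave
# the device.
from dataclasses import dataclass

@dataclass(frozen=True)
class LocalDeployment:
    asr_model_path: str          # local weights, not a cloud endpoint
    mt_model_path: str
    tts_model_path: str
    allow_network: bool = False  # hard-off by default
    retain_audio: bool = False   # no raw recordings are kept

    def __post_init__(self) -> None:
        # Fail fast if someone tries to enable remote inference.
        if self.allow_network:
            raise ValueError("remote inference is disabled in sensitive settings")

cfg = LocalDeployment(
    asr_model_path="/opt/models/asr.onnx",
    mt_model_path="/opt/models/mt.onnx",
    tts_model_path="/opt/models/tts.onnx",
)
print("network access enabled:", cfg.allow_network)  # -> False
```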

This is the technological future we should expect: a diversified ecosystem. Highly expressive voices for creative work. Natural, conference-ready voices for professional settings. Neutral, privacy-preserving voices for sensitive interactions. Different architectures, different levels of expressiveness, different trade-offs.

And within this ecosystem, human interpreters will continue to play essential roles — particularly where trust, accountability, relational dynamics, or embodied presence matter in ways that no synthetic voice, however fluent, can replicate. Machine voices are becoming more human-sounding. But the future of interpreting will not be defined by perfect mimicry. It will depend on understanding where expressiveness enhances communication, where it distracts from it, and where it simply does not matter.

  1. I would consider low latency the biggest remaining challenge for voice synthesis, but it is one that many are tackling with increasing success. ↩︎
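
To give the latency point some numbers, here is a back-of-the-envelope budget for a cascaded pipeline. The commonly cited ear-voice span of human simultaneous interpreters, roughly two to three seconds, sets the budget; the per-stage figures are illustrative assumptions, not measurements of any real system.

```python
# A back-of-the-envelope latency budget for a cascaded pipeline. The
# stage figures below are illustrative assumptions only.
budget_ms = 2500  # assumed acceptable end-to-end lag (ear-voice span)

stages_ms = {
    "audio chunking / buffering": 320,
    "streaming ASR stabilization": 600,
    "incremental MT": 250,
    "streaming TTS (time to first audio)": 400,
    "transport / glue": 150,
}

total = sum(stages_ms.values())
for name, ms in stages_ms.items():
    print(f"{name:36s} {ms:5d} ms")
print(f"{'total':36s} {total:5d} ms  "
      f"(budget {budget_ms} ms, headroom {budget_ms - total} ms)")
```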

