The idea of a machine that listens to speech in one language and instantly speaks it back in another — all in real time, and without missing a beat — has long captured the imagination of researchers, technologists, and organizations working in multilingual communication. This is the vision behind end-to-end machine interpreting: a single AI model that takes speech as input and produces translated speech as output, without breaking the task down into smaller, separate components.
But while this technology holds enormous promise, the reality today is clear: end-to-end machine interpreting isn’t ready for production use yet. The systems we have now, though groundbreaking in design, simply don’t match the reliability, accuracy, or scalability that real-life settings demand.
What’s Machine Interpreting Today?
Let’s start with where we are. The systems powering AI interpreting today follow a tried-and-tested approach known as the cascaded pipeline model. First, an automatic speech recognition (ASR) engine transcribes the spoken input into text. That text is then fed into a machine translation (MT) engine, which converts it into the target language. Finally, a text-to-speech (TTS) system generates speech output. The combination of the three can vary greatly, not only in terms of the individual components, but also in how these components interact with one another, leading different systems to perform very differently from each other [1].
This modular setup has served us well. It allows developers to combine best-in-class components, customize systems for specific languages or domains, and upgrade parts of the system without having to start from scratch. But it comes with some drawbacks too. Errors can accumulate at each stage: a misrecognition in ASR leads to a mistranslation in MT, which in turn results in awkward or incorrect speech synthesis. What’s more, much of the richness of the original speech — things like pitch, intonation, emphasis, emotion — is lost along the way.
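To make the architecture concrete, here is a minimal sketch of a cascaded pipeline in Python. The three component classes are illustrative stubs of my own, not any vendor’s API; the point is the sequence of handoffs, each of which can introduce errors that the later stages silently inherit.

```python
# Minimal sketch of a cascaded interpreting pipeline (ASR -> MT -> TTS).
# The component classes below are illustrative stubs, not a real API;
# only the structure of the handoffs matters here.

class ASR:
    def transcribe(self, audio: bytes, language: str) -> str:
        # In a real system: an automatic speech recognition engine.
        return "ciao a tutti"            # placeholder transcript

class MT:
    def translate(self, text: str, source: str, target: str) -> str:
        # In a real system: a machine translation engine.
        return "hello everyone"          # placeholder translation

class TTS:
    def synthesize(self, text: str, language: str) -> bytes:
        # In a real system: a text-to-speech engine returning audio.
        return text.encode("utf-8")      # placeholder "audio"

def interpret(audio: bytes, src: str, tgt: str) -> bytes:
    # Stage 1: transcription. A misrecognition here is invisible to the
    # later stages, which treat the text as ground truth.
    transcript = ASR().transcribe(audio, language=src)
    # Stage 2: translation. It works on bare text, so prosody, emphasis,
    # and speaker identity are already gone at this point.
    translation = MT().translate(transcript, source=src, target=tgt)
    # Stage 3: synthesis. Any upstream error surfaces here as
    # fluent-sounding but wrong speech.
    return TTS().synthesize(translation, language=tgt)

print(interpret(b"<raw audio bytes>", src="it", tgt="en"))
```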
The End-to-End Vision
Enter end-to-end machine interpreting. The idea is simple: instead of breaking speech translation into separate steps, why not train one single AI model to handle the entire task? The model would listen to the source speech and produce translated speech directly, in one integrated process.
In theory, this could unlock huge benefits. Such a model could make use of all the subtle cues in the original speech: not just the words, but the speaker’s tone, rhythm, and intent. And by eliminating handoffs between separate modules, we could reduce the risk of error propagation and perhaps build systems that are easier to maintain in the long run. In theory, latency could also be reduced, since fewer components are involved.
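By contrast with the cascaded sketch above, an end-to-end system collapses the three stages into a single model call. The sketch below is purely conceptual: `EndToEndInterpreter` and `translate_speech` are hypothetical names standing in for one trained model, not an existing library.

```python
# Conceptual sketch of an end-to-end speech-to-speech interface.
# EndToEndInterpreter and translate_speech are hypothetical names that
# stand in for a single trained model, not an existing library.

class EndToEndInterpreter:
    def translate_speech(self, audio: bytes, src_lang: str, tgt_lang: str) -> bytes:
        # One model maps source audio to target audio directly. Because it
        # never reduces the input to bare text, it can in principle condition
        # on pitch, rhythm, and emphasis as well as on the words themselves.
        raise NotImplementedError("stand-in for a trained end-to-end model")

# Usage collapses the three cascaded calls into one:
#   translated_audio = model.translate_speech(audio, "ita", "eng")
```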
Where Are We Now? The Research Landscape
The vision is compelling, but turning it into reality is proving challenging. So far, two major research efforts stand out:
- SeamlessM4T by Meta AI
Meta’s project aims to build a universal model capable of translating speech in dozens of languages, both into text and directly into speech. The goal is flexibility: a single system that can handle a range of translation tasks across languages and modalities.
- Translatotron by Google Research
Google’s Translatotron was one of the first serious attempts to build a direct speech-to-speech translation model. The system generated speech output that not only translated the words but also tried to preserve characteristics of the original speaker’s voice. Translatotron 3 refined this further, improving stability and voice preservation.
Both are extraordinary feats of engineering. But both are also research prototypes, not production systems. They demonstrate what’s possible in controlled conditions, but they aren’t ready to replace cascaded systems in the real-life interpreting settings those systems currently serve.
Here is a short video recording of me speaking Italian, processed by the Seamless model [2]. The model generates the transcription, translation, and translated audio directly.
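For readers who want to try something similar locally, here is a minimal sketch roughly following the Hugging Face transformers documentation for SeamlessM4T v2. The class names, model identifier, and generation arguments are as I recall them and may differ across library versions, so treat this as a starting point rather than a verified recipe.

```python
# Sketch of direct speech-to-speech translation with SeamlessM4T v2,
# roughly following the Hugging Face transformers documentation.
# Assumptions: a recent transformers release with SeamlessM4Tv2Model,
# torchaudio, and a mono WAV recording named speech_italian.wav.
import torch
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# Load the recording and resample to the 16 kHz the model expects.
waveform, sample_rate = torchaudio.load("speech_italian.wav")
waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16_000)

inputs = processor(audios=waveform, sampling_rate=16_000, return_tensors="pt")

# A single generate call produces translated speech directly (here: English).
translated = model.generate(**inputs, tgt_lang="eng")[0].cpu().numpy().squeeze()
torchaudio.save("speech_english.wav", torch.tensor(translated).unsqueeze(0), 16_000)
```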
Why the Excitement?
Why do researchers and companies keep pushing for end-to-end interpreting? Because the potential upside is huge.
Imagine a system that could translate speech while picking up on how something is said, not just what is said. Prosody — pitch, intonation, stress — carries meaning, and in some contexts (think of diplomatic speeches, or emotionally charged statements) it is critical for making sense of a statement. An end-to-end system could, in principle, capture all of this, delivering a more faithful, natural translation.
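As a rough illustration of what a text-only handoff throws away, the sketch below extracts a pitch contour and an energy envelope from a recording. The use of librosa and the file name are my own illustrative choices, not anything mentioned above; these are exactly the kinds of signals an end-to-end model could, in principle, keep and exploit.

```python
# Sketch: extracting two prosodic signals (pitch contour and energy) that a
# text-only pipeline discards. Uses librosa purely for illustration.
import librosa

# Load a speech sample (any recording will do) at 16 kHz mono.
y, sr = librosa.load("statement.wav", sr=16_000, mono=True)

# Fundamental frequency (F0) contour: rising or falling pitch can turn a
# statement into a question, or signal irony and emphasis.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Short-time energy: where the speaker puts stress.
energy = librosa.feature.rms(y=y)[0]

print(f"frames: {len(energy)}, voiced frames with an F0 estimate: {int(voiced_flag.sum())}")
```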
There’s a parallel here with the leap from traditional neural machine translation (NMT) to today’s large language models (LLMs). LLMs don’t just translate: they (sort-of) reason, contextualize, and generate language that’s better embedded in world knowledge. Similarly, end-to-end interpreting could one day offer translations that are more nuanced, context-aware, and therefore human-like.
Ready for Production?
As exciting as all this sounds, we’re not there yet. And the obstacles are significant.
First, data. Training an end-to-end speech model requires vast amounts of speech data in different languages, typically (but not exclusively) original speech aligned with translated speech. This kind of data is scarce, expensive to collect, and much heavier to process than plain text. Compared to building text-based models, it’s an order of magnitude more demanding. Note: besides the issue of missing data, there are also serious quality issues in available speech datasets, as demonstrated by Lau et al. (2025).
Then, there’s computational intensity. End-to-end speech models are large and resource-hungry. Running them in real time, at scale, across multiple languages, isn’t feasible yet with current infrastructure, at least not in a way that’s cost-effective and robust enough for real-life interpreting contexts. Continuous (i.e. simultaneous) interpreting is even more computationally demanding, and therefore (almost) out of reach for now.
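As a back-of-the-envelope way to see why simultaneous operation is so demanding, here is a toy streaming budget. Both numbers are hypothetical placeholders chosen for illustration, not measurements of any system; the point is simply that if per-chunk inference takes longer than the chunk itself, the system drifts further and further behind the speaker.

```python
# Toy latency budget for streaming (simultaneous) interpreting.
# All timings are hypothetical placeholders, not benchmark results.
CHUNK_SECONDS = 2.0                 # audio accumulated before each model call
PROCESSING_SECONDS_PER_CHUNK = 2.6  # assumed inference time of a large end-to-end model

lag_per_chunk = PROCESSING_SECONDS_PER_CHUNK - CHUNK_SECONDS
if lag_per_chunk > 0:
    # The system falls further behind with every chunk: after N chunks the
    # listener is roughly N * lag_per_chunk seconds late, on top of the chunk delay.
    print(f"falls behind by {lag_per_chunk:.1f}s per chunk -> unusable for live speech")
else:
    print(f"keeps up; steady delay of about {CHUNK_SECONDS + PROCESSING_SECONDS_PER_CHUNK:.1f}s")
```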
Finally, quality. Even the best end-to-end systems today don’t match the accuracy and fluency of cascaded models. In interpreting, where precision matters, this is a dealbreaker. They are impressive to use, but they impress above all because you know how they work.
Looking Ahead
So where does this leave us? The future is almost certainly heading toward more integrated, end-to-end approaches. Moving beyond small, specialized modules toward models that can handle more of the task internally is the natural next step, just as we’ve seen in other areas of AI. But making that future a reality is likely to take time.
What’s probably needed is not just more data and compute, but a paradigm shift in AI itself: toward systems that can learn more like humans do, from experience rather than sheer data volume. The new kid on the block will be multimodal models that can process not only audio but also video and the like: in other words, AI systems designed to gain a grounded understanding through direct interaction with the physical world (see a description of the DVPS research project at Translated). Until that happens, cascaded systems will remain the workhorses of AI interpreting, while end-to-end systems continue to evolve in the lab.
Conclusion
For organizations exploring AI interpreting, the message is clear: watch this space, but don’t plan on deploying end-to-end machine interpreting just yet. The technology is promising, the research is fascinating, and the breakthroughs will come. But for now, the reliable choice remains the cascaded systems that have proven themselves in real-world use. The next generation of cascaded systems built around large language models is approaching, promising a major advance in translation grounding and overall quality.
Bibliography
SEAMLESS Communication Team et al., 2025. Joint speech and text machine translation for up to 100 languages. Nature 637, 587–593. https://doi.org/10.1038/s41586-024-08359-z
Nachmani, E., Levkovitch, A., Ding, Y., Asawaroengchai, C., Zen, H., Ramanovich, M.T., 2023. Translatotron 3: Speech to Speech Translation with Monolingual Data. https://doi.org/10.48550/arXiv.2305.17547
Fantinuoli, C., 2025. Machine Interpreting. In: Braun, S., Davitti, E., Korybski, T. (Eds.), Routledge Handbook of Interpreting and Technology. Routledge.
Lau, M., Chen, Q., Fang, Y., Xu, T., Chen, T., Golik, P., 2025. Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning. https://doi.org/10.48550/arXiv.2506.17525
[1] A demo of the “most simple” cascading system can be found on my page here: https://machine-interpreting.com/
[2] A demo of the Seamless model is available here: https://seamless.metademolab.com/demo