AI or machine interpreting refers to the use of software to translate one spoken language into another (including sign language) in real-time, without human intervention or post-editing. It’s designed for immediate, dynamic communication, whether remote, face-to-face, simultaneous, or consecutive. Speech translation technology has a fairly long history and has been evolving rapidly in recent years, as I described in my latest article (Fantinuoli, 2025).
At a high level, we can identify three key technological phases in the evolution of AI interpreting:
- Phase 1: Plain Cascading Systems: In this phase, interpreting systems are composed of sequential modules: speech recognition, machine translation, and speech synthesis. These components may be loosely or tightly integrated, but the translation remains largely direct, with minimal contextual awareness (a minimal sketch of this architecture follows the list). While this approach is already in production and performs well in structured, clear-speech scenarios, its limitations become apparent in dynamic, spontaneous conversations, the essence of real interpreting.
- Phase 2: Contextual Intelligence with Generative AI: Phase 2 builds on the cascading approach but integrates Generative AI to enhance contextual understanding. Large Language Models (LLMs) and agentic AI bring reasoning, prompt-based translation, and awareness of the communicative situation. This represents a major quality leap, as translation becomes semantically grounded and sensitive to nuance, even if still limited to the semantic channel. This is where we begin to approach the “Turing Test” of speech translation: output that feels natural, responsive, and human-like.
- Phase 3: End-to-End Multimodal Models: In the third phase, systems move toward fully end-to-end models, akin to large language models but capable of accepting and generating speech directly. These multimodal systems promise the highest potential quality by using all available input channels (except vision, for now) to inform translation, including prosody, emotion, reasoning, and discourse coherence. Prompting becomes the central mechanism, offering maximum flexibility and adaptability.
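To make the contrast between these phases more concrete, here is a minimal, purely illustrative Python sketch of a Phase 1 cascade: three sequential, independently trained components that each see a single utterance and nothing else. The component names (transcribe, translate, synthesize) and the Segment structure are hypothetical placeholders, not references to any specific product or library.

```python
# Illustrative sketch of a Phase 1 cascading pipeline (all names are hypothetical).
# Each stage is a separate model; the translation step sees only the current
# segment, which is why contextual nuance is easily lost.
from dataclasses import dataclass


@dataclass
class Segment:
    audio: bytes        # raw audio for one utterance
    source_lang: str
    target_lang: str


def transcribe(audio: bytes, lang: str) -> str:
    """Automatic speech recognition: audio -> source-language text."""
    raise NotImplementedError  # stand-in for any ASR engine


def translate(text: str, src: str, tgt: str) -> str:
    """Machine translation: source text -> target text, one segment at a time."""
    raise NotImplementedError  # stand-in for any MT engine


def synthesize(text: str, lang: str) -> bytes:
    """Text-to-speech: target-language text -> audio."""
    raise NotImplementedError  # stand-in for any TTS engine


def interpret_segment(seg: Segment) -> bytes:
    """Phase 1: a strictly sequential cascade with no shared context."""
    source_text = transcribe(seg.audio, seg.source_lang)
    target_text = translate(source_text, seg.source_lang, seg.target_lang)
    return synthesize(target_text, seg.target_lang)
```

The structural weakness is visible in the signature of translate: it receives one segment and nothing else, so speaker intent, prior turns, and register never reach the translation step.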
Where Are We Now?
Currently, Phase 1 technology is in production in, I would say, all real-life scenarios; research is focusing on experimenting with Phase 2, i.e. integrating LLMs (see for example Koneru et al. 2024; Koshkin et al. 2024). We are likely to see early productization of Phase 2 within the next year, as the advantages of LLMs, or more generally of generative AI, are expected to be considerable. Phase 3, while still beyond current capabilities, is a foreseeable milestone within a few years, though the timeline from lab to market remains highly uncertain.
Phase 1 systems work reasonably well in controlled settings, but interpreting as a communicative act is inherently dynamic, with all the complexity that entails. Direct translation, by its nature, cannot fully replicate the nuances of live human interpreting. That’s where Phase 2 technology becomes a game-changer.
Towards Phase 2: The Big Leap Forward
The integration of LLMs and agentic AI into interpreting pipelines will very likely redefine the field, and with it the practical applications people use. These systems can generate translations that are context-aware, conversationally grounded, and, when combined with expressive neural voices, remarkably natural. While computational demands, especially for simultaneous interpreting, remain a challenge, deployment is becoming increasingly feasible. LLMs are now fast, relatively affordable, and widely accessible.
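As a rough illustration of what such a Phase 2 pipeline might look like, the sketch below replaces the segment-by-segment translation step with a prompted LLM call that receives a rolling window of previous turns and a short briefing on the communicative situation. The call_llm helper, the prompt wording, and the window size are assumptions made for this example, not a description of any published system.

```python
# Illustrative Phase 2 sketch: prompt-based, context-aware translation.
from collections import deque

CONTEXT_TURNS = 10  # how many previous turns to keep (arbitrary choice for this sketch)


def call_llm(prompt: str) -> str:
    """Stand-in for any chat-style LLM endpoint; plug in a real client here."""
    raise NotImplementedError


class ContextualTranslator:
    def __init__(self, briefing: str, src: str, tgt: str):
        self.briefing = briefing  # e.g. "medical conference, formal register"
        self.src, self.tgt = src, tgt
        self.history = deque(maxlen=CONTEXT_TURNS)

    def translate_turn(self, source_text: str) -> str:
        context = "\n".join(self.history) or "(start of conversation)"
        prompt = (
            f"You are interpreting from {self.src} into {self.tgt}.\n"
            f"Setting: {self.briefing}\n"
            f"Previous turns:\n{context}\n"
            "Translate the next utterance faithfully, keeping register and "
            f"terminology consistent with the context:\n{source_text}"
        )
        target_text = call_llm(prompt)
        self.history.append(f"{self.src}: {source_text} | {self.tgt}: {target_text}")
        return target_text
```

The point is structural rather than technical: the translation step now has access to the communicative situation and the discourse history, which is precisely what the plain cascade of Phase 1 lacks.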
With many researchers already exploring LLM-based pipelines and the gap between research and production narrowing, 2026 is shaping up to be the year when LLMs become the core of high-quality AI interpreting systems.
As experiments in written translation already suggest, the quality achievable with LLMs at the core of the translation pipeline is such that Phase 2, not Phase 1, represents the true benchmark against which AI interpreting should be compared to professional human interpreters, the gold standard in the field. In this phase, AI interpreters are expected to match human performance in some respects and even reach near-superhuman levels in specific areas such as terminological precision (especially in highly specialized settings like medical or scientific conferences), information density, and factual accuracy. This is not hard to imagine even without the experimental data we already have: an AI system can draw instantly on a vast body of world knowledge, something no human can fully replicate, no matter how extensive their preparation in a narrow domain. On contextual awareness and terminological precision, Koneru et al. in fact state: “This highlights LLMs’ potential for building next generation of massively multilingual, context-aware and terminologically accurate SiMT systems that require no resource-intensive training or fine-tuning” (2024).
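One way to picture where this terminological precision could come from is glossary injection: relevant domain terms are looked up and placed in the prompt before translation, steering the model toward the approved renderings. The sketch below is only an assumption about how this might be wired, not a description of the systems cited above; the glossary entries and helper names are invented for illustration.

```python
# Hypothetical sketch: steering an LLM translation with a domain glossary.
GLOSSARY = {  # toy entries for an English-to-Italian medical setting
    "myocardial infarction": "infarto miocardico",
    "informed consent": "consenso informato",
}


def glossary_hints(source_text: str) -> str:
    """Collect glossary entries whose source term appears in the utterance."""
    lowered = source_text.lower()
    hits = {s: t for s, t in GLOSSARY.items() if s in lowered}
    return "\n".join(f"- render '{s}' as '{t}'" for s, t in hits.items())


def build_prompt(source_text: str) -> str:
    hints = glossary_hints(source_text)
    terminology = f"Terminology constraints:\n{hints}\n" if hints else ""
    return (
        "Translate the following utterance from English into Italian.\n"
        f"{terminology}"
        f"Utterance: {source_text}"
    )
```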
However, what remains uncertain is how effectively this knowledge will translate into better communication in practical terms. Here, LLMs still seem to face limitations, particularly in the fluid, adaptive process of real-time speech. In contrast, experienced human interpreters excel at navigating ambiguity, adapting to rapidly changing contexts, and managing unexpected disruptions, skills that are deeply rooted in human cognition and interaction. In-depth analysis from a communicative perspective, something that Interpreting Studies as a discipline is best suited to provide, will be urgently needed.
Of course, many challenges remain on the path to Phase 2: language coverage, concurrent multilingual output, latency, and other engineering hurdles must be addressed. But these are solvable, and their resolution will depend on smart product design and business priorities rather than on technological feasibility.
Conclusion: The Die Is Cast
From a quality perspective, the direction is clear. LLMs will become the foundation of the next generation of speech translation tools. They won’t be perfect, but they will feel real, responsive, and human in ways Phase 1 systems simply cannot achieve. As the field evolves, users and customers alike can expect more natural, context-aware interpreting experiences, making machine interpreting a powerful and practical option in a growing number of scenarios. Let’s wait for the results expected at the International Conference on Spoken Language Translation 2025, taking place this summer in Vienna. Practical implementation, we are convinced, will follow swiftly.
Bibliography
Fantinuoli, C., 2025. Machine Interpreting, in: Braun, S., Davitti, E., Korybski, T. (Eds.), Routledge Handbook of Interpreting and Technology. Routledge.
Koneru, S., Binh Nguyen, T., Pham, N.-Q., Liu, D., Li, Z., Waibel, A., Niehues, J., 2024. Blending LLMs into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024, in: Salesky, E., Federico, M., Carpuat, M. (Eds.), Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024). Association for Computational Linguistics, Bangkok, Thailand (in-person and online), pp. 183–191.
Koshkin, R., Sudoh, K., Nakamura, S., 2024. LLMs Are Zero-Shot Context-Aware Simultaneous Translators. https://doi.org/10.48550/arXiv.2406.13476