Only a few years ago, end-to-end speech-to-speech translation (S2ST) seemed like one of those technologies that belonged to conference talks and research papers rather than real products. When Google introduced Translatotron in 2019, it was a glimpse of what might one day be possible: translating speech directly into speech, without detours through text, and even preserving the speaker’s voice. It was brilliant. It was fragile. And almost nobody believed it would be powering consumer devices any time this decade. Yet here we are.
The new Google Pixel Live Translate and Google Meet’s translated audio are built on precisely that kind of end-to-end model. What looked like distant research has quietly crossed the threshold into production. This is not merely an incremental improvement in translation quality. It represents the arrival of a fundamentally new class of real-time language technology. And, at least for some languages, it is running on devices ordinary people carry in their pockets.
From Cascaded Systems to Direct Audio-to-Audio Translation
Traditional speech translation works like a relay race: speech is turned into text, the text is translated, and the translated text is converted back to speech. This approach has many advantages, but each step introduces delay and error. A typical cascaded pipeline adds four to five seconds of latency. That might be acceptable for one-to-many scenarios such as lectures or presentations, but it is far too slow for natural dialogue. And although recent TTS voices are impressive, the synthesized voice is generic, losing all the personal characteristics of the original speaker.
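To make the relay-race structure concrete, here is a toy sketch of a cascaded pipeline in Python. The functions and stage durations are placeholders chosen for illustration, not any vendor's actual components; the point is simply that each stage must finish before the next can begin, so the delays add up.

```python
import time

# Toy simulation of a cascaded ASR -> MT -> TTS pipeline.
# Stage durations are illustrative assumptions, not measurements of any real system.

def recognize(audio: bytes) -> str:        # speech recognition
    time.sleep(1.5)
    return "ciao, come stai?"

def translate(text: str) -> str:           # machine translation
    time.sleep(0.5)
    return "hi, how are you?"

def synthesize(text: str) -> bytes:        # speech synthesis, generic voice
    time.sleep(1.0)
    return b"<waveform bytes>"

start = time.time()
output_audio = synthesize(translate(recognize(b"<input waveform>")))
print(f"end-to-end delay: {time.time() - start:.1f} s")   # the three stages simply add up
```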

End-to-end S2ST collapses all of that. A single model listens in one language and speaks in another, producing audio tokens directly. This is why the Pixel’s translations feel different: the timing is smoother, the intonation more natural, and the voice surprisingly similar to the speaker’s own.
When Renato Beninatto and I tested the Pixel (video here), we suspected, almost reluctantly, that we were hearing an end-to-end model. The style of translation and the subtle preservation of voice qualities were telling. But we were unsure, because most experts believed the technology was still in its infancy, certainly not ready for mass deployment.
In its latest technical blog, Google has now confirmed it: the system is end-to-end, streaming, personalized, and real-time. Let's take a closer look and work through the implications.
The Two-Second Breakthrough
The headline number is the two-second delay, which is extremely close to human simultaneous interpreting. Reality is always a bit different from announcements, but when you try it out, the latency really is lower than with current cascaded approaches. Achieving this required innovations in data alignment, model architecture, and deployment.

Google constructed a large-scale data pipeline to precisely align source audio with translated audio. This ensures the model learns not only what to say, but when to say it. The training data is filtered aggressively so that only examples with reliable alignment and acceptable delay remain. The system learns to stream audio, deciding on the fly when it has enough information to produce translated output.
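As a rough illustration of that filtering step, the sketch below keeps only pairs whose alignment is reliable and whose translated audio does not lag the source too far. The field names and thresholds are assumptions for illustration, not Google's actual criteria.

```python
# Hypothetical filtering of aligned (source, target) audio pairs.
# Field names and thresholds are illustrative assumptions.

MAX_DELAY_S = 2.5        # discard pairs where the translation lags too far behind
MIN_ALIGN_SCORE = 0.8    # discard pairs with unreliable alignment

def keep(pair: dict) -> bool:
    delay = pair["target_start_s"] - pair["source_start_s"]
    return pair["alignment_score"] >= MIN_ALIGN_SCORE and 0.0 <= delay <= MAX_DELAY_S

corpus = [
    {"source_start_s": 0.0, "target_start_s": 1.2, "alignment_score": 0.93},
    {"source_start_s": 0.0, "target_start_s": 4.8, "alignment_score": 0.91},  # too much lag
    {"source_start_s": 0.0, "target_start_s": 0.9, "alignment_score": 0.55},  # unreliable alignment
]
training_set = [p for p in corpus if keep(p)]
print(len(training_set), "of", len(corpus), "pairs kept")
```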
By contrast, most commercial cascaded systems wait for a sentence boundary (even though it is possible to reduce this to the sub-sentence level with equal quality, as I demonstrated here). A streaming end-to-end system waits only a few hundred milliseconds. This is what allows the model to maintain conversational flow instead of translating in blocks.
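The difference between the two strategies is easy to see in a toy comparison. Neither policy below is the real thing; they simply contrast waiting for a sentence boundary with emitting after a few hundred milliseconds of buffered audio.

```python
# Two toy emission policies over a stream of ~100 ms frames.
# Purely illustrative; neither is Google's actual policy.

def sentence_boundary_policy(frames):
    """Emit only once an end-of-sentence marker arrives."""
    buffer = []
    for f in frames:
        buffer.append(f)
        if f.endswith("."):                 # sentence boundary detected
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)

def streaming_policy(frames, lag_frames=3):
    """Emit as soon as a few hundred milliseconds of context are buffered."""
    buffer = []
    for f in frames:
        buffer.append(f)
        if len(buffer) >= lag_frames:       # ~300 ms at 100 ms per frame
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)

frames = ["la", "riunione", "inizia", "alle", "tre."]
print(list(sentence_boundary_policy(frames)))  # one big block, only at the very end
print(list(streaming_policy(frames)))          # small chunks, emitted much earlier
```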
Audio Tokens and Streaming Transformers
Technically, the system uses a hierarchical audio tokenization method derived from AudioLM and Google’s SpectroStream codec. Instead of generating waveforms directly, the model predicts compact units (RVQ tokens) that encode short audio segments. Sixteen tokens can represent roughly 100 ms of high-quality speech. Predicting tokens instead of raw waveform samples massively reduces computation, making streaming viable.
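A back-of-the-envelope calculation shows why this matters. The 48 kHz sample rate below is an assumption for high-quality audio; the 16-tokens-per-100-ms figure is the one quoted above.

```python
# Rough comparison of prediction steps per second of audio.
# 48 kHz is an assumed sample rate for high-quality audio.

samples_per_second = 48_000        # raw waveform values to predict per second
tokens_per_second = 16 * 10        # 16 RVQ tokens per 100 ms -> 160 tokens per second

print(f"waveform samples/s: {samples_per_second}")
print(f"RVQ tokens/s:       {tokens_per_second}")
print(f"reduction factor:   {samples_per_second / tokens_per_second:.0f}x")  # ~300x fewer steps
```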
On the input side, a streaming encoder processes about ten seconds of audio context at any given time, giving the model enough context for meaningful translations. On the output side, a streaming decoder generates translated audio tokens autoregressively, conditioning on the encoder state and the previously generated tokens. A special text token is included for training efficiency, allowing quality to be measured without running a speech recognizer on the output.
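Conceptually, the decoding loop has roughly the shape sketched below. The helper functions are hypothetical stand-ins for the streaming encoder and decoder, not a real API; the structure just shows audio tokens being generated autoregressively from the encoder state and the tokens emitted so far, chunk by chunk.

```python
# Schematic streaming decode loop. encode_chunk / decoder_step / to_waveform
# are hypothetical stand-ins, not a real API.

def streaming_s2st(audio_chunks, encode_chunk, decoder_step, to_waveform):
    encoder_state = None
    generated_tokens = []                     # previously emitted RVQ tokens
    for chunk in audio_chunks:                # e.g. 100 ms of source audio per chunk
        encoder_state = encode_chunk(chunk, encoder_state)   # rolling ~10 s context
        while True:
            token = decoder_step(encoder_state, generated_tokens)
            if token is None:                 # decoder decides it needs more source audio
                break
            generated_tokens.append(token)
            yield to_waveform([token])        # stream translated audio out immediately
```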
This architecture becomes the backbone of real-time S2ST.
How It Fits on a Phone: Quantization and New On-Device AI Chips
One of the quiet revolutions in this story is not the model itself, but the fact that it runs on the edge. Google’s Pixel uses a combination of custom silicon (the Google Tensor G3 and now G4 chips) and advanced quantization techniques to make the model small, fast, and energy-efficient. Large S2ST models are computationally heavy; running them in real time on a mobile device was unthinkable just a few years ago.
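To give a sense of what quantization buys, here is a minimal sketch of symmetric int8 weight quantization: weights stored as 8-bit integers take a quarter of the memory of 32-bit floats and map well onto mobile accelerators. Real deployments use far more sophisticated schemes, and nothing here reflects Google's specific recipe.

```python
import numpy as np

# Minimal symmetric int8 quantization of a weight matrix.
# Real on-device deployments use more elaborate schemes (per-channel scales,
# quantization-aware training, etc.); this only shows the basic idea.

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0                      # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("memory: 4 bytes/weight -> 1 byte/weight")
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```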
The result is significant: fluent cross-lingual audio translation running entirely on-device, depending on the language, with a server-side fallback in some contexts (presumably for languages that are not yet supported on-device).
Running S2ST on the edge is crucial because it:
- avoids privacy concerns tied to cloud audio streaming,
- reduces latency dramatically,
- preserves user experience even with poor connectivity, and
- does not cost a penny to run: Google gives this feature away for free.
Why This Matters for Interpreting
The story here is not simply that a new model works. It is that the entire set of assumptions about speech translation has changed. End-to-end models are not “future technology” anymore. They exist, they run in real time, and they run on edge devices thanks to breakthroughs in quantization, codecs, and mobile AI hardware.
And there is another implication: speech-to-speech translation is becoming free, much as written translation became free with Google Translate and DeepL. The quality may not match what can be achieved with composite pipelines, where many more parameters can be controlled and optimized, and it will therefore not be suitable for professional use. But it clearly points to a near future in which every phone effectively contains a built-in interpreter.
In some respects, this is a genuine improvement. Latency is unmatched by current cascaded systems, and this, in turn, will likely push cascaded systems to reduce their own latency further. The voice-replication effect may also prove decisive for user acceptance, creating a sense of continuity that earlier systems could never offer.
Still, whether people will use it at scale remains an open question. Technology alone does not guarantee adoption. The answer will only become clear in the next few years.