Last year, Google showed off WaveNet, a new way of generating speech that didn’t rely on a bulky library of word bits or cheap shortcuts that result in stilted speech. WaveNet used machine learning to build a voice sample by sample, and the results were, as I put it then, “eerily convincing.” Previously bound to the lab, the tech has now been deployed in the latest version of Google Assistant.
The general idea behind the tech was to recreate words and sentences not by coding grammatical and tonal rules manually, but allowing a machine learning system to see those patterns in speech and generate them sample by sample. A sample, in this case, being the tone generated every 1/16,000th of a second.
At the time of its first release, WaveNet was extremely computationally expensive, taking a full second to generate 0.02 seconds of sound — so a two-second clip like “turn right at Cedar street” would take nearly two minutes to generate. As such, it was poorly suited to actual use (you’d have missed your turn by then) — which is why Google engineers set about improving it.
The new, improved WaveNet generates sound at 20x real time — generating the same two-second clip in a tenth of a second. And it even creates sound at a higher sample rate: 24,000 samples per second, and at 16 versus 8 bits. Not that high-fidelity sound can really be appreciated in a smartphone speaker, but given today’s announcements, we can expect Assistant to appear in many more places soon.
The voices generated by WaveNet sound considerably better than the state of the art concatenative systems used previously:
Old and busted:
New and hot:
Disrupt 2026: The tech ecosystem, all in one room
Your next round. Your next hire. Your next breakout opportunity. Find it at TechCrunch Disrupt 2026, where 10,000+ founders, investors, and tech leaders gather for three days of 250+ tactical sessions, powerful introductions, and market-defining innovation. Register now to save up to $400.
Save up to $300 or 30% to TechCrunch Founder Summit
1,000+ founders and investors come together at TechCrunch Founder Summit 2026 for a full day focused on growth, execution, and real-world scaling. Learn from founders and investors who have shaped the industry. Connect with peers navigating similar growth stages. Walk away with tactics you can apply immediately
Offer ends March 13.
(More samples are available at the Deep Mind blog post, though presumably the Assistant will also sound like this soon.)
WaveNet also has the admirable quality of being extremely easy to scale to other languages and accents. If you want it to speak with a Welsh accent, there’s no need to go in and fiddle with the vowel sounds yourself. Just give it a couple dozen hours of a Welsh person speaking and it’ll pick up the nuances itself. That said, the new voice is only available for U.S. English and Japanese right now, with no word on other languages yet.
In keeping with the trend of “big tech companies doing what the other big tech companies are doing,” Apple, too, recently revamped its assistant (Siri, don’t you know) with a machine learning-powered speech model. That one’s different, though: it didn’t go so deep into the sound as to recreate it at the sample level, but stopped at the (still quite low) level of half-phones, or fractions of a phoneme.
The team behind WaveNet plans to publish its work publicly soon, but for now you’ll have to be satisfied with their promises that it works and performs much better than before.
