Getting computers to read text aloud and sound like people has long been a challenge. But a Google team seems to have hit parity with human speech (paper):
The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Sure, perfectly clear to me. But they follow with:
Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech
Comparable to professionally recorded speech: that I understand. Tacotron 2 is a neural network system trained directly from data. That seems to be the direction AI is going: neural networks that are easily trained.
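If the quoted description is easier to parse as code, here is a rough sketch of the two-stage pipeline it describes: characters become embeddings, a sequence-to-sequence network turns those into a mel-scale spectrogram, and a WaveNet-style vocoder turns the spectrogram into a waveform. Everything in the sketch (the dimensions, frame counts, and the two placeholder functions) is a made-up stand-in for the real networks, just to show how text flows through to audio.

```python
import numpy as np

# Illustrative sketch of the Tacotron 2 data flow from the quoted abstract:
# characters -> mel spectrogram -> waveform. The two stage functions below are
# hypothetical stand-ins, not the actual trained networks.

N_MELS = 80          # mel channels per spectrogram frame (assumed value)
SAMPLE_RATE = 24000  # output audio sample rate (assumed value)

def embed_characters(text):
    """Map each character to a vector (random here; learned in the real model)."""
    vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
    rng = np.random.default_rng(0)
    table = rng.normal(size=(len(vocab), 512))   # 512-dim embeddings (assumed)
    return np.stack([table[vocab[ch]] for ch in text])

def predict_mel_spectrogram(char_embeddings):
    """Stage 1 stand-in: the recurrent seq2seq network would map character
    embeddings to a mel-scale spectrogram; here we only fake the output shape."""
    n_frames = 5 * len(char_embeddings)          # rough frames-per-character guess
    return np.zeros((n_frames, N_MELS))

def wavenet_vocoder(mel_spectrogram, hop_length=300):
    """Stage 2 stand-in: the modified WaveNet vocoder would synthesize a
    time-domain waveform conditioned on the spectrogram frames."""
    n_samples = mel_spectrogram.shape[0] * hop_length
    return np.zeros(n_samples)

text = "Tacotron 2 turns text into speech."
mel = predict_mel_spectrogram(embed_characters(text))
audio = wavenet_vocoder(mel)
print(f"{len(text)} characters -> {mel.shape} mel frames -> {len(audio)} samples "
      f"(~{len(audio) / SAMPLE_RATE:.1f} s of audio)")
```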
Here’s a link to examples. I’ll embed a few:
Medical text that I couldn’t pronounce:
Tongue twister:
And the best part: recordings of a human professional and the AI side by side. Can you tell which is which?