AI Talks Good

Getting computers to read text aloud and sound like a person has long been a challenge. But a Google team seems to have hit parity with human speech (paper):

The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

Sure, perfectly clear to me. But they follow with:

Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech

Comparable to professionally recorded speech, that I understand: MOS is a 1-to-5 rating from human listeners, so the gap to the real recordings is tiny. Tacotron 2 is a single neural network trained from data alone. That seems to be the direction AI is going: neural networks that are easily trained.
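To make the jargon from the quote concrete, here's a toy sketch of the two-stage data flow it describes: characters go into a feature prediction network that outputs a mel-scale spectrogram, and a vocoder turns that spectrogram into audio samples. Neither function below is the real model; the frame and sample ratios are made-up numbers just to show the shapes moving through the pipeline.

```python
import numpy as np

N_MELS = 80              # mel channels, as in the paper
FRAMES_PER_CHAR = 5      # invented ratio for this sketch
SAMPLES_PER_FRAME = 256  # invented hop size for this sketch

def feature_prediction(text: str) -> np.ndarray:
    """Stand-in for the seq2seq network: characters -> mel spectrogram.

    The real network is a trained recurrent model; here we just emit
    a plausibly shaped (frames x mel-channels) array of noise.
    """
    n_frames = len(text) * FRAMES_PER_CHAR
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, N_MELS))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for the modified WaveNet: spectrogram -> waveform.

    The real vocoder conditions on the spectrogram frames; here we
    just expand each frame into a fixed number of audio samples.
    """
    n_samples = mel.shape[0] * SAMPLES_PER_FRAME
    rng = np.random.default_rng(1)
    return rng.standard_normal(n_samples)

mel = feature_prediction("Hello world")
audio = vocoder(mel)
print(mel.shape)    # (55, 80): 11 characters x 5 frames, 80 mel channels
print(audio.shape)  # (14080,): 55 frames x 256 samples per frame
```

The point is the factoring: the hard linguistic work (text to spectrogram) and the hard acoustic work (spectrogram to waveform) are separate networks, each trained from data.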

Here’s a link to examples. I’ll embed a few:

Medical text that I couldn’t pronounce:

Tongue twister:

And the best part: side by side with a human professional and the AI. Can you tell which is which?
