[hts-users:03168] Questions from a novice about improving voice quality


I've just started learning about HMM-based speech synthesis, and I find it quite interesting. It seems more flexible than unit selection and produces more consistent output quality. In particular, it seems to avoid the glitches that are common in unit selection when the input text doesn't overlap much with the pre-recorded units. In that sense, HMM-based synthesis seems like a return to the strengths of formant-based synthesis. However, unit selection still seems to do a much better job of reproducing the unique sound of the original speaker's voice.

I noticed that the CMU ARCTIC SLT demo voice is quite small, ~2 MB uncompressed. And as one might expect for such a small voice definition, it doesn't faithfully reproduce the sound of the woman's voice in the original recordings. Indeed, it sounds much like the small-footprint voices that I've heard on mobile devices, e.g. the compact US English female voice on the iPhone (Samantha) or the US English female voice on Android (SVOX Pico).

So, can a larger HMM voice be built that more accurately reproduces the voice in the original recordings? For example, would it be possible to produce a larger mgc.pdf file to achieve this goal? In the case of the CMU ARCTIC SLT demo package, would increasing the MGCORDER option help? I ask about the MGC data because mgc.pdf is by far the largest file in the binary package, suggesting that it contains the most information about the sound of the voice.
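
To make the size question concrete, here is a rough back-of-the-envelope sketch in Python of how the spectral stream might grow with the MGC order. It rests on assumptions that may not match the demo exactly: each clustered leaf stores one mean and one variance per dimension, the feature vector holds static, delta, and delta-delta coefficients, and values are 4-byte floats. The leaf count and the orders tried are purely hypothetical, and the function names are my own, not part of HTS.

```python
def mgc_stream_floats(num_leaves, mgc_order, num_windows=3):
    """Rough count of floats in the spectral stream (assumed layout, not the HTS file format)."""
    # Assumption: order M gives M + 1 coefficients (c0..cM), each with
    # static, delta and delta-delta versions (num_windows = 3).
    vector_size = (mgc_order + 1) * num_windows
    # Assumption: diagonal covariance, so one mean and one variance per dimension.
    return num_leaves * vector_size * 2


def size_mb(num_floats, bytes_per_float=4):
    """Convert a float count to megabytes (10**6 bytes), assuming 4-byte floats."""
    return num_floats * bytes_per_float / 1e6


if __name__ == "__main__":
    HYPOTHETICAL_LEAVES = 1000  # made-up number of clustered leaf distributions
    for order in (24, 34, 49):  # illustrative orders only
        n = mgc_stream_floats(HYPOTHETICAL_LEAVES, order)
        print(f"order={order:2d}  floats={n:7d}  approx size={size_mb(n):.2f} MB")
```

Under these assumptions the file size scales linearly with both the order and the number of leaves, so a larger mgc.pdf could come from either. Whether either one actually brings the output closer to the original speaker is exactly what I am asking.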

If not the MGC order, then are there any other numbers that can be tuned to improve voice quality, potentially at the expense of a larger model or longer training time?

Or is this a fundamental limitation of HMM-based synthesis?

Even if the latter is true, I would probably accept the SLT voice in day-to-day use; after all, my current favorite synthesizer is a formant synthesizer. I just want to know more about what's possible with HMM-based synthesis.