[hts-users:03168] Questions from a novice about improving voice quality
- From: Matt Campbell <mattcampbell@xxxxxxxxx>
- Date: Wed, 15 Feb 2012 08:52:53 -0600
I've just started learning about HMM-based speech synthesis, and I find
it quite interesting. It seems to me that HMM-based synthesis is more
flexible than unit selection, and produces a more consistent quality of
output. In particular, HMM-based synthesis seems to avoid the glitches
that are common in unit selection when the input text doesn't overlap
much with the pre-recorded units. So HMM-based synthesis seems like a
return to the strengths of formant-based synthesis. But it seems to me
that unit selection still does a much better job of reproducing the
unique sound of the original speaker's voice.
I noticed that the CMU ARCTIC SLT demo voice is quite small, ~2 MB
uncompressed. And as one might expect for such a small voice definition,
it doesn't faithfully reproduce the sound of the woman's voice in the
original recordings. Indeed, it sounds much like the small-footprint
voices that I've heard on mobile devices, e.g. the compact US English
female voice on the iPhone (Samantha) or the US English female voice on
Android (SVOX Pico).
So, can a larger HMM voice be built that more accurately reproduces
the voice in the original recordings? For example, would it be possible
to produce a larger mgc.pdf file to achieve this goal? In the case of
the CMU ARCTIC SLT demo package, would increasing the MGCORDER option
help? I ask about the MGC data because mgc.pdf is by far the largest
file in the binary package, suggesting that it contains the most
information about the sound of the voice.
If not the MGC order, then are there any other numbers that can be
tuned to improve voice quality, potentially at the expense of a larger
model or longer training time?
Or is this a fundamental limitation of HMM-based synthesis?
Even if the latter is true, I would probably accept the SLT voice in
day-to-day use; after all, my current favorite synthesizer is a formant
synthesizer. I just want to know more about what's possible with