
[hts-users:04134] Re: Dealing with "metallic" sounding of some consonants in HTS


 
 
02.10.2014, 00:27, "Ilya Edrenkin" <ilia@xxxxxxxxxxxxxxx>:
Hi,
 
Has anyone run into "metallic" or "ringing" sounding consonants in HTS, such as "z" and "th"?
 
Two examples are attached: the English "this is the zombie" from the cmu-slt voice and a Russian one with the text "zoya zabrala zebru". In both cases "z" sounds strange in almost the same way, although the training databases are disjoint and even the training systems differ (the English voice is taken from MaryTTS, the Russian one is built with hts-2.3alpha).
 
The training data itself does not contain this effect for "z" phones. Postfiltering or altering the GV weights doesn't seem to help. Setting the MSD threshold as high as 0.95 does help: the "metallic" sound of "z" disappears, but the vowels are of course distorted into whispering.
 
It seems that the main problem lies in the 2.5 kHz to 6 kHz frequency band. It is visible in the spectrogram (attached; positions 0.2, 0.5, 1.0). Applying a steep band-reject filter (almost removing this band) does help. I wonder if there is a cleaner way to deal with it? Perhaps tuning the MLSA filter or adjusting the mel-cepstral feature extraction parameters could help?
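For reference, the band-reject workaround mentioned above could be sketched roughly as follows in plain Python. This is only an illustration under assumptions that are not in the original message: a 16 kHz sampling rate, a 201-tap Hamming-windowed sinc FIR design, and hypothetical function names (lowpass_taps, bandstop_taps, fir_filter, rms). It is not the filter that was actually applied to the attached samples.

```python
import math

def lowpass_taps(cutoff_hz, fs, numtaps):
    """Hamming-windowed sinc low-pass FIR taps (numtaps is assumed odd)."""
    mid = (numtaps - 1) // 2
    fc = cutoff_hz / fs  # cutoff in cycles per sample
    taps = []
    for n in range(numtaps):
        k = n - mid
        # Ideal low-pass impulse response, with the k == 0 limit handled.
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (numtaps - 1))
        taps.append(ideal * window)
    return taps

def bandstop_taps(low_hz, high_hz, fs, numtaps=201):
    """Band-reject FIR: low-pass below low_hz plus high-pass above high_hz."""
    lp = lowpass_taps(low_hz, fs, numtaps)
    hp = [-t for t in lowpass_taps(high_hz, fs, numtaps)]
    hp[(numtaps - 1) // 2] += 1.0  # spectral inversion: delta minus low-pass
    return [a + b for a, b in zip(lp, hp)]

def fir_filter(signal, taps):
    """Zero-padded convolution, delay-compensated, same length as the input."""
    mid = (len(taps) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(signal):
                acc += t * signal[idx]
        out.append(acc)
    return out

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))
```

With taps = bandstop_taps(2500.0, 6000.0, 16000), a 4 kHz tone passed through fir_filter should be strongly attenuated, while a 1 kHz tone should pass almost unchanged. The pure-Python convolution is slow on full utterances; it is meant to show the idea, not to be used in production.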
 
Thank you for any advice!
 
Hello Ilya
 
While the sound quality problem does exist, I think the issue with your samples is a bit different from the one you are describing.
 
The English sample is actually OK, since it was trained more or less properly: the z is a bit fuzzy, but it is properly voiced. In the Russian sample, however, I see a few issues.
 
First of all, z is originally a voiced sound, and in many cases it is reduced to voiceless. In your case, in "zoya", it is definitely a voiced sound, while if you look at the spectrogram you'll see that the engine tries to make it voiceless. And the synthesizer does so inconsistently: the beginning of z is voiced (0.16-0.20) and the end of it is voiceless (0.20-0.25). That unusual transition makes you think you hear an artifact, as if she were speaking through clenched teeth.
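A quick way to spot this kind of mid-phone voicing flip without reading spectrograms is to look at the generated F0 track inside each phone segment. The sketch below is only an illustration: it assumes a 5 ms frame shift and an F0 list in which 0.0 marks an unvoiced frame, and the function name voicing_profile is hypothetical, so adapt it to whatever your F0 format really is.

```python
def voicing_profile(f0, start_s, end_s, frame_shift_s=0.005):
    """Return (voiced_fraction, switches) for the frames in [start_s, end_s).

    f0 holds one F0 value per frame; 0.0 is assumed to mean "unvoiced".
    switches counts voiced/unvoiced changes inside the segment.
    """
    first = int(round(start_s / frame_shift_s))
    last = int(round(end_s / frame_shift_s))
    flags = [value > 0.0 for value in f0[first:last]]
    if not flags:
        return 0.0, 0
    voiced_fraction = sum(flags) / len(flags)
    switches = sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    return voiced_fraction, switches
```

For a z like the one in "zoya", calling voicing_profile(f0, 0.16, 0.25) on a track that is voiced for 0.16-0.20 and unvoiced for 0.20-0.25 would report a voiced fraction well below 1 and one switch, which marks the segment as worth checking.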
 
I see other phonetic issues in your synthesis; for example, "zabrala" is incorrectly rendered as "z a b r a l aa", while it should be closer to "z schwa b r a l aa".
 
Russian is a pretty flexible language, and although many believe it is read as it is written, it is very hard to train properly. For example, this "z"/"s" reduction must be properly accounted for in the training database, and you need to debug all the issues in the training database markup to get properly synthesized sounds. Unlike ASR, synthesis doesn't tolerate such mistakes.
 
Another possible way to improve the situation is to increase the amount of training data.
 
 
