Hi Nickolay,
Thank you very much for the detailed answer.
Yes, I'm already aware of the problems with the schwa in
this particular example -- I obviously need to fix it.
So if I understood you correctly, the reason could be
that the models for "z" confuse voiced and unvoiced
versions. I am, however, quite confident that all (or almost
all) places where "z" turns into "s" in the training
data are correctly reflected in the lexicon. (This was in
fact double-checked -- by hand by a linguist, and also
automatically with a discriminatively trained nnet acoustic
model.) I will of course follow your advice and do
additional checks; it never hurts to do so.
I think I can demonstrate quite clearly that the "z"
problem in this case is related to mel-cepstral
analysis/synthesis -- I can reproduce the effect without
an actual TTS system, just by converting a waveform to
mgc+lf0 and reconstructing it from them. Here I attach two
waveforms with identical text: one read by a real speaker,
one recovered from the first using extracted lf0 and mgc
(with the default parameters of hts-2.3alpha). The "z"
phones exhibit exactly the same "metallic" problem. (You
may find the utterance text familiar :))
Note: I had to downsample the attached files to 16 kHz
so that the mailing list doesn't reject the message due to
its size; the original was 48 kHz, and all manipulations
were performed on the 48 kHz file to conform with the hts
example setup.
That makes me believe that I am using the
analysis-synthesis system incorrectly. Or perhaps the
default settings in hts-2.3alpha are suboptimal? I am using
SPTK-3.6, in case it is relevant.
Many thanks again!
Ilya
Hi,
Has anyone run into particular consonants, such as "z"
and "th", sounding "metallic" or "ringing" in HTS?
Attached are a couple of examples: one English,
"this is the zombie" from the cmu-slt voice, and one Russian,
with the text "zoya zabrala zebru". In both cases "z" sounds
a bit strange in almost the same way, although the
training databases are disjoint and even the training
systems are different (the English one is taken from MaryTTS,
the Russian one is built on hts-2.3alpha).
The training data itself does not contain such an effect
for "z" phones. Postfiltering or altering the GV weights
doesn't seem to help. Setting the MSD threshold as high as
0.95 does help: the "metallic" sound of "z" disappears,
but the vowels are of course distorted into
"whispering".
It seems that the main problem is in the frequency
band 2.5-6 kHz. It is visible in the spectrogram
(attached; positions 0.2, 0.5, 1.0). Applying a steep
band-reject filter (almost removing this band) does
help. I wonder if there is a cleaner way to deal with
it? Perhaps tuning the MLSA filter or playing with the
mel-cepstral feature extraction parameters could help?
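For reference, the band-reject workaround can be sketched roughly as
follows. This is only an illustration of the idea, not the exact filter
that was applied to the attached samples: the windowed-sinc design, the
Hamming window, the 255-tap length and the function names are all
assumptions; only the 2.5-6 kHz band and the 48 kHz rate come from the
text above.

```python
import math

def bandreject_taps(f_lo, f_hi, fs, num_taps=255):
    """Windowed-sinc FIR band-reject taps (Hamming window, odd length)."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    mid = num_taps // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            bandpass = 2.0 * (f_hi - f_lo) / fs
        else:
            bandpass = (math.sin(2.0 * math.pi * f_hi * k / fs)
                        - math.sin(2.0 * math.pi * f_lo * k / fs)) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        # Band-reject = unit impulse minus the windowed band-pass response.
        taps.append((1.0 if k == 0 else 0.0) - window * bandpass)
    return taps

def gain_at(taps, freq, fs):
    """Magnitude response of the FIR filter at a single frequency in Hz."""
    re = sum(h * math.cos(2.0 * math.pi * freq * n / fs) for n, h in enumerate(taps))
    im = sum(h * math.sin(2.0 * math.pi * freq * n / fs) for n, h in enumerate(taps))
    return math.hypot(re, im)

def apply_fir(taps, samples):
    """Plain convolution; the output has the same length as the input."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, h in enumerate(taps):
            if i - j >= 0:
                acc += h * samples[i - j]
        out.append(acc)
    return out

FS = 48000.0  # sampling rate of the original recordings
taps = bandreject_taps(2500.0, 6000.0, FS)
for f in (1000.0, 4000.0, 10000.0):
    print(f, round(gain_at(taps, f, FS), 3))
```

The printed gains should be close to 1 outside the band and close to 0
inside it. A filter like this removes the artifact together with
everything else in that band, which is why a cleaner fix is preferable.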
Thank you for any advice!
Hello Ilya
While the sound quality problem does exist, I think that
the issue with your sample is a bit different from the one
you are describing.
The English sample is actually OK, since it was trained more
or less properly: the z, though a bit fuzzy, is properly
voiced. As for the Russian sample, I see a few issues in it.
First of all, the z sound is originally voiced, and in many
cases it is reduced to voiceless. In your case, in "zoya",
it is definitely a voiced sound, while if you look at the
spectrogram you'll see that the engine tries to make it
voiceless. And the synthesizer does it inconsistently: the
beginning of the z is voiced (0.16-0.20) and the end of it
is voiceless (0.20-0.25). That unusual transition makes
you think you hear an artifact, as if she were speaking through
closed teeth.
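One rough way to check this kind of inconsistency on the generated F0
track is sketched below. It assumes an HTS-style lf0 sequence in which
unvoiced frames are marked with -1e10 and a 5 ms frame shift; the
function name and the toy lf0 values are made up for illustration, only
the 0.16-0.25 segment times come from the sample discussed here.

```python
UNVOICED_LF0 = -1.0e10  # assumption: HTS-style marker for unvoiced frames

def voicing_profile(lf0, start_s, end_s, frame_shift_s=0.005):
    """Return (voiced_fraction, n_switches) for frames in [start_s, end_s)."""
    first = int(round(start_s / frame_shift_s))
    last = int(round(end_s / frame_shift_s))
    # A frame counts as voiced if its lf0 is a real value, not the marker.
    flags = [value > UNVOICED_LF0 / 2 for value in lf0[first:last]]
    if not flags:
        return 0.0, 0
    # Count voiced/unvoiced flips between neighbouring frames.
    switches = sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    return sum(flags) / len(flags), switches

# Toy lf0 track (hypothetical values): silence, then a "z" that starts
# voiced at 0.16 s and turns unvoiced at 0.20 s, then a voiced vowel.
lf0 = ([UNVOICED_LF0] * 32      # 0.00-0.16 s
       + [5.2] * 8              # 0.16-0.20 s, voiced part of "z"
       + [UNVOICED_LF0] * 10    # 0.20-0.25 s, unvoiced part of "z"
       + [5.3] * 20)            # following vowel

fraction, switches = voicing_profile(lf0, 0.16, 0.25)
print(fraction, switches)
```

A phone that is neither almost fully voiced nor almost fully unvoiced,
or that flips voicing in the middle, is a candidate for the effect
described above.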
I see other phonetic issues in your synthesis; for
example, "zabrala" is incorrectly rendered as "z a b r a l
aa" while it should be closer to "z schwa b r a l aa".
Russian is a pretty flexible language, and though many
believe it is read as it is written, it's very hard to train
it properly. For example, this "z"/"s" reduction must be
properly accounted for in the training database, and you need to
debug all the issues in the training database markup to get
properly synthesized sounds. Unlike ASR, synthesis doesn't
tolerate such mistakes.
Another possible way to improve this situation is to
increase the training data size.