Subject: [hts-users:04138] Re: Dealing with "metallic" sounding of some consonants in HTS

From: Csapo Tamas Gabor <csapot@xxxxxxxxxxx>

Date: Tue, 07 Oct 2014 22:04:25 +0200

Delivered-to: hts-users@xxxxxxxxxxxxxxx

Hi Ilya,

Another problem with the HTS-demo pulse-noise excitation is that it has a fixed maximum voiced frequency (MVF) of 6 kHz - below this the excitation is voiced, and above it unvoiced.
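[Editor's sketch] To make the fixed-MVF split concrete, here is a minimal Python illustration. It is not the HTS-demo code. The voiced part is built from harmonics of F0 up to the MVF, and the unvoiced part is white noise with everything below the MVF removed. The function names, the 101-tap Hamming-windowed sinc filter, and the 210 Hz / 48 kHz values in the usage line are all assumptions made for illustration.

```python
import math
import random

def lowpass_kernel(cutoff_hz, fs, taps=101):
    """Hamming-windowed sinc low-pass FIR kernel, normalized to unity DC gain."""
    fc = cutoff_hz / fs          # cutoff in cycles per sample
    mid = taps // 2
    kernel = []
    for i in range(taps):
        m = i - mid
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (taps - 1))
        kernel.append(ideal * window)
    total = sum(kernel)
    return [h / total for h in kernel]

def fir(signal, kernel):
    """Centered, zero-padded convolution (output has the same length as the input)."""
    mid = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = n + mid - k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out

def mixed_excitation(f0_hz, mvf_hz, fs, n_samples, seed=0):
    """Return (voiced, noise): harmonics up to the MVF, and noise above the MVF."""
    n_harm = math.floor(mvf_hz / f0_hz)   # harmonics at or below the MVF
    voiced = [
        sum(math.cos(2 * math.pi * k * f0_hz * t / fs) for k in range(1, n_harm + 1))
        for t in range(n_samples)
    ]
    rng = random.Random(seed)
    white = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    low = fir(white, lowpass_kernel(mvf_hz, fs))
    noise = [w - l for w, l in zip(white, low)]   # high-pass = input minus low-pass
    return voiced, noise

# Fixed MVF of 6 kHz, as in the HTS demo described above (other values are assumed).
voiced, noise = mixed_excitation(210.0, 6000.0, 48000, 480)
```

With a dynamic MVF, the `mvf_hz` argument would simply change from frame to frame instead of staying at 6000.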

There is a recent paper showing that using a dynamically changing maximum voiced frequency can be beneficial:

T.Drugman, T. Raitio, Excitation Modeling for HMM-based Speech Synthesis: Breaking Down the Impact of Periodic and Aperiodic Components, ICASSP 2014,
http://tcts.fpms.ac.be/~drugman/files/ICASSP14_Excitation.pdf

Best,
Tamas

On 2014.10.07 at 21:56, Ilya Edrenkin wrote:

Hi Nickolay,

Thanks for the reference -- this article was exactly what I needed.

Ilya

02.10.2014, 23:04, "Nickolay V.Shmyrev" <nshmyrev@xxxxxxxxx>:

Well, I believe that is exactly the "buzziness" problem that mixed excitation solves. With plain full-band excitation, it still tries to create voiced parts in the middle of the spectrum, around 2-4 kHz. You can compare the spectra at 2.97 - 3.03 with Wavesurfer.

I'm not sure exactly which HTS config you are using, but OpenMary, which has mixed excitation by default, should handle the different bands much better. That's why I preferred your English sample.

You can also check

http://www.cstr.ed.ac.uk/downloads/publications/2013/ssw8_OS4-3_Hu.pdf

02.10.2014, 23:51, "Ilya Edrenkin" <ilia@xxxxxxxxxxxxxxx>:

Hi Nickolay,

Thank you very much for the detailed answer.

Yes, I'm already aware of the problems with the schwa in this particular example -- obviously I need to fix it.

So, if I understood you correctly, the reason could be that the models for "z" confuse voiced and unvoiced versions. I am, however, quite confident that all (or almost all) places where "z" turns into "s" in the training data are correctly reflected in the lexicon. (It was in fact double-checked: by a linguist by hand, and also automatically by a discriminatively trained nnet acoustic model.) I will of course follow your advice and do additional checks; it never hurts to do so.

I think I can demonstrate quite clearly that the "z" problem in this case is related to mel-cepstral analysis/synthesis -- I can reproduce the effect without an actual TTS system, just by converting a waveform to mgc+lf0 and reconstructing it from them. I attach two waveforms with identical text: one read by a real speaker, and one recovered from the first using the extracted lf0 and mgc (with the default parameters of hts-2.3alpha). The "z" phones exhibit exactly the same "metallic" problem. (You may find the utterance text familiar :))

Note: I had to downsample the attached files to 16 kHz so that the mailing list doesn't reject the letter due to its size; the original was 48 kHz, and all manipulations were performed on the 48 kHz file to conform with the hts example setup.

That makes me believe that I am using the analysis-synthesis system incorrectly. Perhaps the settings in hts-2.3alpha are suboptimal? I am using SPTK-3.6, if it is relevant.

Many thanks again!

Ilya

02.10.2014, 20:10, "Nickolay V.Shmyrev" <nshmyrev@xxxxxxxxx>:

02.10.2014, 00:27, "Ilya Edrenkin" <ilia@xxxxxxxxxxxxxxx>:

Hi,

Has anyone encountered a "metallic" or "ringing" sound on particular consonants in HTS, such as "z" and "th"?

Attached are a couple of examples: one English, "this is the zombie", from the cmu-slt voice, and one Russian, with the text "zoya zabrala zebru". In both cases "z" sounds a bit strange in almost the same way, although the training databases are disjoint and even the training systems are different (English is taken from MaryTTS, Russian is built on hts-2.3alpha).

The training data itself does not contain such an effect for "z" phones. Postfiltering or altering the GV weights doesn't seem to help. Setting the MSD threshold as high as 0.95 does help: the "metallic" sound on "z" disappears, but vowels are of course distorted into "whispering".
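[Editor's sketch] The trade-off with the MSD threshold can be illustrated with a few lines of Python. The assumption here is that a state is synthesized as voiced when its voiced-space weight exceeds the threshold; the weights below are invented for illustration and do not come from any trained model.

```python
def voicing_decisions(voiced_weights, threshold=0.5):
    """Map each unit to True (voiced) if its voiced-space weight exceeds the threshold."""
    return {unit: weight > threshold for unit, weight in voiced_weights.items()}

# Hypothetical MSD voiced-space weights: a weakly voiced "z" and three vowels.
weights = {"z": 0.62, "o": 0.97, "a": 0.93, "e": 0.90}

default_flags = voicing_decisions(weights, threshold=0.5)   # everything voiced
strict_flags = voicing_decisions(weights, threshold=0.95)   # "z" unvoiced, but so are "a" and "e"
```

Raising the threshold far enough to push "z" to unvoiced also pushes any vowel state below that value to unvoiced, which matches the "whispering" described above.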

It seems that the main problem is in the 2.5-6 kHz frequency band. It is visible in the spectrogram (attached; positions 0.2, 0.5, 1.0). Applying a steep band-reject filter (almost removing this band) does help. I wonder if there is a cleaner way to deal with it? Perhaps tuning the MLSA filter or playing with the mel-cepstral feature extraction parameters could help?
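[Editor's sketch] For reference, a band-reject filter of the kind mentioned above can be sketched in plain Python as a windowed-sinc FIR. This is only an illustration of the workaround, not the filter actually used in the experiment; the 201-tap length, the Hamming window, and the function names are assumptions.

```python
import math

def windowed_sinc(cutoff_hz, fs, taps):
    """Hamming-windowed sinc low-pass kernel (taps must be odd)."""
    fc = cutoff_hz / fs          # cutoff in cycles per sample
    mid = taps // 2
    kernel = []
    for i in range(taps):
        m = i - mid
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        kernel.append(ideal * (0.54 - 0.46 * math.cos(2 * math.pi * i / (taps - 1))))
    return kernel

def bandstop_kernel(low_hz, high_hz, fs, taps=201):
    """Band-reject kernel: unit impulse minus the band-pass between low_hz and high_hz."""
    lo = windowed_sinc(low_hz, fs, taps)
    hi = windowed_sinc(high_hz, fs, taps)
    mid = taps // 2
    return [(1.0 if i == mid else 0.0) - (hi[i] - lo[i]) for i in range(taps)]

def fir(signal, kernel):
    """Centered, zero-padded convolution (output has the same length as the input)."""
    mid = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = n + mid - k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out

# Reject 2.5-6 kHz at the 48 kHz sampling rate used for the original recordings.
kernel = bandstop_kernel(2500.0, 6000.0, 48000)
```

Applying `fir(samples, kernel)` to a list of float samples suppresses the band while leaving lower and higher frequencies largely intact; it hides the artifact rather than fixing its cause.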

Thank you for any advice!

Hello Ilya

While the problem of sound quality exists, I think that the issue with your sample is a bit different from the one you are describing.

The English sample is actually OK, since it was trained more or less properly. The z, though a bit fuzzy, is properly voiced. As for the Russian sample, I see a few issues in it.

First of all, the z sound is originally voiced, and in many cases it is reduced to voiceless. In your case, in "zoya", it is definitely a voiced sound, while if you look at the spectrogram you'll see that the engine tries to make it voiceless. And the synthesizer does it inconsistently: the beginning of the z is voiced (0.16-0.20) and the end of it is voiceless (0.20-0.25). That unusual transition makes you think you hear an artifact, as if she were speaking through closed teeth.

I see other phonetic issues in your synthesis; for example, "zabrala" is incorrectly rendered as "z a b r a l aa", while it should be closer to "z schwa b r a l aa".

Russian is a pretty flexible language, and though many believe it is read as it is written, it is very hard to train properly. For example, this "z"/"s" reduction must be properly accounted for in the training database, and you need to debug all the issues in the training database markup to get properly synthesized sounds. Unlike ASR, synthesis doesn't tolerate such mistakes.
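[Editor's sketch] A voicing flip inside a single phone, like the one described above, can also be checked directly on the generated lf0 track instead of by eye on the spectrogram. The Python below is a rough sketch under two assumptions: unvoiced frames are marked with -1e10 (as in SPTK-style lf0 files) and the frame shift is 5 ms. The track itself is synthetic and only mimics the timings mentioned above.

```python
import math

UNVOICED_LF0 = -1.0e10  # assumption: marker value for unvoiced frames

def voicing_flips(lf0, start_s, end_s, frame_shift_s=0.005):
    """List (time_s, direction) for every voicing change inside [start_s, end_s)."""
    first = round(start_s / frame_shift_s)
    last = min(round(end_s / frame_shift_s), len(lf0))
    flips = []
    for i in range(first + 1, last):
        prev_voiced = lf0[i - 1] > UNVOICED_LF0 / 2
        cur_voiced = lf0[i] > UNVOICED_LF0 / 2
        if prev_voiced != cur_voiced:
            direction = "voiced->unvoiced" if prev_voiced else "unvoiced->voiced"
            flips.append((i * frame_shift_s, direction))
    return flips

# Synthetic track: "z" voiced over 0.16-0.20 s, unvoiced over 0.20-0.25 s, then a vowel.
lf0 = (
    [UNVOICED_LF0] * 32
    + [math.log(180.0)] * 8
    + [UNVOICED_LF0] * 10
    + [math.log(180.0)] * 20
)
flips = voicing_flips(lf0, 0.16, 0.25)
```

Running the same check over every "z" interval in the label files would show how often the synthesizer switches voicing mid-phone.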

Another possible way to improve this situation is to increase the training data size.