
[hts-users:01800] Re: F0 modelling


Hi,

Matt Shannon wrote:

That's a very clear, concise explanation, Heiga. I've wondered about this in the past too -- it seems amazing to me that you don't get far too many voiced/unvoiced transitions.

I think this depends heavily on the performance of the F0 extraction method. If the extracted F0 values are noisy and include many voiced/unvoiced errors, voiced/unvoiced transitions often occur at state boundaries, which clearly degrades the quality of the synthesized speech.

Some F0 extraction methods make the voiced/unvoiced decision for each frame independently. However, various F0 extraction methods, such as get_f0, use a Viterbi search that includes a voiced/unvoiced transition cost, which tends to reduce the number of voiced/unvoiced transitions in the extracted F0 contours. I personally think this type of F0 extraction method is well suited to HMM-based speech synthesis.
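To illustrate why a transition cost reduces spurious voiced/unvoiced switches, here is a minimal sketch of a two-state Viterbi search over frame labels. This is not the get_f0 algorithm itself (which searches over pitch candidates with additional cost terms); the function name smooth_vuv, the per-frame local costs, and the single switch_cost parameter are all assumptions made for illustration.

```c
#include <stdlib.h>

/* Hypothetical sketch: choose one label per frame (1 = voiced, 0 = unvoiced)
 * minimizing the sum of local costs plus switch_cost for every V/UV change.
 * cost_uv[t] / cost_v[t] are assumed per-frame costs of calling frame t
 * unvoiced / voiced. Returns 0 on success, -1 on bad input or allocation
 * failure. */
int smooth_vuv(const double *cost_uv, const double *cost_v, int n,
               double switch_cost, int *labels)
{
    if (n <= 0) return -1;

    /* back[t*2 + s]: best state at frame t-1 given state s at frame t */
    unsigned char *back = malloc((size_t)n * 2);
    if (!back) return -1;

    double best[2] = { cost_uv[0], cost_v[0] };
    back[0] = 0;
    back[1] = 1;

    for (int t = 1; t < n; t++) {
        double local[2] = { cost_uv[t], cost_v[t] };
        double next[2];
        for (int s = 0; s < 2; s++) {
            double stay = best[s];
            double sw = best[1 - s] + switch_cost;
            if (stay <= sw) {
                next[s] = stay + local[s];
                back[t * 2 + s] = (unsigned char)s;
            } else {
                next[s] = sw + local[s];
                back[t * 2 + s] = (unsigned char)(1 - s);
            }
        }
        best[0] = next[0];
        best[1] = next[1];
    }

    /* Trace back from the cheaper final state. */
    int s = (best[1] < best[0]) ? 1 : 0;
    for (int t = n - 1; t >= 0; t--) {
        labels[t] = s;
        s = back[t * 2 + s];
    }

    free(back);
    return 0;
}
```

With switch_cost set to 0 this reduces to an independent decision per frame, so a single noisy frame flips the label. With a sufficiently large switch_cost the isolated flip is absorbed into the surrounding voiced or unvoiced run.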

In effect, you're making the voiced/unvoiced decision independently for each frame. Since you're maximizing, the resulting voiced/unvoiced decision is at least constant in each state, but I'm still surprised we don't get far too many voiced/unvoiced transitions at state boundaries. I guess there must be enough information in the full-context models to make a very reliable voiced/unvoiced decision, independent of the voiced/unvoicedness of neighbouring frames/states?

I agree with you. It should also be noted that decision tree-based context clustering tries to purify the voiced/unvoiced frames associated with each cluster:

HHEd.c:AccSumProb()

	prob+=weight*(x+occ*log(occ/acc->occ));

In the above statement, occ*log(occ/acc->occ) corresponds to the log likelihood of the voiced/unvoiced weights. Here acc->occ is #V+#UV, and occ is #V (if the current distribution is for voiced) or #UV (if the current distribution is for unvoiced).

If #V+#UV=200 and #V=#UV=100 (weight_v=weight_uv=0.5),

 #V*ln(#V/(#V+#UV)) + #UV*ln(#UV/(#V+#UV)) = -138.63...

However, if #V=199 and #UV=1 (weight_v=0.995, weight_uv=0.005),

 #V*ln(#V/(#V+#UV)) + #UV*ln(#UV/(#V+#UV)) = -6.30...

Clearly, the latter case gives the higher likelihood, so a question that produces such a split will be chosen. As a result, tree-based clustering tends to purify the voiced/unvoiced frames associated with states.

Best regards,

Heiga ZEN (Byung Ha CHUN)

--
--------------------------
Heiga ZEN (Byung Ha CHUN)
Speech Technology Group
Cambridge Research Lab
Toshiba Research Europe
phone: +44 1223 436975
