[hts-users:01800] Re: F0 modelling

Subject: [hts-users:01800] Re: F0 modelling
From: "Heiga Zen (Byung Ha CHUN)" <heiga.zen@xxxxxxxxxxxxxxxxx>
Date: Wed, 26 Nov 2008 12:09:47 +0000
Delivered-to: hts-users@xxxxxxxxxxxxxxx

Hi,

Matt Shannon wrote:

That's a very clear, concise explanation, Heiga. I've wondered about this in the past too -- it seems amazing to me that you don't get far too many voiced/unvoiced transitions.

I think this depends heavily on the performance of the F0 extraction method. If the extracted F0 values are noisy and include a lot of voiced/unvoiced errors, voiced/unvoiced transitions at state boundaries often occur. This clearly degrades the quality of the synthesized speech.

Some F0 extraction methods make the voiced/unvoiced decision for each frame independently. However, various F0 extraction methods such as get_f0 use a Viterbi search and include a voiced/unvoiced transition cost, so they tend to reduce the number of voiced/unvoiced transitions in the extracted F0. I personally think that this type of F0 extraction method is well suited to HMM-based speech synthesis.
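To illustrate the idea of a transition cost, here is a minimal Python sketch. It is not get_f0's actual algorithm; the per-frame scores, the function name smooth_vuv and the cost value are all hypothetical. It only shows how a Viterbi search with a penalty on each voiced/unvoiced switch removes isolated flips that an independent per-frame decision would keep.

```python
def smooth_vuv(voiced_scores, transition_cost=2.0):
    """Viterbi-decode a voiced (1) / unvoiced (0) label sequence.

    voiced_scores[t] > 0 favours voiced and < 0 favours unvoiced
    (hypothetical per-frame evidence, e.g. a log-odds value).
    Every voiced/unvoiced switch adds `transition_cost` to the path cost.
    """
    if not voiced_scores:
        return []

    def local(t, state):
        # Cost of labelling frame t with `state`; lower is better.
        return -voiced_scores[t] if state == 1 else voiced_scores[t]

    cost = [local(0, 0), local(0, 1)]
    backpointers = []
    for t in range(1, len(voiced_scores)):
        new_cost, ptr = [], []
        for state in (0, 1):
            stay = cost[state]
            switch = cost[1 - state] + transition_cost
            if stay <= switch:
                prev, best = state, stay
            else:
                prev, best = 1 - state, switch
            new_cost.append(best + local(t, state))
            ptr.append(prev)
        cost = new_cost
        backpointers.append(ptr)

    # Trace back from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def count_transitions(labels):
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


scores = [3, 3, -1, 3, 3, -3, -3, 1, -3]   # made-up frame scores
independent = smooth_vuv(scores, transition_cost=0.0)
smoothed = smooth_vuv(scores, transition_cost=2.0)
print(independent, count_transitions(independent))
print(smoothed, count_transitions(smoothed))
```

With a zero cost the result is the same as deciding each frame on its own; with a positive cost the two weakly supported flips in this made-up sequence are overridden by their neighbours.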

In effect, you're making the voiced/unvoiced decision independently for each frame. Since you're maximizing, the resulting voiced/unvoiced decision is at least constant in each state, but I'm still surprised we don't get far too many voiced/unvoiced transitions at state boundaries. I guess there must be enough information in the full-context models to make a very reliable voiced/unvoiced decision, independent of the voiced/unvoicedness of neighbouring frames/states?

I agree with you. It should also be noted that decision tree-based context clustering tends to purify the voiced/unvoiced frames associated with each cluster:


HHEd.c:AccSumProb()

	prob+=weight*(x+occ*log(occ/acc->occ));

In the above statement, occ*log(occ/acc->occ) corresponds to the log likelihood of the voiced/unvoiced weights. acc->occ is #V+#UV, and occ is #V (if the current distribution is for voiced) or #UV (if the current distribution is for unvoiced).


If #V+#UV=200 and #V=#UV=100 (weight_v=weight_uv=0.5),

 #V*ln(#V/(#V+#UV)) + #UV*ln(#UV/(#V+#UV)) = -138.62...

However, if #V=199 and #UV=1 (weight_v=0.995, weight_uv=0.005),

 #V*ln(#V/(#V+#UV)) + #UV*ln(#UV/(#V+#UV)) = -6.29...

Clearly, the latter gives the higher likelihood, so a question that produces such a split will be chosen. Tree-based clustering therefore tends to purify the voiced/unvoiced frames associated with states.
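The arithmetic above can be checked with a short Python sketch. It evaluates only the voiced/unvoiced weight term, not the x term in the HHEd.c statement, and the function name vuv_weight_loglik and the example split into two children are hypothetical, chosen just for illustration.

```python
import math


def vuv_weight_loglik(n_voiced, n_unvoiced):
    """Log likelihood of the V/UV weights for one cluster:
    #V*ln(#V/(#V+#UV)) + #UV*ln(#UV/(#V+#UV)).
    """
    total = n_voiced + n_unvoiced
    loglik = 0.0
    for count in (n_voiced, n_unvoiced):
        if count > 0:  # treat 0*ln(0) as 0 for a perfectly pure cluster
            loglik += count * math.log(count / total)
    return loglik


# The two cases discussed above.
mixed = vuv_weight_loglik(100, 100)
nearly_pure = vuv_weight_loglik(199, 1)
print(f"mixed cluster:       {mixed:.2f}")
print(f"nearly pure cluster: {nearly_pure:.2f}")

# Hypothetical question that splits the mixed cluster into two
# nearly pure children (99 V + 1 UV, and 1 V + 99 UV).
children = vuv_weight_loglik(99, 1) + vuv_weight_loglik(1, 99)
gain = children - mixed
print(f"gain from the split: {gain:.2f}")
```

The gain is positive whenever a question separates voiced from unvoiced frames better than the parent cluster does, which is why such questions are favoured during clustering.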


Best regards,

Heiga ZEN (Byung Ha CHUN)

--
--------------------------
Heiga ZEN (Byung Ha CHUN)
Speech Technology Group
Cambridge Research Lab
Toshiba Research Europe
phone: +44 1223 436975

