[hts-users:01800] Re: F0 modelling
Hi,
Matt Shannon wrote:
That's a very clear, concise explanation, Heiga. I've wondered about
this in the past too -- it seems amazing to me that you don't get far
too many voiced/unvoiced transitions.
I think this depends heavily on the performance of the F0 extraction
method. If the extracted F0 values are noisy and include a lot of
voiced/unvoiced errors, voiced/unvoiced transitions at state boundaries
occur often. Clearly this degrades the quality of the synthesized speech.
Some F0 extraction methods make the voiced/unvoiced decision for each
frame independently. However, various F0 extraction methods, such as
get_f0, use a Viterbi search that includes a voiced/unvoiced transition
cost, which tends to reduce the number of voiced/unvoiced transitions in
the extracted F0 contours. I personally think that this type of F0
extraction method is well suited to HMM-based speech synthesis.
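To illustrate why a transition cost reduces spurious voiced/unvoiced switches, here is a minimal sketch of a two-state Viterbi search over per-frame voicing scores. This is not the actual get_f0 algorithm; the function name smooth_vuv, the score convention (positive favours voiced) and the transition_cost value are assumptions made only for this example.

```python
def smooth_vuv(voiced_scores, transition_cost=1.0):
    """Two-state Viterbi decode (0 = unvoiced, 1 = voiced).

    voiced_scores[t] > 0 favours voiced, < 0 favours unvoiced.
    Every change of state along the path adds transition_cost.
    Hypothetical illustration, not the get_f0 implementation.
    """
    if not voiced_scores:
        return []

    def local(t, s):
        # Cost of labelling frame t with state s (lower is better).
        return -voiced_scores[t] if s == 1 else voiced_scores[t]

    cost = [local(0, 0), local(0, 1)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(voiced_scores)):
        new_cost, ptr = [], []
        for s in (0, 1):
            stay = cost[s]
            switch = cost[1 - s] + transition_cost
            if stay <= switch:
                prev, best = s, stay
            else:
                prev, best = 1 - s, switch
            new_cost.append(best + local(t, s))
            ptr.append(prev)
        cost = new_cost
        back.append(ptr)

    # Backtrace from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Made-up scores with one weakly "unvoiced" frame inside a voiced region.
scores = [2, 2, -0.5, 2, 2, -2, -2, -2]
independent = smooth_vuv(scores, transition_cost=0.0)
smoothed = smooth_vuv(scores, transition_cost=1.0)
```

With a zero transition cost the search reduces to an independent per-frame decision, so the weak frame flips to unvoiced. With a cost of 1.0 that isolated flip is no longer worth two transitions, and only the genuine voiced-to-unvoiced boundary remains.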
In effect, you're making the voiced/unvoiced decision independently for
each frame. Since you're maximizing, the resulting voiced/unvoiced
decision is at least constant in each state, but I'm still surprised
we don't get far too many voiced/unvoiced transitions at state
boundaries. I guess there must be enough information in the
full-context models to make a very reliable voiced/unvoiced decision,
independent of the voiced/unvoicedness of neighbouring frames/states?
I agree with you. It should also be noted that decision tree-based
context clustering tries to purify the voiced/unvoiced frames associated
with each cluster:
HHEd.c:AccSumProb()
prob+=weight*(x+occ*log(occ/acc->occ));
In the above statement, occ*log(occ/acc->occ) corresponds to the log
likelihood of the voiced/unvoiced weights. acc->occ is #V+#UV, and occ
is #V (if the current distribution is for voiced) or #UV (if the current
distribution is for unvoiced).
If #V+#UV=200 and #V=#UV=100 (weight_v=weight_uv=0.5),
#V*ln(#V/(#V+#UV)) + #UV*ln(#UV/(#V+#UV)) = -138.62...
However, if #V=199 and #UV=1 (weight_v=0.995, weight_uv=0.005),
#V*ln(#V/(#V+#UV)) + #UV*ln(#UV/(#V+#UV)) = -6.29...
Clearly the latter gives the higher likelihood, so a question that
produces the latter kind of split will be chosen. Tree-based clustering
therefore tends to purify the voiced/unvoiced frames associated with states.
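The comparison above can be checked with a few lines of Python. This is only a sketch of the weight term occ*log(occ/acc->occ) summed over the voiced and unvoiced spaces; the function name vuv_log_likelihood is made up for this example and is not part of HHEd.

```python
import math


def vuv_log_likelihood(n_voiced, n_unvoiced):
    """Sum of occ * ln(occ / total) over the voiced and unvoiced spaces.

    Hypothetical helper mirroring the occ*log(occ/acc->occ) term only;
    the Gaussian part (x in the HHEd statement) is left out.
    """
    total = n_voiced + n_unvoiced
    log_lik = 0.0
    for occ in (n_voiced, n_unvoiced):
        if occ > 0:  # treat 0 * ln(0) as 0
            log_lik += occ * math.log(occ / total)
    return log_lik


mixed = vuv_log_likelihood(100, 100)        # #V = #UV = 100
nearly_pure = vuv_log_likelihood(199, 1)    # #V = 199, #UV = 1
completely_pure = vuv_log_likelihood(200, 0)

print(mixed, nearly_pure, completely_pure)
```

The term is never positive and reaches its maximum of 0 for a completely pure cluster, which is why splits that separate voiced from unvoiced frames are favoured.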
Best regards,
Heiga ZEN (Byung Ha CHUN)
Speech Technology Group
Cambridge Research Lab
Toshiba Research Europe
phone: +44 1223 436975