[hts-users:01801] Re: F0 modelling


Heiga Zen (Byung Ha CHUN) wrote:
> I think this matter depends heavily on the performance of the F0
> extraction method.  If the extracted F0 values are noisy and include a
> lot of voiced/unvoiced errors, voiced/unvoiced transitions at state
> boundaries often occur.  Apparently this degrades the quality of the
> synthesized speech.
> 
> Some F0 extraction methods make the voiced/unvoiced decision for each
> frame independently.  However, various F0 extraction methods such as
> get_f0 use a Viterbi search and include a voiced/unvoiced transition
> cost, so they tend to reduce voiced/unvoiced transitions in the
> extracted F0s.  I personally think that this type of F0 extraction
> method is well suited to HMM-based speech synthesis.
I agree: if your original F0 extraction has too many voiced/unvoiced
transitions, that can only make things worse at synthesis time.
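
As a rough illustration of the difference between per-frame decisions
and a Viterbi search with a transition cost, here is a toy sketch.  It
is not get_f0's actual algorithm: the function names, the per-frame
voicing probabilities and the value of transition_cost are all
assumptions made up for the example.

```python
import math

def independent_vuv(p_voiced, threshold=0.5):
    """Label each frame voiced (True) or unvoiced (False) on its own."""
    return [p > threshold for p in p_voiced]

def viterbi_vuv(p_voiced, transition_cost=2.0):
    """Two-state Viterbi search over unvoiced (0) / voiced (1) labels.

    The local cost of a label is the negative log of its per-frame
    probability; changing label between adjacent frames adds
    transition_cost (a hypothetical tuning parameter).
    """
    if not p_voiced:
        return []
    eps = 1e-10

    def local(p):
        p = min(max(p, eps), 1.0 - eps)
        return (-math.log(1.0 - p), -math.log(p))

    cost = list(local(p_voiced[0]))
    back = []  # back[t][s] = best predecessor of state s at frame t + 1
    for p in p_voiced[1:]:
        loc = local(p)
        new_cost, ptr = [], []
        for s in (0, 1):
            stay = cost[s]
            switch = cost[1 - s] + transition_cost
            if stay <= switch:
                new_cost.append(stay + loc[s])
                ptr.append(s)
            else:
                new_cost.append(switch + loc[s])
                ptr.append(1 - s)
        cost = new_cost
        back.append(ptr)

    # Trace back from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [bool(s) for s in path]

def count_transitions(labels):
    """Number of voiced/unvoiced changes between adjacent frames."""
    return sum(a != b for a, b in zip(labels, labels[1:]))
```

With transition_cost set to 0 the search reduces to the independent
per-frame decision; raising it trades local evidence for fewer
voiced/unvoiced flips.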

> I agree with you.  It should also be noted that decision tree-based
> context clustering tries to purify the voiced/unvoiced frames
> associated with clusters:
And in general the decision tree clustering will tend to cluster states
that are "similar", for example in their voiced/unvoiced distribution,
since this increases the overall likelihood.  This definitely helps.
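
To make the likelihood argument concrete, here is a simplified sketch
that looks only at the voiced/unvoiced part of the likelihood, treating
the voicing of each frame in a cluster as a Bernoulli draw with one
shared probability.  The real clustering criterion also includes the
continuous F0 distributions and a stopping rule, which are left out
here; the function names and the counts are hypothetical.

```python
import math

def vuv_log_likelihood(n_voiced, n_unvoiced):
    """Maximised log-likelihood of the counts under one shared
    Bernoulli voicing probability (the maximum-likelihood estimate)."""
    n = n_voiced + n_unvoiced
    ll = 0.0
    for k in (n_voiced, n_unvoiced):
        if k > 0:
            ll += k * math.log(k / n)
    return ll

def split_gain(left, right):
    """Log-likelihood gain from splitting a cluster into two children.

    left and right are (n_voiced, n_unvoiced) frame counts.
    """
    parent = (left[0] + right[0], left[1] + right[1])
    return (vuv_log_likelihood(*left)
            + vuv_log_likelihood(*right)
            - vuv_log_likelihood(*parent))

# A split separating mostly-voiced from mostly-unvoiced frames gains a
# lot; a split whose children have the same voicing mix gains nothing.
pure_gain = split_gain((90, 10), (10, 90))
mixed_gain = split_gain((50, 50), (50, 50))
```

Under this simplification a question that separates voiced from
unvoiced frames is rewarded directly, which is one way to see why the
resulting clusters tend to be "pure" in their voicing.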

But what surprised me most of all was that, even assuming our
probability of voicing for each state is trained perfectly, we get a
decent _sequence_ of voiced/unvoiced decisions by making the decision
for each frame independently.  It seems remarkable that this works as
well as it does (and it does work well!).
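
For reference, the per-frame decision being discussed can be pictured
with the minimal sketch below.  It assumes each frame simply inherits
the voicing probability of the state it is aligned to and is
thresholded on its own; the function name, the threshold of 0.5 and the
example numbers are assumptions for illustration only.

```python
def frame_vuv_from_states(state_voicing_probs, state_durations,
                          threshold=0.5):
    """Voiced (True) / unvoiced (False) label for every frame, decided
    independently from the voicing probability of the frame's state."""
    labels = []
    for p, duration in zip(state_voicing_probs, state_durations):
        labels.extend([p > threshold] * duration)
    return labels

# Four states lasting 2, 3, 2 and 1 frames.
labels = frame_vuv_from_states([0.1, 0.95, 0.9, 0.2], [2, 3, 2, 1])
```

Under these assumptions all frames within a state get the same label,
so a change between voiced and unvoiced can only fall on a state
boundary.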

Thanks,

Matt Shannon
