[hts-users:01802] Re: F0 modelling
- Subject: [hts-users:01802] Re: F0 modelling
- From: "Javi Palenzuela" <javi.pa.cam@xxxxxxxxxxxxxx>
- Date: Wed, 26 Nov 2008 14:27:47 +0000
- Delivered-to: hts-users@xxxxxxxxxxxxxxx
Thanks Heiga for these answers!
Can you point me to where HTS gets #UV (the total occupancy count of
unvoiced frames) from the per-frame state occupancy probabilities?
Thanks again
Javi P.
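As a rough illustration of the quantity being asked about: in MSD-HMM training, an unvoiced occupancy count can be accumulated by weighting each frame's state occupancy (gamma) by the posterior probability that the frame belongs to the unvoiced (discrete) space. This is only a minimal sketch of the idea, not the actual HTS source; the function name and inputs are hypothetical.

```python
def accumulate_uv(gammas, p_unvoiced_given_frame):
    """Hypothetical sketch: sum each frame's state occupancy (gamma)
    weighted by that frame's posterior probability of being unvoiced."""
    return sum(g * p for g, p in zip(gammas, p_unvoiced_given_frame))

# Example: three frames; the second is clearly unvoiced and occupies
# this state heavily, so it dominates the accumulated #UV.
gammas = [0.9, 0.8, 0.1]
p_uv = [0.05, 0.95, 0.5]
uv_count = accumulate_uv(gammas, p_uv)  # 0.045 + 0.76 + 0.05 = 0.855
```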
2008/11/26 Matt Shannon <sms46@xxxxxxxxx>:
> Heiga Zen (Byung Ha CHUN) wrote:
>> I think this depends heavily on the performance of the F0 extraction
>> method. If the extracted F0 values are noisy and include many
>> voiced/unvoiced errors, voiced/unvoiced transitions at state boundaries
>> occur often. This clearly degrades the quality of synthesized speech.
>>
>> Some F0 extraction methods make the voiced/unvoiced decision for each
>> frame independently. However, various F0 extraction methods such as get_f0
>> use a Viterbi search and include a voiced/unvoiced transition cost, which
>> tends to reduce spurious voiced/unvoiced transitions in the extracted F0. I
>> personally think that this type of F0 extraction method is well suited to
>> HMM-based speech synthesis.
> I agree; if your original F0 extraction has too many voiced/unvoiced
> transitions, that can only make things worse at synthesis time.
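The Viterbi-with-transition-cost idea described above can be sketched as a two-state (voiced/unvoiced) dynamic program: each frame contributes a log-evidence term for being voiced or unvoiced, and a fixed penalty is paid whenever the decision flips between frames. This is a minimal illustration of the technique, not get_f0's actual implementation; the function name and the `switch_cost` value are made up.

```python
import math

def smooth_vuv(p_voiced, switch_cost=2.0):
    """Viterbi over a 2-state (unvoiced=0, voiced=1) chain: per-frame
    log evidence plus a fixed penalty for switching state between frames."""
    n = len(p_voiced)
    # log-evidence of each frame being unvoiced (index 0) or voiced (index 1)
    emit = [[math.log(1 - p + 1e-12), math.log(p + 1e-12)] for p in p_voiced]
    score = [emit[0][0], emit[0][1]]
    back = []
    for t in range(1, n):
        ptr, new = [], []
        for s in (0, 1):
            stay = score[s]                      # stay in the same state
            switch = score[1 - s] - switch_cost  # pay the transition cost
            if stay >= switch:
                ptr.append(s)
                new.append(stay + emit[t][s])
            else:
                ptr.append(1 - s)
                new.append(switch + emit[t][s])
        score = new
        back.append(ptr)
    # traceback of the best path
    s = 0 if score[0] >= score[1] else 1
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return path  # 1 = voiced, 0 = unvoiced

# A single noisy unvoiced frame inside a voiced run is smoothed away:
print(smooth_vuv([0.9, 0.9, 0.2, 0.9, 0.9]))  # [1, 1, 1, 1, 1]
```

With `switch_cost=0` the same input yields the frame-independent decision `[1, 1, 0, 1, 1]`, which shows exactly what the transition cost buys.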
>
>> I agree with you. It should also be noted that decision tree-based
>> context clustering tries to purify the voiced/unvoiced frames associated
>> with each cluster:
> And in general the decision tree clustering will tend to cluster states
> that are "similar", for example in their voiced/unvoiced distribution,
> since this increases the overall likelihood. This definitely helps.
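The likelihood argument can be made concrete for the voicing weight alone: if each cluster's voiced/unvoiced counts are modelled by a single Bernoulli parameter, a context question that separates mostly-voiced states from mostly-unvoiced states yields a large log-likelihood gain, while a question that splits them at random yields none. This is a toy illustration under that Bernoulli assumption, not HTS's actual MDL splitting criterion; all names here are hypothetical.

```python
import math

def bernoulli_ll(n_voiced, n_total):
    """Max log-likelihood of n_total frames under a single Bernoulli
    voicing weight fitted to this cluster's counts."""
    if n_total == 0 or n_voiced in (0, n_total):
        return 0.0
    p = n_voiced / n_total
    return n_voiced * math.log(p) + (n_total - n_voiced) * math.log(1 - p)

def split_gain(yes, no):
    """Log-likelihood gain of splitting a cluster by a context question;
    yes/no are (n_voiced, n_total) counts for the two child clusters."""
    parent = (yes[0] + no[0], yes[1] + no[1])
    return bernoulli_ll(*yes) + bernoulli_ll(*no) - bernoulli_ll(*parent)

# Separating mostly-voiced from mostly-unvoiced states gives a large gain;
# an uninformative split of the same parent gives zero gain.
pure_gain = split_gain((95, 100), (5, 100))
random_gain = split_gain((50, 100), (50, 100))
```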
>
> But what surprised me most of all was that, even assuming the probability of
> voicing for each state is trained perfectly, we get a decent
> _sequence_ of voiced/unvoiced decisions by making the decision for each
> frame independently. It seems remarkable that this works as well as it does
> (and it does work well!)
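For contrast with the Viterbi approach discussed earlier in the thread, the frame-independent decision amounts to nothing more than thresholding each frame's voicing probability on its own. A one-line sketch (the function name and default threshold are illustrative, not from HTS):

```python
def decide_vuv_independent(p_voiced, threshold=0.5):
    """Frame-wise V/UV decision at synthesis time: each frame is labelled
    voiced iff its voicing probability exceeds the threshold, with no
    sequence model tying neighbouring frames together."""
    return [1 if p > threshold else 0 for p in p_voiced]

decisions = decide_vuv_independent([0.9, 0.4, 0.6])  # [1, 0, 1]
```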
>
> Thanks,
>
> Matt Shannon
>
>
- References
- [hts-users:01797] F0 modelling, Javi Palenzuela
- [hts-users:01798] Re: F0 modelling, Heiga Zen (Byung Ha CHUN)
- [hts-users:01799] Re: F0 modelling, Matt Shannon
- [hts-users:01800] Re: F0 modelling, Heiga Zen (Byung Ha CHUN)
- [hts-users:01801] Re: F0 modelling, Matt Shannon