
[hts-users:03311] Re: How to improve performance of TTS without manual phoneme alignment


Hi,

I conducted several experiments, and there was little difference in
adaptation performance between average voice models trained on 3,000
sentences and on 6,000 sentences, so I assumed that was the upper
bound of this scheme. Is 6,000 sentences still too small for training
the average voice model? Besides, some papers indicate that shared
decision tree clustering is the critical technique for training a good
average voice model. As far as I know, it is not implemented in HTS.
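
To be concrete about what I mean by shared tree clustering: the split gain
for each question is computed from sufficient statistics pooled over all
speakers, so every speaker ends up with the same tree and the same state
tying. Below is a minimal, illustrative Python sketch of that core
computation only (toy code, not the HTS/HHEd implementation; the function
and variable names are mine):

import math

D = 2  # feature dimensionality (toy value)

def pooled_loglik(stats):
    """Approximate log-likelihood of one cluster modelled by a single
    diagonal-covariance Gaussian.  `stats` is a list of
    (occupancy, first_order_stats, second_order_stats) tuples, one per
    (speaker, context-dependent state)."""
    occ = sum(g for g, _, _ in stats)
    if occ <= 0.0:
        return 0.0
    mean = [sum(m[d] for _, m, _ in stats) / occ for d in range(D)]
    var = [max(sum(s[d] for _, _, s in stats) / occ - mean[d] ** 2, 1e-6)
           for d in range(D)]
    return -0.5 * occ * (D * (1.0 + math.log(2.0 * math.pi))
                         + sum(math.log(v) for v in var))

def split_gain(cluster, question):
    """Log-likelihood gain of splitting `cluster` by `question`.
    `cluster` maps (speaker, context) -> (occupancy, first_order, second_order);
    summing the statistics over speakers is what makes the tree shared."""
    yes = [st for (_spk, ctx), st in cluster.items() if question(ctx)]
    no = [st for (_spk, ctx), st in cluster.items() if not question(ctx)]
    parent = list(cluster.values())
    return pooled_loglik(yes) + pooled_loglik(no) - pooled_loglik(parent)

A split would be accepted when this gain exceeds a threshold (or an MDL
penalty), exactly as in ordinary tree-based clustering; the only difference
is that the statistics are summed over speakers instead of coming from a
single speaker's data.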

Lisa

2012/5/24 Junichi Yamagishi <jyamagis@xxxxxxxxxxxx>:
> Hi,
>
> I cannot provide you with my latest recipes yet (for various reasons), but I will give you the old scripts that I used for the EMIME project. Please find the attachment. You will be able to see the recipes (that I used at that time, around 2008) to train the average voice models under similar conditions (284 speakers x 150 sentences). This should help you improve the performance/quality of your average voice models.
>
> Please note that these are old scripts; the options may have changed in the latest HTS ver. 2.2.
>
> I think that 200 sentences per speaker would be enough, but you would need to increase the amount of training data for the average voice models (3,000 sentences are not enough to train good average voice models, in my experience).
>
> Regards,
> Junichi
>
>
>
>
>
> On 13 May 2012, at 11:05, Kwan Lisa wrote:
>
>> Maybe it was the diversity of my corpus that resulted in the poor
>> performance. My corpus consists of the voices of 300 speakers, with
>> about 200 sentences (about 10 minutes) per speaker. The corpus is not
>> labeled manually. I built an average voice model using about 3,000
>> sentences, then used the sentences of one speaker to perform
>> adaptation. At first I thought the alignment accuracy was the reason
>> the performance was bad. But I think you are right: the embedded
>> training procedure doesn't rely on the phoneme boundaries. Now I'm
>> not sure what kind of method could improve the performance of the
>> TTS system. I'm afraid the number of adaptation sentences is too
>> small; however, I only have 200 sentences for each speaker.
>>
>> 2012/5/13 huanliangwang <huanliangwang@xxxxxxx>:
>>> Hi,
>>>    Our experiments lead to the same conclusion as yours: the effect of the
>>> initial phone boundaries on the synthesis results is very limited if you
>>> use the embedded training procedure. I think the consistency between the
>>> monophone sequence and the full-context phone sequence, as well as between
>>> the training stage and the synthesis stage, may be more important for
>>> synthesis performance.
>>>
>>> Best Regards,
>>>
>>> hlwang
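
A quick way to check the mono/full-context consistency mentioned above is
to verify that the centre phone of every full-context label matches the
monophone label, line by line. A minimal sketch, assuming HTS-demo-style
per-utterance .lab files in which the current phone sits between '-' and
'+' (the script and function names are mine):

import re
import sys

def mono_phones(path):
    """Phone per line of a monophone .lab file (last field, times optional)."""
    with open(path) as f:
        return [line.split()[-1] for line in f if line.strip()]

def full_context_phones(path):
    """Centre phone per line of a full-context .lab file."""
    phones = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            label = line.split()[-1]
            m = re.search(r'-(.+?)\+', label)
            if m is None:
                raise ValueError("cannot parse full-context label: " + label)
            phones.append(m.group(1))
    return phones

if __name__ == "__main__":
    mono = mono_phones(sys.argv[1])
    full = full_context_phones(sys.argv[2])
    if mono == full:
        print("OK: %d labels match" % len(mono))
    else:
        print("MISMATCH (%d monophone vs %d full-context labels)"
              % (len(mono), len(full)))
        for i, (a, b) in enumerate(zip(mono, full)):
            if a != b:
                print("  line %d: mono=%s full=%s" % (i + 1, a, b))
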
>>>
>>>
>>>
>>> At 2012-05-13 10:05:26,"那兴宇" <nxy-yzqs@xxxxxxx> wrote:
>>>
>>> My opinion:
>>> The HTS system uses phoneme boundaries only in the initialization stage of
>>> the monophone models, which affects how the embedded training of the
>>> full-context models converges. But in my experiments, the phoneme alignment
>>> did not matter that much. I do not know your corpus size; maybe using more
>>> training data would help.
>>> --
>>> Xingyu Na (那兴宇)
>>> Beijing Institute of Technology
>>> naxy(at)bit.edu.cn
>>> asr.naxingyu(at)gmail.com
>>> naxingyu at {facebook, twitter, linkedin}
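
To illustrate the point about initialization: when no boundaries are
available at all, the monophone models can simply be flat-started, i.e.
every emitting state of every model is initialised with the global mean and
variance of the training features (the role HCompV plays in HTK/HTS), and
the embedded Baum-Welch re-estimation then refines the alignment
implicitly. A minimal sketch of that flat-start step, assuming the features
have already been loaded as NumPy arrays (the function name is mine):

import numpy as np

def flat_start(feature_matrices, phone_list, n_states=5):
    """feature_matrices: list of (T_i, D) arrays.  Returns one identical
    prototype model per phone: diagonal Gaussians set to the global
    statistics, plus a simple left-to-right transition matrix."""
    all_frames = np.concatenate(feature_matrices, axis=0)  # (sum T_i, D)
    g_mean = all_frames.mean(axis=0)
    g_var = all_frames.var(axis=0) + 1e-6                  # avoid zero variances
    # stay in the current state or move to the next one; the exit transition
    # from the last emitting state is left implicit, as in HTK's topology
    trans = 0.5 * (np.eye(n_states) + np.eye(n_states, k=1))
    proto = {"means": np.tile(g_mean, (n_states, 1)),
             "vars": np.tile(g_var, (n_states, 1)),
             "trans": trans}
    return {ph: {k: v.copy() for k, v in proto.items()} for ph in phone_list}

# e.g. flat_start([np.random.randn(300, 39) for _ in range(10)],
#                 ["sil", "a", "i", "u", "e", "o"])
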
>>>
>>>
>>> At 2012-05-13 03:29:51,"Kwan Lisa" <lisakwan1102@xxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> I'm using a corpus without manual phoneme alignment. Thus, I performed
>>>> forced alignment to get the phoneme boundary information.
>>>> However, the performance of the TTS system was not good. The TTS system
>>>> seems to be very sensitive to the accuracy of the phoneme boundary
>>>> information.
>>>> Is there any method that could improve the performance of TTS without
>>>> manual phoneme alignment?
>>>>
>>>> --
>>>> Lisa Kwan
>>>> lisakwan1102(at)gmail.com
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Lisa Kwan
>> lisakwan1102(at)gmail.com
>>
>>
>
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>



-- 
Lisa Kwan
lisakwan1102(at)gmail.com
Advanced Speech Technology Lab, ASTL

Follow-Ups
[hts-users:03312] Re: How to improve performance of TTS without manual phoneme alignment, 那兴宇
References
[hts-users:03295] How to improve performance of TTS without manual phoneme alignment, Kwan Lisa
[hts-users:03299] Re: How to improve performance of TTS without manual phoneme alignment, 那兴宇
[hts-users:03300] Re: How to improve performance of TTS without manual phoneme alignment, huanliangwang
[hts-users:03301] Re: How to improve performance of TTS without manual phoneme alignment, Kwan Lisa
[hts-users:03310] Re: How to improve performance of TTS without manual phoneme alignment, Junichi Yamagishi