[hts-users:03301] Re: How to improve performance of TTS without manual phoneme alignment
- Subject: [hts-users:03301] Re: How to improve performance of TTS without manual phoneme alignment
- From: Kwan Lisa <lisakwan1102@xxxxxxxxx>
- Date: Sun, 13 May 2012 18:05:13 +0800
- Delivered-to: hts-users@xxxxxxxxxxxxxxx
Maybe it was the diversity of my corpus that caused the poor
performance. My corpus consists of the voices of 300 speakers, with about
200 sentences (about 10 minutes) per speaker. This corpus is not
labeled manually. I built an average voice model using about 3000
sentences, then used the sentences of one speaker to perform adaptation.
At first I thought alignment accuracy was the reason the
performance was bad, but I think you are right: the embedded training
procedure doesn't rely on the phoneme boundaries. Now I'm not sure what
kind of method could improve the performance of the TTS system. I'm
afraid the number of adaptation sentences is too small; however,
I only have 200 sentences for each speaker.
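If the consistency between the monophone and full-context label sequences is the issue (as hlwang suggests below), a quick sanity check is to extract the centre phone from each full-context label and compare it against the monophone sequence. A minimal sketch in Python, assuming the standard HTS quinphone label prefix `ll^l-c+r=rr@...`; the function names are my own:

```python
import re

# A full-context HTS label begins with the quinphone  ll^l-c+r=rr@...
# The centre phone sits between "-" and "+".
CENTER_PHONE = re.compile(r"\^.*?-(.+?)\+")

def center_phone(full_label: str) -> str:
    """Extract the centre phone from one full-context label line."""
    m = CENTER_PHONE.search(full_label)
    if m is None:
        raise ValueError(f"unparsable label: {full_label!r}")
    return m.group(1)

def check_consistency(mono_phones, full_labels):
    """Return the indices where the monophone sequence and the
    centre phones of the full-context labels disagree."""
    mismatches = [
        i
        for i, (mono, full) in enumerate(zip(mono_phones, full_labels))
        if mono != center_phone(full)
    ]
    # A length mismatch is itself an inconsistency.
    if len(mono_phones) != len(full_labels):
        mismatches.append(min(len(mono_phones), len(full_labels)))
    return mismatches
```

Running this over the `.lab` files used in training (and over the labels generated at synthesis time) should reveal whether the two phone sequences ever diverge.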
2012/5/13 huanliangwang <huanliangwang@xxxxxxx>:
> Hi,
> Our experiments lead to the same conclusion as yours. The effect of the
> initial phone boundaries on the synthesis results is very limited if you use
> the embedded training procedure. I think the consistency between the monophone
> sequence and the full-context phone sequence, as well as between the training
> stage and the synthesis stage, may be more important for synthesis performance.
>
> Best Regards,
>
> hlwang
>
>
>
> At 2012-05-13 10:05:26,"那兴宇" <nxy-yzqs@xxxxxxx> wrote:
>
> My opinion:
> The HTS system uses phoneme boundaries only in the initialization stage of
> the monophone models, which affects the convergence of the embedded
> training of the full-context models. But in my experiments, phoneme
> alignment did not matter that much. I do not know your corpus size;
> maybe using more training data would help.
> --
> Xingyu Na (那兴宇)
> Beijing Institute of Technology
> naxy(at)bit.edu.cn
> asr.naxingyu(at)gmail.com
> naxingyu at {facebook, twitter, linkedin}
>
>
> At 2012-05-13 03:29:51,"Kwan Lisa" <lisakwan1102@xxxxxxxxx> wrote:
>>Hi,
>>
>>I'm using a corpus without manual phoneme alignment, so I performed
>>forced alignment to obtain the phoneme boundary information.
>>However, the performance of the TTS system was not good. The TTS system
>>seems to be very sensitive to the accuracy of the phoneme boundary
>>information.
>>Is there any method that could improve the performance of TTS without
>>manual phoneme alignment?
>>
>>--
>>Lisa Kwan
>>lisakwan1102(at)gmail.com
>>
--
Lisa Kwan
lisakwan1102(at)gmail.com
- Follow-Ups
-
- [hts-users:03310] Re: How to improve performance of TTS without manual phoneme alignment, Junichi Yamagishi
- References
-
- [hts-users:03295] How to improve performance of TTS without manual phoneme alignment, Kwan Lisa
- [hts-users:03299] Re: How to improve performance of TTS without manual phoneme alignment, 那兴宇
- [hts-users:03300] Re: How to improve performance of TTS without manual phoneme alignment, huanliangwang