
[hts-users:00205] Re: emotional speech synthesis using hts


Hi, Nicholas Volk,
Thank you very much.

Do you mean that in "Festival+HTS",
Festival works only as a text analyzer and F0 is calculated by HTS itself,
so although we can certainly write our own Festival intonation modules,
they will not work well with the F0 calculated by HTS?

Another question:
in the demo of the "HTS voice for Festival",
we can modify synthesis parameters by changing some variables.
Is that realized through the Festival intonation modules?

regards
liulei


From: "Nicholas Volk" <nvolk@xxxxxxxxxx>
Reply-To: hts-users@xxxxxxxxxxxxxxxxxxxxxxxxx
To: hts-users@xxxxxxxxxxxxxxxxxxxxxxxxx
Subject: [hts-users:00204] Re: emotional speech synthesis using hts
Date: Fri, 24 Feb 2006 10:13:10 +0200 (EET)

Hi,

CART stands for classification and regression tree,
a data structure commonly used in Festival.
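
As a very rough illustration of what such a tree does (a toy Python sketch
with invented questions and leaf values, not Festival's actual tree format,
which is stored as Scheme data):

# Toy CART-style regression tree predicting a syllable F0 target in Hz.
# The features and leaf values here are invented for illustration only.
def predict_f0_target(syl):
    if syl["stressed"]:
        if syl["phrase_initial"]:
            return 140.0   # leaf: stressed, phrase-initial syllables
        return 125.0       # leaf: other stressed syllables
    if syl["phrase_final"]:
        return 95.0        # leaf: final lowering
    return 110.0           # leaf: everything else

print(predict_f0_target({"stressed": True,
                         "phrase_initial": True,
                         "phrase_final": False}))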

The Festival speech synthesis system is highly modular, so there is
no single correct intonation module there. Also, different synthesis
techniques need different kinds of approaches. For example, in diphone
synthesis even unvoiced sections have a computational F0 value, used to
calculate the appropriate (Festival) frame size. (In voiced sections a frame
corresponds to a pitch period.)
Notice that HTS typically calculates F0 every 5 ms, while the native
Festival intonation modules typically calculate F0 only at the syllable
start, midpoint and end. (In diphone synthesis it is a very bad idea
to have two F0 values 5 ms apart. Effectively it would
put two F0 values within one pitch period, which will yield unexpected
and unwanted results.)
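
If one wanted to feed such sparse syllable-level targets into a frame-based
system like HTS, they would have to be expanded to one value per 5 ms frame.
A purely illustrative Python sketch, not code from either system:

FRAME_SHIFT = 0.005  # 5 ms, the typical HTS frame shift

def targets_to_frames(targets, total_duration):
    """Linearly interpolate sparse (time_sec, f0_hz) targets to 5 ms frames."""
    n_frames = round(total_duration / FRAME_SHIFT)
    frames = []
    for i in range(n_frames):
        t = i * FRAME_SHIFT
        # nearest targets before and after this frame
        prev = max((p for p in targets if p[0] <= t), default=targets[0])
        nxt = min((p for p in targets if p[0] >= t), default=targets[-1])
        if nxt[0] == prev[0]:
            frames.append(prev[1])
        else:
            w = (t - prev[0]) / (nxt[0] - prev[0])
            frames.append((1.0 - w) * prev[1] + w * nxt[1])
    return frames

# Start / mid / end targets of a 0.3 s syllable expanded to 60 frame values:
print(targets_to_frames([(0.0, 120.0), (0.15, 135.0), (0.3, 110.0)], 0.3))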

One can write one's own intonation module in Festival.
However, the current implementation of Festival+HTS
does not (AFAIK) support using an external F0 module.
It's probably a non-trivial, though not too hard, task to write
support for external F0 (and/or duration) modules.
It would mean that dur/f0/mcep generation would have to be more modular.
You'd have to first calculate the segment durations in hts_engine.c,
then calculate the F0 elsewhere (for example in the Tilt intonation module),
then make the voiced/unvoiced decisions for each (HTS) frame.

Essentially, a considerable rewrite of the hts_engine.c and mlpg.c
files is needed to use external F0 modules.
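
Very roughly, the control flow would have to become something like the
following (a purely illustrative Python sketch; none of these function names
exist in hts_engine.c or mlpg.c, and the "models" are trivial stand-ins):

def predict_durations(labels):
    """Stand-in for the HTS duration models: number of 5 ms frames per label."""
    return [20 for _ in labels]                        # 100 ms per segment

def external_f0_module(labels, durations):
    """Stand-in for an external (e.g. Tilt-style) intonation module."""
    track = []
    for dur in durations:
        track.extend([120.0] * dur)                    # flat 120 Hz contour
    return track

def synthesize(labels):
    durations = predict_durations(labels)              # 1. durations from HTS models
    f0_track = external_f0_module(labels, durations)   # 2. F0 from the external module
    voicing = [f0 > 0.0 for f0 in f0_track]            # 3. per-frame voiced/unvoiced decision
    # 4. mcep generation (the mlpg step) would follow, frame-aligned with f0_track
    return f0_track, voicing

f0, vuv = synthesize(["sil", "a", "sil"])
print(len(f0), vuv[:3])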

regards,
  Nicholas



> Hi, 杨鸿武
>
> I can't understand the term "CART".
> What do you mean?
>
> And do you think it is feasible to synthesize emotional speech through
> a prosody model in HTS?
>
> thank you
> liulei
> 2006.2.24
>
>>From: 杨鸿武 <yang-hw03@xxxxxxxxxxxxxxxxxxxxx>
>>Reply-To: hts-users@xxxxxxxxxxxxxxxxxxxxxxxxx
>>To: <hts-users@xxxxxxxxxxxxxxxxxxxxxxxxx>
>>Subject: [hts-users:00202] Re: [hts-users:00201] Re: emotional speech synthesis using hts
>>Date: Fri, 24 Feb 2006 09:29:57 +0800
>>
>>Hi,
>>I believe that PSOLA-based speech synthesis is very different from HTS. I
>>think the CART itself is the prosody model in HTS.
>>
>>-----Original Message-----
>>From: 刘 磊 [mailto:liulei_198216@xxxxxxxxxxx]
>>Sent: 24 February 2006 8:27
>>To: hts-users@xxxxxxxxxxxxxxxxxxxxxxxxx
>>Subject: [hts-users:00201] Re: emotional speech synthesis using hts
>>
>>Hi, Heiga ZEN
>>
>>Thanks for your help.
>>
>>I have read some papers about PSOLA,
>>and I find that speech synthesis with PSOLA needs a "text analyzer" and a
>>"prosody model".
>>The "prosody model" can predict prosody (F0, duration) from the text that
>>will be synthesized.
>>
>>We all know that HTS uses Festival as the "text analyzer";
>>does Festival have the function of predicting prosody?
>>
>>HTS uses an MSD-HMM to model F0.
>>If Festival can predict prosody (F0),
>>how do they work together?
>>
>>thank you
>>
>>liulei
>>2006.2.24
>>
>>
>> >From: "Heiga ZEN (Byung Ha CHUN)" <zen@xxxxxxxxxxxxxxxx>
>> >Reply-To: hts-users@xxxxxxxxxxxxxxxxxxxxxxxxx
>> >To: hts-users@xxxxxxxxxxxxxxxxxxxxxxxxx
>> >Subject: [hts-users:00200] Re: emotional speech synthesis using hts
>> >Date: Wed, 22 Feb 2006 23:11:47 +0900
>> >
>> >Hi,
>> >
>> >liulei_198216@xxxxxxxxxxx wrote:
>> >
>> >>I have read some papers about emotional speech synthesis, and now I
>> >>know that HTS uses "model interpolation" and "speaker adaptation"
>> >>to synthesize emotional speech and speech with various styles.
>> >
>> >Yes.
>> >
>> >>About "speaker adaptation" , it refers that  for synthesizing
>> >>speech with various styles, we must  convert speech feature
>> >>including spectrum, F0,duration.
>> >
>> >Actually we do not need to convert the speech features themselves, we
>> >need to convert the "statistics" (model parameters) of these features.
>> >
>> >>But when I want to synthesize emotional speech, is it necessary to
>> >>convert the spectrum?
>> >>Is it enough to get emotional speech by converting only F0
>> >>and duration?
>> >
>> >In my opinion, converting the spectrum will help to synthesize emotional
>> >speech.
>> >
>> >>In addition, how can I get a real-time emotional conversion, for
>> >>example from sad to happy?
>> >>Certainly we can use "model interpolation" and "speaker
>> >>adaptation", but they need time in the training part.
>> >
>> >For speaker interpolation you have to prepare a number of models
>> >using sufficient emotional speech samples.
>> >Recording speech and training HMMs may take some time.
>> >On the other hand, for speaker adaptation you only need one set of
>> >HMMs trained using neutral speech and a few emotional speech samples
>> >for adaptation.
>> >Speaker (emotion) adaptation can be done off-line, so synthesizing
>> >emotional speech does not require any additional time.
>> >You can also use adapted models for interpolation.
>> >
>> >>Does anyone know about "speech synthesis driven by emotional function"?
>> >
>> >I don't know what "emotional function" is.
>> >In the HMM-based speech synthesis system with MLLR-based speaker
>> >(emotion) adaptation, I think the linear transformation matrices for the
>> >means and variances of the HMMs can be viewed as functions that
>> >represent the relationship between neutral and emotional speech.
>> >
>> >Best regards,
>> >
>> >Heiga Zen (Byung Ha Chun)
>> >
>> >--
>> >  ------------------------------------------------
>> >   Heiga ZEN     (in Japanese pronunciation)
>> >   Byung-Ha CHUN (in Korean pronunciation)
>> >
>> >   Department of Computer Science and Engineering
>> >   Graduate School of Engineering
>> >   Nagoya Institute of Technology
>> >   Japan
>> >
>> >   e-mail: zen@xxxxxxxxxxxxxxxx
>> >      web: http://kt-lab.ics.nitech.ac.jp/~zen
>> >  ------------------------------------------------
>> >
>>
>



Follow-Ups
[hts-users:00206] Re: emotional speech synthesis using hts, Heiga ZEN (Byung Ha CHUN)
References
[hts-users:00204] Re: emotional speech synthesis using hts, Nicholas Volk