
[hts-users:03264] Re: Regarding to the adaptation part of the demo


Thank you for your advice. I tried SAT training and adaptation, and it did perform better than adaptation from the SI model.


那兴宇 <nxy-yzqs@xxxxxxx> wrote on April 20, 2012 at 10:33 AM:

SI training provides the base model and the regression-class tree for SAT training.
You do not have to run it again, though: just use the SI model (including the unseen models) and the regression-class tree you already have.
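As a minimal sketch of that advice (not part of Training.pl), the remaining work for a SAT-adapted voice can be written as a small planning helper. The function name, its two flags, and the plan strings are hypothetical; only the step numbers and the HHEd comment come from the script comments quoted in this thread.

```python
# Illustrative helper only: it does not call any HTS tool.
# Step numbers follow the Training.pl comments quoted in this thread.

def plan_sat_voice(have_si_model, have_regression_tree):
    """Return the stages still to run for a SAT-adapted voice."""
    plan = []
    if not have_si_model:
        # The basic training step (HCompV ... re-clustered HERest) gives the SI model.
        plan.append("basic training (SI model)")
    if not have_regression_tree:
        plan.append("1 HHEd (building regression-class trees for adaptation)")
    # These two stages are always needed for a SAT-adapted voice.
    plan.append("6~9 speaker adaptive training (SAT)")
    plan.append("10~13 adaptation based on the average voice model")
    return plan


if __name__ == "__main__":
    # SI model and regression-class tree already exist, so nothing is repeated.
    for stage in plan_sat_voice(have_si_model=True, have_regression_tree=True):
        print(stage)
```

With both flags set to true, the plan contains only the SAT training and SAT adaptation stages, which matches the point that the SI results are reused rather than rebuilt.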

Xingyu Na (那兴宇)
Beijing Institute of Technology
naxingyu at {facebook, twitter, linkedin}

At 2012-04-20 00:03:00,"li jay" <lij.acd@xxxxxxxxx> wrote:

Thank you for telling me this. After reading Dr. Yamagishi's paper, I realized that speaker-independent (SI) training is the conventional approach and speaker adaptive training (SAT) is the newer one. I think that is why my adapted voice was not so good.

But something still confuses me.
The process I used to train the average model (I call it the basic training step) was:
# preparing environments
# HCompV (computing variance floors)
# HInit & HRest (initialization & reestimation)

# HHEd (making a monophone mmf)
# HERest (embedded reestimation (monophone))

# HHEd (copying monophone mmf to fullcontext one)
# HERest (embedded reestimation (fullcontext))

# HHEd (tree-based context clustering)
# HERest (embedded reestimation (clustered))

# HHEd (untying the parameter sharing structure)
# HERest (embedded reestimation (untied))

# HHEd (tree-based context clustering)
# HERest (embedded reestimation (re-clustered))

# HFst (forced alignment using WFST for no-silent GV)
These steps are listed in the order the script runs them, and they work. After running them, I performed steps 1~5 to generate the SI-model adapted voice.
To generate the SAT-model adapted voice, should I still run the steps above (the basic training step) and then run the following?
1 # HHEd (building regression-class trees for adaptation), then 6~9 and 10~13

I am confused because, according to Dr. Yamagishi's paper, the training of the average voice model seems to differ from the conventional procedure. I don't know whether the script above (the basic training step) is suitable.


<nxy-yzqs@xxxxxxx> wrote on April 19, 2012 at 12:55 AM:

Steps 1~5 are adaptation based on the SI model.
Steps 6~9 are speaker adaptive training (SAT) for the average voice model.
Steps 10~13 are adaptation based on the average voice model.
For reference, please read Dr. Yamagishi's papers in the publication list on the HTS website.
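The grouping above can be sketched as a small lookup table. This is only an illustration of how the thirteen numbered steps split into three groups; the names `STEP_GROUPS` and `group_of` are hypothetical and do not appear in Training.pl.

```python
# Step ranges follow the numbering used in this thread.
# The group labels paraphrase the reply above.
STEP_GROUPS = {
    "adaptation based on the SI model": range(1, 6),               # steps 1~5
    "speaker adaptive training (SAT)": range(6, 10),               # steps 6~9
    "adaptation based on the average voice model": range(10, 14),  # steps 10~13
}


def group_of(step):
    """Return the name of the group that a numbered step belongs to."""
    for name, steps in STEP_GROUPS.items():
        if step in steps:
            return name
    raise ValueError(f"unknown step number: {step}")


if __name__ == "__main__":
    for step in (1, 6, 10):
        print(step, "->", group_of(step))
```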

At 2012-04-18 20:24:05, "li jay" <lij.acd@xxxxxxxxx> wrote:


I have a question about the adaptation part of the Training.pl script in the HTS-demo_CMU-ARCTIC-ADAPT demo.
I used sentences from several speakers to train an average model, and then used the following parts (1~5) of the code to adapt it to a specific speaker and generate voices:

1 # HHEd (building regression-class trees for adaptation)
2 # HERest (speaker adaptation (speaker independent))
3 # HERest (speaker adaptation (SI+MLLR+MAP))
4 # HMGenS (generating speech parameter sequences (speaker adapted))
5 # SPTK (synthesizing waveforms (speaker adapted))

The generated adapted voice was OK, but not very good. What are the following parts (6~9 and 10~13) for?

6 # HERest (Speaker adaptive training (SAT))
7 # HHEd (making unseen models (SAT))
8 # HMGenS (generating speech parameter sequences (SAT))
9 # SPTK (synthesizing waveforms (SAT))


10 # HERest (speaker adaptation (SAT))
11 # HERest (speaker adaptation (SAT+MLLR+MAP))
12 # HMGenS (generate speech parameter sequences (SAT+adaptation))
13 # SPTK (synthesizing waveforms (SAT+adaptation))

They all look like adaptation and voice generation. What is the difference between them (1~5, 6~9, and 10~13)?

