Hi All,
I have used the speaker adaptation demo package to implement style adaptation for the anger emotion, following the paper by Tachibana, Yamagishi, Masuko and Kobayashi: "A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features". Training succeeded and I obtained a style-adapted voice, but it does not sound as emotional as the voice I get when I build a simple HSMM-based TTS system. Furthermore, the synthesized speech contains sudden shrieks or muffled stretches in which some phones get swallowed, which makes it really unpleasant to listen to.

I think this may be due to the FRAMESHIFT and FRAMELEN I am using, as they do not match the values mentioned in the paper. Do these settings affect or disturb synthesis? Can anyone explain why muffled or shrieking sounds appear in the synthesized speech? And how can I improve the system to strengthen the style's effect (anger/joy) on natural speech?

My configuration is as follows:
./configure DATASET=uet_ur TRAINSPKR=mar_neu ADAPTSPKR=mar_ang ADAPTHEAD=b0
ALLSPKR='mar_neu mar_ang' F0_RANGES='mar_neu 70 625 mar_ang 77 635' GAMMA=0
FREQWARP=0.55 FRAMELEN=1440 FRAMESHIFT=288 SAMPFREQ=48000 TRANSKIND=mean
MGCOCCTHRESH=1000.0 LF0OCCTHRESH=150.0 ADDMAP=0 USESMAP=FALSE
--with-tcl-search-path=/usr/bin
--with-fest-search-path=/home/ammarah/Festival_SpeechTools/festival/examples
--with-sptk-search-path=/usr/local/SPTK/bin
--with-hts-search-path=/usr/local/HTS-2.2beta/bin
--with-hts-engine-search-path=/usr/local/bin
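
For reference, here is a rough sketch of how I convert these settings to milliseconds so they can be compared with the paper. It assumes FRAMELEN and FRAMESHIFT are given in samples at SAMPFREQ, which is how I understand the demo's Makefile. The helper names and the 25 ms / 5 ms targets are only illustrative and are not values taken from the paper.

```python
def samples_to_ms(n_samples: int, samp_freq_hz: int) -> float:
    """Convert a length in samples to milliseconds at the given sampling rate."""
    return 1000.0 * n_samples / samp_freq_hz


def ms_to_samples(duration_ms: float, samp_freq_hz: int) -> int:
    """Convert a duration in milliseconds to the nearest whole number of samples."""
    return round(samp_freq_hz * duration_ms / 1000.0)


# Values from my ./configure line (assumed to be in samples).
SAMPFREQ = 48000
FRAMELEN = 1440
FRAMESHIFT = 288

frame_len_ms = samples_to_ms(FRAMELEN, SAMPFREQ)
frame_shift_ms = samples_to_ms(FRAMESHIFT, SAMPFREQ)
print(f"FRAMELEN   = {FRAMELEN} samples = {frame_len_ms} ms")
print(f"FRAMESHIFT = {FRAMESHIFT} samples = {frame_shift_ms} ms")

# Hypothetical targets, for illustration only: what a 25 ms window and a
# 5 ms shift would be in samples at the same sampling rate.
target_len = ms_to_samples(25.0, SAMPFREQ)
target_shift = ms_to_samples(5.0, SAMPFREQ)
print(f"25 ms window -> FRAMELEN={target_len}, 5 ms shift -> FRAMESHIFT={target_shift}")
```

With the values above, this works out to a 30 ms window and a 6 ms shift at 48 kHz.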
Regards,
Ammarah.