
[hts-users:03126] Bug in the setting of the state durations


Hi,
We have found a bug in the algorithm that sets the state durations, both in HTS and in hts-engine.
The problem is caused by the rounding of the number of frames and the carry-over of the unassigned frames. It appears when the phone duration is given and also when the speech rate is different from one.
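To make the rounding issue concrete, here is a minimal sketch in C. It is not taken from HTS or hts-engine; the function name and the numbers are made up for illustration. It rounds each state duration independently with a floor of one frame, which is enough to make the total drift away from the requested number of frames.

```c
#include <assert.h>

/* Hypothetical helper (not part of HTS): round each real-valued state
 * duration to the nearest integer, enforce a minimum of one frame, and
 * return the total number of frames that were assigned. */
int round_independently(const double *dur, int n, int *out)
{
    int i, total = 0;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        out[i] = (int)(dur[i] + 0.5);   /* round to nearest */
        if (out[i] < 1)
            out[i] = 1;                 /* lower bound of one frame */
        total += out[i];
    }
    return total;
}
```

For example, the five durations 0.4, 0.4, 2.4, 0.4, 2.4 add up to 6 frames, but independent rounding assigns 1, 1, 2, 1, 2, that is 7 frames. The extra frame has to be taken back from some state or carried over.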
I think the main reason is that Yoshimura's equation assumes Gaussian durations, whereas in practice each state duration has a lower bound of 1 frame. We propose to modify the algorithm to assign the state durations as follows:
1) For a set of states Q, find the number of available frames based on the assigned model duration and/or the defined speaking rate.
2) Compute rho and set an initial duration for each state based on Yoshimura's equation, d[s] = mean[s] + rho*var[s], rounded to the nearest integer with a minimum of 1 frame.
3) Compute the difference between the available frames computed in step 1) and the frames assigned in step 2).
4) Find the state for which changing the assigned duration by one frame (in the direction of the difference) gives the highest log-likelihood, and add 1 frame to (or remove 1 frame from) its assigned duration.
5) Go back to step 3) until the absolute value of the difference between available and assigned frames is less than 1.
6) Carry over the remaining difference to the next block of states (note that this difference is always less than 1 frame).

I attach two files with the proposed modification: one for HTS (HGen.c:SetStateDurations) and the other for hts-engine (HTS_sstream.c:HTS_set_duration).
The one for HTS works for labels in HTK format, but it might require some modifications for other types of label formats.
For the modification of hts-engine, please note that in our internal version vari is the variance, not its inverse.
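Since the two attachments store the second-order statistic differently, it may help to write the same equation in both forms. The symbols below are my own notation, not names from the code: T is the number of available frames in the block, \mu_s and \sigma_s^2 are the mean and variance of the duration of state s, and w_s is the inverse variance.

```latex
\rho = \frac{T - \sum_{s} \mu_s}{\sum_{s} \sigma_s^2},
\qquad
d_s = \mu_s + \rho\,\sigma_s^2 = \mu_s + \frac{\rho}{w_s},
\qquad
w_s = \frac{1}{\sigma_s^2}
```

The HGen.c attachment keeps the inverse variance in ivar and computes mean + rho/ivar, while the hts-engine attachment keeps the variance in vari and computes mean + rho*vari. If your hts-engine stores the inverse variance in vari, the code has to be adapted accordingly.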

Kind regards,

-Javier Latorre

Toshiba Research Europe Limited, Cambridge, UK
/* ****************************************************************
 * Modification for HGen.c to substitute function SetStateDurations 
 ******************************************************************** */


/* SetStateDurations: set state durations */
static void SetStateDurations (GenInfo *genInfo)
{
   int i,j,k,s,m,cnt,nStates,modeldur,start=0,tframe=0;
   int block_begin=-1,block_end=-1;
   double sum,sqr,dur,rho=0.0,global_rho=0.0,diff=0.0;
   double availableFrames=0.0;
   int blockStates=0;
   int addSign,mostFlexibleModel,mostFlexibleState;
   double stateFlexibility, maxStateFlexibility,stateNewDur;
   DVector *mean, *ivar;
   Label *label;
   HLink dm;
   int l;
   if (genInfo->speakRate!=1.0 && rFlags&RNDDUR)
      HError(9999,"SetStateDurations: Cannot change speaking rate in random duration generation");
   if (genInfo->modelAlign && rFlags&RNDDUR)
      HError(9999,"SetStateDurations: Cannot use model-level alignments in random duration generation");

   /* state duration statistics storage */
   if ((mean = (DVector *) New(genInfo->genMem, genInfo->labseqlen*sizeof(DVector)))== NULL)
      HError(9905,"SetStateDurations: Cannot allocate memory for mean");
   if ((ivar = (DVector *) New(genInfo->genMem, genInfo->labseqlen*sizeof(DVector)))== NULL)
      HError(9905,"SetStateDurations: Cannot allocate memory for inverse variance");
   mean--; ivar--;

   /* prepare duration and calculate statistics to set speaking rate control parameter, rho */
   for (i=1; i<=genInfo->labseqlen; i++) {
      /* duration model for the i-th state */
      dm = genInfo->dm[i];

      /* # of states in the i-th model */
      nStates = genInfo->hmm[i]->numStates-2;
      mean[i] = CreateDVector(genInfo->genMem, nStates);
      ivar[i] = CreateDVector(genInfo->genMem, nStates);

      /* set statistics of the i-th state */
      for (s=cnt=1; s<=genInfo->dset->swidth[0]; s++) {
         for (k=1; k<=genInfo->dset->swidth[s]; k++,cnt++) {
            mean[i][cnt] = dm->svec[2].info->pdf[s].info->spdf.cpdf[1].mpdf->mean[k];
            mean[i][cnt] = (mean[i][cnt]<0.0) ? 1.0 : mean[i][cnt];   /* a negative mean can occur because a Gaussian is unbounded; fall back to one frame */
            switch(dm->svec[2].info->pdf[s].info->spdf.cpdf[1].mpdf->ckind) {
            case DIAGC:    ivar[i][cnt] = 1.0 / (double) dm->svec[2].info->pdf[s].info->spdf.cpdf[1].mpdf->cov.var[k]; break;
            case INVDIAGC: ivar[i][cnt] = (double) dm->svec[2].info->pdf[s].info->spdf.cpdf[1].mpdf->cov.var[k]; break;
            case FULLC:    ivar[i][cnt] = (double) dm->svec[2].info->pdf[s].info->spdf.cpdf[1].mpdf->cov.inv[k][k]; break;
            }
         }
      }
   }        
   block_begin = 1;
   /* set state durations of given label sequence */
   while (block_begin <= genInfo->labseqlen){     
      /* Compute block_end and availableFrames */
      sum = sqr = 0.0;      	      
      if (genInfo->modelAlign){
	/* use model-level alignment */
	label = genInfo->label[block_begin];      
	blockStates = genInfo->hmm[block_begin]->numStates-2;
	/* Compute mean and variance for the first label in the block*/
	CountDurStat(mean[block_begin], ivar[block_begin], &sum, &sqr, genInfo->sindex[block_begin]);
	if (label->start >= 0.0) {	
	  if (label->end>=0.0) {  
	    /* model-level alignment of the i-th label is fully specified.
	     * The block spans only one label
	     * */
	    block_end = block_begin+1;
	    /* the carry-over diff is added once, after the block has been delimited */
	    availableFrames = (label->end-label->start)/genInfo->frameRate;
	  }
	  else {  
	    /* only the start time of this label is specified.
	     * Find the next label with a later start time to delimit the block
	     * */
	   for (l=block_begin+1; l<=genInfo->labseqlen; l++) {
	     if (genInfo->label[l]->start > label->start) {
	       /* Next start found which is higher than the previous one => There is a block_end*/
	       block_end = l;
	       /* the carry-over diff is added once, after the block has been delimited */
	       availableFrames = (genInfo->label[l]->start-label->start)/genInfo->frameRate;
	       break;
	     }
	     CountDurStat(mean[l], ivar[l], &sum, &sqr, genInfo->sindex[l]);
	     blockStates += genInfo->hmm[l]->numStates-2;
	     /* There should not be a label->end if there was no label->start 
		if (genInfo->label[l]->end>=0.0) {
		if (genInfo->label[l]->end<label->start)
		HError(9999,"SetStateDurations: start time %f is smaller than end time %f", (double)pre_end, (double)genInfo->label[l]->end);
		rho = (((genInfo->label[l]->end-pre_end)/genInfo->frameRate)-sum)/sqr;
		break;
		}
	     */
	   }
	   if (l>genInfo->labseqlen) {
	     /* No block_end found, therefore the block runs until the end of the sentence */
	     HError(-9999,"SetStateDurations: model duration is not specified in the final label");
	     genInfo->modelAlign = FALSE;
	     block_end = genInfo->labseqlen+1;
	     availableFrames = sum*genInfo->speakRate;
	   }	    	    
	  }
	}
	else{
	  /* This block has no starting time. 
	   * Look for the next label with a start time stamp to find the block_end
	   * */
	  for (l=block_begin+1; l<=genInfo->labseqlen; l++){
	    /* found next start */
	    if (genInfo->label[l]->start >0.0){
	      /* Block_end found within the sentence */
	      block_end = l;
	      break;
	    }
	    CountDurStat(mean[l], ivar[l], &sum, &sqr, genInfo->sindex[l]);
	    blockStates += genInfo->hmm[l]->numStates-2;
	  }
	  if (l>genInfo->labseqlen) {
	    /* No block_end found => The block runs until the end of the sentence */
	    block_end = genInfo->labseqlen+1;
	    genInfo->modelAlign = FALSE;
	  }
	  availableFrames = sum*genInfo->speakRate;
	}
      }
      else{
	/* No model alignment => The number of available frames depends only on the model
	 * and the desired global speaking rate
	 */
	blockStates = 0;
	for (i=block_begin; i<=genInfo->labseqlen;i++){
	  CountDurStat(mean[i], ivar[i], &sum, &sqr, genInfo->sindex[i]);
	  blockStates += genInfo->hmm[i]->numStates-2;
	}	
	availableFrames = sum*genInfo->speakRate;      
	block_end = genInfo->labseqlen+1;	
      }      
      /* Add the leftover frames carried over from the previous block (remember that fabs(diff)<1.0)
       * and adjust so that there is at least one frame for each state in the block
       * */
      availableFrames += diff;
      availableFrames = (availableFrames < blockStates) ? 1.0*blockStates : availableFrames;      
      rho = (availableFrames-sum)/sqr;      
      /* Assign the available frames across the states of the block */
      for (i=block_begin; i<block_end; i++){
	/* First assign using default equation. 
	 *  See T. Yoshimura, et al. "Duration Modeling in HMM-based Speech Synthesis System",
	 * Proc. of ICSLP, vol.2, pp.29-32, 1998, for detail
	 * */	  
	label = genInfo->label[i];
	for (j=1; genInfo->sindex[i][j]!=0; j++) {
	  k = genInfo->sindex[i][j]-1;
	  if (rFlags&RNDDUR)
	    dur = GaussDeviate(mean[i][k],sqrt(1.0/ivar[i][k])); /* random duration sampling */
	  else
	    dur = mean[i][k]+rho/ivar[i][k];	  
         genInfo->durations[i][j] = (int)(dur+0.5);
         /* set minimum duration -> 1 */
         if (genInfo->durations[i][j]<1)
            genInfo->durations[i][j] = 1;
	 availableFrames -= genInfo->durations[i][j];
	}
      }
      /* Now assign any remaining or excess frames according to the state 'elasticity' (log-likelihood) */
      while (fabs(availableFrames)>=1.0){  
	mostFlexibleModel = -1;
	mostFlexibleState = -1;
	addSign = (availableFrames>0)? 1 : -1;
	/* find the most 'flexible' state */
	for (i = block_begin; i<block_end; i++){
	  for (j=1; genInfo->sindex[i][j]!=0; j++){
	    k = genInfo->sindex[i][j]-1;
	    stateNewDur = genInfo->durations[i][j]+addSign;
	    if (stateNewDur > 0.0) {
	      stateFlexibility = -(stateNewDur - mean[i][k])*ivar[i][k];
	      stateFlexibility *= (stateNewDur - mean[i][k]);
	      stateFlexibility += log(ivar[i][k]);
	      if (mostFlexibleState <0){
		maxStateFlexibility = stateFlexibility;
		mostFlexibleState = j;
		mostFlexibleModel = i;
	      } 
	      else 
		if (stateFlexibility > maxStateFlexibility){
		  maxStateFlexibility = stateFlexibility;
		  mostFlexibleState = j;
		  mostFlexibleModel = i;
		}
	    }
	  }
	}
	if ((mostFlexibleModel>=1)&&(mostFlexibleState>=1)){
	  /* Modify the duration of the most flexible state by one frame */
	  genInfo->durations[mostFlexibleModel][mostFlexibleState] += addSign;
	  /* modify the amount of unassigned frames */
	  availableFrames -= addSign;
	}
	else{
	  /* Although there are still unassigned frames,
	   * every state already has only one frame.
	   * Stop here and do not carry over the remaining frames.
	   * */
	  availableFrames = 0.0;
	  break;
	}
      }     
      diff = availableFrames;
      block_begin=block_end;
   }
   /* Now calc model durations and write them to the labels */
   start = 0;
   for (i=1; i<=genInfo->labseqlen; i++){
     modeldur = 0;       
     for (j=1; genInfo->sindex[i][j]!=0; j++)
       modeldur += genInfo->durations[i][j];
     tframe += modeldur;
     label = genInfo->label[i];
     label->start = (HTime)start*genInfo->frameRate;
     label->end = (HTime)(start+modeldur)*genInfo->frameRate;
     start += modeldur;
   }
   genInfo->tframe = tframe;
   /* free memory */
   Dispose(genInfo->genMem, ++mean);   
   return;
}
/* ****************************************************************
 * Modification for HTS_sstream.c to substitute function HTS_set_duration 
 ******************************************************************** */
static void HTS_set_duration(int *duration, double *mean, double *vari,
                             double *remain, int size, double frame_length)
{
   int i;
   double temp1, temp2;
   double rho = 0.0;
   int mostFlexibleState;
   double logLikelihood = 0.0;
   double temp3;
   double availableLength = 0.0;
   int add;
   /* Check that all the mean values are positive because sometimes negative duration might happen */
   for (i=0; i<size; i++)
     mean[i] = (mean[i]<0.0? 1.0 : mean[i]);
   
   if (frame_length != 0.0) {   /* if frame length is specified, rho is determined */
      temp1 = 0.0;
      temp2 = 0.0;
      for (i = 0; i < size; i++) {

         temp1 += mean[i];
         temp2 += vari[i];
      }
      availableLength = frame_length + *remain;
      if (availableLength < size)
	availableLength = size;
      rho = (availableLength - temp1) / temp2;
      temp1 = 0.0;
      for (i = 0; i < size; i++) {
	duration[i] = (int)(mean[i] + rho * vari[i] + 0.5); /* Integer duration for state with the rho adaptation */
	if (duration[i] <1){
	  duration [i] = 1;
	}
	temp1 += duration[i]; /* accumulate duration */
      }
      temp2 = availableLength - temp1;

      /* Now add or subtract the remaining integer frames, distributing them among the states */
      while (fabs(temp2)>=0.99999){ /* This should be 1.0, but that value does not work correctly with our compiler (presumably floating-point rounding) */
	add = (temp2>0)? 1:-1; /* +1 to lengthen, -1 to shorten */
	/* Find first modifiable state */
	mostFlexibleState = -1;
	for (i=0; i<size; i++)
	  if (duration[i]+add>=1){
	    mostFlexibleState = i;
	    break;
	  }
	if (mostFlexibleState<0){
	  /* No state can be modified: drop the remainder instead of reading past the arrays */
	  temp2 = 0.0;
	  break;
	}
	logLikelihood = 0.0;
	logLikelihood -= (duration[i]+add-mean[i])*(duration[i]+add-mean[i])/vari[i];
	logLikelihood -= log(vari[i]);
	/* Find the most flexible state */
	for (i = mostFlexibleState+1; i < size; i++){
	  temp3 = 0.0;
	  temp3 -= (duration[i]+add-mean[i])*(duration[i]+add-mean[i])/vari[i];
	  temp3 -= log(vari[i]);
	  if ((temp3 > logLikelihood)&&(duration[i]+add>=1)){
	    mostFlexibleState = i;
	    logLikelihood = temp3;
	  }
	}
	duration[mostFlexibleState] += add;
	temp2 -= add;
      }
      *remain = temp2;
   }
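   else{
     /* frame length not specified: use the rounded means */
     for (i = 0; i < size; i++) {
       duration[i] = (int)(mean[i] + 0.5);
       if (duration[i] < 1)   /* enforce the minimum duration of one frame */
	 duration[i] = 1;
     }
     *remain = 0.0;
   }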
   return;
}

Follow-Ups
[hts-users:03127] Re: Bug in the setting of the state durations, Keiichiro Oura