Correction in State Duration Modeling for HMM-Based Speech Synthesis

In [1], [2], we defined $\chi_{t_0,t_1}(i)$, the probability of staying at state $i$ from time $t_0$ to $t_1$ given an observation sequence $\bO = \{\bo_1, \ldots, \bo_T\}$ of length $T$, as


	\begin{equation}
		\chi_{t_0,t_1}(i)
		= (1-\gamma_{t_0-1}(i))
			\cdot\prod_{t=t_0}^{t_1}\gamma_t(i)
			\cdot(1-\gamma_{t_1+1}(i)), 
		\label{eqn:prev-chi}
	\end{equation}

where $\gamma_t(i)$ is the probability of being in state $i$ at time $t$, and we defined $\gamma_{-1}(i) = \gamma_{T+1}(i) = 0$. Based on $\chi_{t_0,t_1}(i)$, the mean $\xi(i)$ and the variance $\sigma^2(i)$ of the state duration density of state $i$ are obtained as


\begin{align}
	\xi(i)
	& = \frac{\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T
			\chi_{t_0,t_1}(i)(t_1 - t_0 + 1)}
		{\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T
			\chi_{t_0,t_1}(i)}, 
		\label{eqn:dur-mean}
\end{align}


\begin{align}
	\sigma^2(i)
	& = \frac{\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T
			\chi_{t_0,t_1}(i)(t_1 - t_0 + 1)^2}
		{\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T
			\chi_{t_0,t_1}(i)} - \xi^2(i).
		\label{eqn:dur-var}
\end{align}

However, the previous definition of $\chi_{t_0,t_1}(i)$ is statistically incorrect because it does not take the state transitions into account.
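
As an illustration of the problem (a reading aid, not part of the original derivation): the product $\prod_{t=t_0}^{t_1}\gamma_t(i)$ multiplies per-frame posteriors as if the state occupancies at different frames were independent given $\bO$. Writing $q_t$ for the state at time $t$ and $\lambda$ for the HMM parameter set, as defined below, the states of an HMM are coupled through the transition probabilities, so in general

\begin{equation}
	P(q_{t_0}=i,\ldots,q_{t_1}=i|\bO,\lambda)
	\neq \prod_{t=t_0}^{t_1}\gamma_t(i).
	\nonumber
\end{equation}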

We redefine $\chi_{t_0,t_1}(i)$ in a statistically correct manner as


\begin{eqnarray}
	\lefteqn{\chi_{t_0,t_1}(i)} \nonumber\\
	& = & P(q_{t_0-1}\neq i,q_{t_0}=i,
		\nonumber\\
	& & \qquad \ldots,q_{t_1}=i,q_{t_1+1}\neq i|\bO, \lambda)
		\nonumber\\
	& = & \frac{1}{P(\bO|\lambda)}
		\left(\sum_{j\neq i}\alpha_{t_0-1}(j)a_{ji}\right)
		\cdot a_{ii}^{t_1-t_0}
		\nonumber\\
	& & \cdot\prod_{t=t_0}^{t_1}b_i(\bo_t)
		\cdot\left(\sum_{k\neq i}a_{ik}b_k(\bo_{t_1+1})
		\beta_{t_1+1}(k)\right),
	\label{eqn:new-chi}
\end{eqnarray}

where $q_t$ denotes the state at time $t$, $\lambda$ denotes the parameter set of the HMM, $\alpha_t(j)$ and $\beta_t(j)$ denote the forward and backward variables, and $a_{ij}$ and $b_i(\bo)$ denote the state transition probability and the output probability, respectively.
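
The factorization can be read term by term if one assumes the conventional forward and backward variables, $\alpha_t(j) = P(\bo_1,\ldots,\bo_t,q_t=j|\lambda)$ and $\beta_t(k) = P(\bo_{t+1},\ldots,\bo_T|q_t=k,\lambda)$; these definitions are an assumption here, since the text does not state them. Under that assumption, $\sum_{j\neq i}\alpha_{t_0-1}(j)a_{ji}$ accounts for entering state $i$ at time $t_0$ from a different state, $a_{ii}^{t_1-t_0}$ for the $t_1-t_0$ self-transitions, $\prod_{t=t_0}^{t_1}b_i(\bo_t)$ for the observations emitted while in state $i$, and the last factor for leaving state $i$ at time $t_1+1$ and generating the rest of the sequence. One possible consistency check: every state sequence with $q_t=i$ contains exactly one maximal run of state $i$ covering time $t$, so, provided the boundary cases $t_0=1$ and $t_1=T$ are given suitable conventions (not spelled out above), the redefined $\chi_{t_0,t_1}(i)$ should satisfy

\begin{equation}
	\gamma_t(i)
	= \sum_{t_0=1}^{t}\sum_{t_1=t}^{T}\chi_{t_0,t_1}(i).
	\nonumber
\end{equation}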

References

  1. T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, ``Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,'' IEICE Trans. D-II, vol.J83-D-II, no.11, pp.2099--2107, Nov. 2000 (in Japanese).
  2. T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, and T. Kitamura, ``Duration modeling for HMM-based speech synthesis,'' Proc. ICSLP-98, vol.2, Tu3A4, pp.29--32, Nov. 1998.