State Duration Modeling for HMM-Based Speech Synthesis

Correction in State Duration Modeling for HMM-Based Speech Synthesis

In [1], [2], we defined $\chi_{t_0,t_1}(i)$, the probability of staying at state $i$ from time $t_0$ to $t_1$ given an observation sequence $\bO = \{\bo_1, \ldots, \bo_T\}$ of length $T$, as

$\begin{equation} \chi_{t_0,t_1}(i) = (1-\gamma_{t_0-1}(i)) \cdot\prod_{t=t_0}^{t_1}\gamma_t(i) \cdot(1-\gamma_{t_1+1}(i)), \label{eqn:prev-chi} \end{equation}$

where $\gamma_t(i)$ is the probability of being in state $i$ at time $t$, and we defined $\gamma_{-1}(i) = \gamma_{T+1}(i) = 0$. Based on $\chi_{t_0,t_1}(i)$, the mean $\xi(i)$ and the variance $\sigma^2(i)$ of the state duration density of state $i$ are obtained as

$\begin{align} \xi(i) & = \frac{\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T \chi_{t_0,t_1}(i)(t_1 - t_0 + 1)} {\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T \chi_{t_0,t_1}(i)}, \label{eqn:dur-mean} \end{align}$

$\begin{align} \sigma^2(i) & = \frac{\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T \chi_{t_0,t_1}(i)(t_1 - t_0 + 1)^2} {\displaystyle\sum_{t_0=1}^T\sum_{t_1=t_0}^T \chi_{t_0,t_1}(i)} - \xi^2(i). \label{eqn:dur-var} \end{align}$
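As an illustration, the previous definition of $\chi_{t_0,t_1}(i)$ and the two moment formulas above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in [1], [2]: the function names `chi_from_gamma` and `duration_moments` are hypothetical, time is indexed from 0 rather than 1, and the occupancy probability is assumed to be 0 outside the observation sequence.

```python
def chi_from_gamma(gamma, i):
    """Previous (occupancy-based) definition of chi for state i.

    gamma[t][i] is the occupancy probability of state i at time t,
    with t = 0, ..., T-1 (0-based here, unlike the 1-based text).
    Assumption: the occupancy outside the sequence is taken as 0.
    Returns a dict mapping (t0, t1) to chi_{t0,t1}(i).
    """
    T = len(gamma)

    def g(t):
        return gamma[t][i] if 0 <= t < T else 0.0

    chi = {}
    for t0 in range(T):
        stay = 1.0  # running product of gamma_t(i) for t = t0..t1
        for t1 in range(t0, T):
            stay *= g(t1)
            chi[(t0, t1)] = (1.0 - g(t0 - 1)) * stay * (1.0 - g(t1 + 1))
    return chi


def duration_moments(chi):
    """Mean xi(i) and variance sigma^2(i) of the duration, weighted by chi."""
    norm = sum(chi.values())
    mean = sum(w * (t1 - t0 + 1) for (t0, t1), w in chi.items()) / norm
    second = sum(w * (t1 - t0 + 1) ** 2 for (t0, t1), w in chi.items()) / norm
    return mean, second - mean ** 2


# Hard alignment: state 0 occupies frames 0-2, state 1 occupies frames 3-4.
gamma = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(duration_moments(chi_from_gamma(gamma, 0)))  # (3.0, 0.0)
print(duration_moments(chi_from_gamma(gamma, 1)))  # (2.0, 0.0)
```

With a hard alignment only one segment per state receives non-zero weight, so the mean equals the segment length and the variance is zero. `duration_moments` accepts any dictionary of segment weights, so it does not depend on how $\chi_{t_0,t_1}(i)$ was computed.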

However, the previous definition of $\chi_{t_0,t_1}(i)$ is statistically incorrect because the state transitions were not taken into account.

We redefine $\chi_{t_0,t_1}(i)$ in a statistically correct manner as

$\begin{eqnarray} \lefteqn{\chi_{t_0,t_1}(i)} \nonumber\\ & = & P(q_{t_0-1}\neq i,q_{t_0}=i, \nonumber\\ & & \qquad \ldots,q_{t_1}=i,q_{t_1+1}\neq i|\bO, \lambda) \nonumber\\ & = & \frac{1}{P(\bO|\lambda)} \left(\sum_{j\neq i}\alpha_{t_0-1}(j)a_{ji}\right) \cdot a_{ii}^{t_1-t_0} \nonumber\\ & & \cdot\prod_{t=t_0}^{t_1}b_i(\bo_t) \cdot\left(\sum_{k\neq i}a_{ik}b_k(\bo_{t_1+1}) \beta_{t_1+1}(k)\right), \label{eqn:new-chi} \end{eqnarray}$

where $q_t$ denotes the state at time $t$, $\lambda$ denotes the parameter set of the HMM, $\alpha_t(j)$ and $\beta_t(j)$ denote the forward and backward variables, and $a_{ij}$ and $b_i(\bo)$ denote the state transition probability and the output probability, respectively.
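The redefined $\chi_{t_0,t_1}(i)$ can be evaluated directly from the forward and backward variables. The following Python sketch shows one way to do so under several assumptions that are not stated in the text: an HMM with an initial state distribution `pi` and no explicit exit state, so that the entering term is taken as $\pi_i$ when the segment starts at the first frame and the leaving term as 1 when it ends at the last frame. It also uses 0-based time and unscaled probabilities, which would underflow for long sequences. The function names and the layout of `b` (with `b[t][i]` holding $b_i(\bo_t)$) are hypothetical.

```python
def forward_backward(pi, a, b):
    """Unscaled forward/backward pass; b[t][i] holds b_i(o_t).

    Returns (alpha, beta, P(O | lambda)).
    """
    T, N = len(b), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * b[0][i]
    for t in range(1, T):
        for i in range(N):
            alpha[t][i] = sum(alpha[t - 1][j] * a[j][i] for j in range(N)) * b[t][i]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(a[i][k] * b[t + 1][k] * beta[t + 1][k] for k in range(N))
    return alpha, beta, sum(alpha[T - 1])


def chi_from_forward_backward(pi, a, b, i):
    """Redefined chi_{t0,t1}(i): posterior that state i is entered at t0,
    occupied through t1, and left at t1 + 1.

    Boundary handling (pi at the first frame, 1 at the last) is an assumption.
    """
    alpha, beta, prob = forward_backward(pi, a, b)
    T, N = len(b), len(pi)
    others = [s for s in range(N) if s != i]
    chi = {}
    for t0 in range(T):
        # Probability mass arriving in state i at t0 from a different state.
        if t0 == 0:
            enter = pi[i]
        else:
            enter = sum(alpha[t0 - 1][j] * a[j][i] for j in others)
        stay = 1.0  # accumulates a_ii^(t1 - t0) * prod_{t=t0..t1} b_i(o_t)
        for t1 in range(t0, T):
            stay *= b[t1][i] * (a[i][i] if t1 > t0 else 1.0)
            # Probability of moving to a different state at t1 + 1 and
            # generating the remaining observations.
            if t1 == T - 1:
                leave = 1.0
            else:
                leave = sum(a[i][k] * b[t1 + 1][k] * beta[t1 + 1][k] for k in others)
            chi[(t0, t1)] = enter * stay * leave / prob
    return chi
```

One useful sanity check on such a sketch is that, for every frame $t$, summing $\chi_{t_0,t_1}(i)$ over all segments with $t_0 \le t \le t_1$ should recover the occupancy $\gamma_t(i) = \alpha_t(i)\beta_t(i)/P(\bO|\lambda)$, since each state sequence passing through state $i$ at time $t$ contains exactly one maximal run of state $i$ that covers $t$. The resulting values can then be substituted into the expressions for $\xi(i)$ and $\sigma^2(i)$.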

References

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, ``Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,'' IEICE Trans. D-II, vol.J83-D-II, no.11, pp.2099--2107, Nov. 2000 (in Japanese).
[2] T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, and T. Kitamura, ``Duration modeling for HMM-based speech synthesis,'' Proc. ICSLP-98, vol.2, Tu3A4, pp.29--32, Nov. 1998.