Tutorial - HMM/DNN-based speech synthesis system (HTS)

Tutorial on HMM-based speech synthesis

Tutorial at Interspeech 2009 (link; slides, 40MB)
"Fundamentals and recent advances in HMM-based speech synthesis"
Keiichi Tokuda (Nagoya Institute of Technology)
Heiga Zen (Toshiba Research Europe Ltd. Cambridge Research Lab.)

Introduction
Over the last ten years, the quality of speech synthesis has improved drastically with the rise of general corpus-based speech synthesis. In particular, state-of-the-art unit-selection speech synthesis can generate natural-sounding, high-quality speech. However, to construct human-like talking machines, speech synthesis systems must be able to generate speech with an arbitrary speaker's voice characteristics, various speaking styles (including native and non-native speaking styles in different languages), varying emphasis and focus, and/or emotional expressions. Such flexibility is still difficult to achieve with unit-selection synthesizers, since they need a large-scale speech corpus for each voice.

In recent years, a kind of statistical parametric speech synthesis based on hidden Markov models (HMMs) has been developed. The system has the following features:

The original speaker's voice characteristics can easily be reproduced because all speech features, including spectral, excitation, and duration parameters, are modeled in a unified HMM framework and then generated from the trained HMMs themselves.

Using a very small amount of adaptation speech data, voice characteristics can easily be modified by transforming the HMM parameters with a speaker adaptation technique used in speech recognition systems.

Because of these features, the HMM-based speech synthesis approach is expected to be useful for constructing speech synthesizers that can offer the flexibility we have in human voices.
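The adaptation idea in the second feature can be illustrated with a deliberately simplified sketch. It assumes one-dimensional state means, equal variances in every state, and a single global transform of the form mu' = a * mu + b that is fitted by least squares on a few aligned adaptation frames. This is a toy stand-in for transform-based adaptation such as MLLR, not the procedure used in HTS; the function names and the example data are hypothetical.

```python
# Toy sketch of transform-based speaker adaptation (assumptions: 1-D features,
# equal state variances, one global affine transform shared by all states).

def estimate_global_transform(state_means, aligned_frames):
    """Least-squares fit of observation ~= a * mean(state) + b.

    aligned_frames is a list of (state_name, observation) pairs and must
    cover at least two states with different means.
    """
    xs = [state_means[state] for state, _ in aligned_frames]
    ys = [obs for _, obs in aligned_frames]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def adapt_means(state_means, a, b):
    """Apply the shared transform to every state, including unseen ones."""
    return {state: a * mu + b for state, mu in state_means.items()}


if __name__ == "__main__":
    # Hypothetical average-voice model with four states.
    average_voice = {"a": 1.0, "i": 2.0, "u": 3.0, "e": 4.0}
    # Adaptation data from the target speaker covers only two of the states.
    adaptation_frames = [("a", 1.5), ("u", 3.9)]
    a, b = estimate_global_transform(average_voice, adaptation_frames)
    print(adapt_means(average_voice, a, b))
```

Because the transform is shared, the states "i" and "e" are moved toward the target speaker even though no adaptation data was observed for them, which is why a small amount of data can be enough.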

In this tutorial, the system architecture is outlined, and then the basic techniques used in the system, including algorithms for speech parameter generation from HMMs, are described with simple examples. The relation to the unit selection approach, trajectory modeling, recent improvements, and evaluation methodologies are summarized. Techniques developed for increasing the flexibility and improving the speech quality are also reviewed.
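As a rough illustration of speech parameter generation from HMMs with dynamic features, the sketch below solves the maximum-likelihood problem for a one-dimensional static feature. It assumes that the state sequence is already fixed, that each frame has a single Gaussian over [static, delta] with diagonal covariance, and that the delta is computed as 0.5 * (c[t+1] - c[t-1]) with the edges clamped. Under these assumptions the static trajectory c satisfies (W^T S^-1 W) c = W^T S^-1 mu, where W maps c to the stacked static and delta observations, mu stacks the per-frame means, and S is the diagonal covariance. All function names are hypothetical.

```python
# Minimal sketch of ML parameter generation with dynamic features (1-D case).
# Assumptions: fixed state sequence, one Gaussian per frame, static + delta
# features only, diagonal covariance.

def build_window_matrix(num_frames):
    """Return W (2T x T) mapping a static trajectory to [static, delta] rows."""
    rows = []
    for t in range(num_frames):
        static_row = [0.0] * num_frames
        static_row[t] = 1.0
        delta_row = [0.0] * num_frames
        delta_row[min(t + 1, num_frames - 1)] += 0.5  # clamp at the last frame
        delta_row[max(t - 1, 0)] -= 0.5               # clamp at the first frame
        rows.append(static_row)
        rows.append(delta_row)
    return rows


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def generate_trajectory(means, variances):
    """means, variances: per-frame [static, delta] pairs. Returns static c."""
    num_frames = len(means)
    w = build_window_matrix(num_frames)
    mu = [value for frame in means for value in frame]
    precision = [1.0 / value for frame in variances for value in frame]
    rows = range(2 * num_frames)
    # Normal equations: (W^T S^-1 W) c = W^T S^-1 mu
    lhs = [[sum(w[r][i] * precision[r] * w[r][j] for r in rows)
            for j in range(num_frames)] for i in range(num_frames)]
    rhs = [sum(w[r][i] * precision[r] * mu[r] for r in rows)
           for i in range(num_frames)]
    return solve_linear_system(lhs, rhs)


if __name__ == "__main__":
    # Two states of three frames each: static mean jumps from 0 to 1,
    # delta mean is 0 everywhere, all variances are 1.
    frame_means = [[0.0, 0.0]] * 3 + [[1.0, 0.0]] * 3
    frame_variances = [[1.0, 1.0]] * 6
    print(generate_trajectory(frame_means, frame_variances))
```

Without the delta terms the solution would simply be the step-wise sequence of static means; the delta constraints pull neighbouring frames together and give a smoother transition between the two states.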

The main target audience includes:

students who are going to work on speech synthesis

researchers interested in statistical parametric speech synthesis

developers who want to integrate an HMM-based speech synthesis module into their speech systems, e.g., spoken dialog systems or speech-to-speech translation systems

Presentation outline

Overview
Corpus-Based Speech Synthesis

Unit selection vs statistical parametric speech synthesis

Basics
Vocoding techniques
Source-filter model

LP/LSP analysis

mel-cepstral analysis

MGC analysis

Speech Parameter generation
Definition of Hidden Markov model (HMM)

Speech Parameter Generation from HMM with dynamic features

Determination of State Durations

Solution for The Problem

F0 pattern modeling
Observation of F0

MSD-HMM for F0 Modeling

Model structure
HMM topology

Context-dependent modeling

Decision tree-based context clustering

MDL criterion

Relation to the unit selection approach
Comparison between Two Approaches

Hybrid approaches
Target prediction

Unit smoothing

Mixing natural and generated units

Trajectory modeling
Derivation of trajectory HMM

Relationship between trajectory HMM & HMM-based speech synthesis

Recent improvements and evaluation
STRAIGHT

Statistical mixed excitation

HSMM

MGE training

GV-based parameter generation algorithm

Blizzard Challenge

Flexibility of the approach
Speaker adaptation (mimicking voices)
MLLR, MAP, SAT, etc

Speaker Interpolation (mixing voices)

Eigenvoices (producing voices)

Multiple-regression (controlling voices)

Multilingual speech synthesis

Singing voice synthesis

Applications
Audio-visual speech synthesis

Human motion synthesis and others

Hand-writing recognition

Small-footprint synthesizer for mobile devices

Software
SPTK, HTS, hts_engine, ARCTIC, etc.

Summary

Short Bios:
Keiichi Tokuda received the Dr.Eng. degree from Tokyo Institute of Technology in 1989. He is now the director of the Speech Processing Laboratory and a Professor in the Department of Computer Science and Engineering at Nagoya Institute of Technology. He has been an invited researcher at ATR Spoken Language Translation Research Laboratories and was a visiting researcher at Carnegie Mellon University from 2001 to 2002. He has been working on HMM-based speech synthesis since he proposed an algorithm for speech parameter generation from HMM in 1995. He is also the principal designer of the open-source software packages HTS (http://hts.sp.nitech.ac.jp/) and SPTK (http://sp-tk.sourceforge.net/). In 2005, Keiichi Tokuda and Dr. Alan Black (CMU) organized the largest ever evaluation of corpus-based speech synthesis techniques, the Blizzard Challenge, which has since become an annual event. He was a member of the Speech Technical Committee of the IEEE Signal Processing Society from 2000 to 2003, and acts as an organizer and reviewer for many major speech conferences, workshops, and journals. He has published over 60 journal papers and over 130 conference papers, and has received 5 paper awards.

Heiga Zen received the Dr.Eng. degree in computer science and engineering from Nagoya Institute of Technology in 2006. He is now a Research Engineer in the Speech Technology Group, Toshiba Research Europe Ltd. Cambridge Research Laboratory. He was an intern researcher at ATR Spoken Language Translation Research Laboratories in 2003 and an intern/co-op researcher at IBM T.J. Watson Research Center from 2004 to 2005. From April 2006 to July 2008, he was a postdoctoral research associate at Nagoya Institute of Technology. He has been working on HMM-based speech synthesis for 8 years, since joining Prof. Tokuda's research group in 2000. He was also the main maintainer of HTS, one of the principal authors of the Festival Speech Synthesis System, one of the main developers of SPTK, and one of the active contributors to the hidden Markov model toolkit (HTK). He has published 10 journal papers and over 30 conference papers, and has received 5 paper awards.
