Tutorial on HMM-based speech synthesis

Tutorial at Interspeech 2009: link, slide (40MB)
"Fundamentals and recent advances in HMM-based speech synthesis"
Keiichi Tokuda (Nagoya Institute of Technology)
Heiga Zen (Toshiba Research Europe Ltd., Cambridge Research Laboratory)

Introduction
Over the last ten years, the quality of speech synthesis has improved drastically with the rise of general corpus-based speech synthesis. In particular, state-of-the-art unit selection speech synthesis can generate natural-sounding, high-quality speech. However, to construct human-like talking machines, speech synthesis systems must be able to generate speech with arbitrary speaker voice characteristics, various speaking styles (including native and non-native speaking styles in different languages), varying emphasis and focus, and/or emotional expressions. Such flexibility is still difficult to achieve with unit-selection synthesizers, since they need a large-scale speech corpus for each voice.

In recent years, a kind of statistical parametric speech synthesis based on hidden Markov models (HMMs) has been developed. The system has the following features:

  • The original speaker's voice characteristics can easily be reproduced because all speech features, including spectral, excitation, and duration parameters, are modeled in a unified HMM framework and then generated from the trained HMMs themselves.
  • Using a very small amount of adaptation speech data, voice characteristics can easily be modified by transforming HMM parameters with a speaker adaptation technique used in speech recognition systems.

Owing to these features, the HMM-based speech synthesis approach is expected to be useful for constructing speech synthesizers that offer the flexibility found in human voices.
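
As an illustration of modifying voice characteristics by transforming HMM parameters, the following minimal sketch applies an MLLR-style affine transform (adapted mean = A * mean + b) to a set of Gaussian mean vectors. It only shows how a shared transform is applied; estimating A and b from adaptation data is not shown. The function name, the 2-dimensional means, and the transform values are hypothetical and chosen purely for illustration.

```python
def apply_mllr_mean_transform(means, a_matrix, bias):
    """Return the adapted means A * mu + b for every mean vector in `means`."""
    adapted = []
    for mu in means:
        adapted.append([
            sum(a_ij * mu_j for a_ij, mu_j in zip(row, mu)) + b_i
            for row, b_i in zip(a_matrix, bias)
        ])
    return adapted


# Hypothetical 2-dimensional means for three HMM states of the original voice.
original_means = [[1.0, 0.0], [2.0, 1.0], [0.5, -1.0]]

# Hypothetical transform shared by all three states (one regression class).
# In practice A and b would be estimated from the adaptation speech data.
a_matrix = [[1.1, 0.0], [0.0, 0.9]]
bias = [0.2, -0.1]

adapted_means = apply_mllr_mean_transform(original_means, a_matrix, bias)
```

Because one transform is shared by many Gaussians, only a few parameters need to be estimated, which is why a small amount of adaptation data can be sufficient.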

In this tutorial, the system architecture is outlined, and then the basic techniques used in the system, including algorithms for speech parameter generation from HMMs, are described with simple examples. The relation to the unit selection approach, trajectory modeling, recent improvements, and evaluation methodologies are summarized. Techniques developed for increasing flexibility and improving speech quality are also reviewed.
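
To give a flavor of speech parameter generation from HMMs with dynamic features, the sketch below solves the normal equations (W' P W) c = W' P mu for a static trajectory c, given per-frame Gaussian means and variances of static and delta features. It is a simplified illustration, not the tutorial's own code. It assumes a scalar static feature, a fixed state sequence, diagonal covariances, a delta window of 0.5 * (c[t+1] - c[t-1]), and clamped frame indices at the boundaries. All function names are hypothetical.

```python
def build_window_matrix(num_frames):
    """Build W so that W * c stacks [c_t, delta c_t] for every frame t."""
    rows = []
    for t in range(num_frames):
        static_row = [0.0] * num_frames
        static_row[t] = 1.0
        delta_row = [0.0] * num_frames
        # Assumed boundary handling: clamp indices to the valid range.
        delta_row[max(t - 1, 0)] -= 0.5
        delta_row[min(t + 1, num_frames - 1)] += 0.5
        rows.extend([static_row, delta_row])
    return rows


def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def generate_trajectory(means, variances):
    """means, variances: one [static, delta] pair per frame."""
    num_frames = len(means)
    w = build_window_matrix(num_frames)
    mu = [m for frame in means for m in frame]
    precision = [1.0 / v for frame in variances for v in frame]
    num_rows = 2 * num_frames
    # Normal equations with a diagonal precision matrix P.
    lhs = [[sum(w[r][i] * precision[r] * w[r][j] for r in range(num_rows))
            for j in range(num_frames)] for i in range(num_frames)]
    rhs = [sum(w[r][i] * precision[r] * mu[r] for r in range(num_rows))
           for i in range(num_frames)]
    return solve_linear_system(lhs, rhs)


# Two hypothetical states (static means 0 and 1), three frames each,
# with zero delta means so that the generated trajectory is smoothed.
step_means = [[0.0, 0.0]] * 3 + [[1.0, 0.0]] * 3
step_variances = [[0.1, 0.1]] * 6
trajectory = generate_trajectory(step_means, step_variances)
```

Without the delta constraints the output would simply be the piecewise-constant sequence of state means; the dynamic features are what make the generated trajectory vary smoothly across state boundaries. The dense solver is used here only for clarity, since W' P W is a band matrix that practical implementations can solve far more efficiently.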

The main target audience includes:

  • students who are going to work on speech synthesis
  • researchers interested in statistical parametric speech synthesis
  • developers who want to integrate an HMM-based speech synthesis module into their speech systems, e.g., spoken dialog systems or speech-to-speech translation systems

Presentation outline

  • Overview
    • Corpus-Based Speech Synthesis
    • Unit selection vs statistical parametric speech synthesis
  • Basics
    • Vocoding techniques
      • Source-filter model
      • LP/LSP analysis
      • mel-cepstral analysis
      • MGC analysis
    • Speech Parameter generation
      • Definition of Hidden Markov model (HMM)
      • Speech Parameter Generation from HMM with dynamic features
      • Determination of State Durations
      • Solution for The Problem
    • F0 pattern modeling
      • Observation of F0
      • MSD-HMM for F0 Modeling
    • Model structure
      • HMM topology
      • Context-dependent modeling
      • Decision tree-based context clustering
      • MDL criterion
  • Relation to the unit selection approach
    • Comparison between Two Approaches
    • Hybrid approaches
      • Target prediction
      • Unit smoothing
      • Mixing natural and generated units
  • Trajectory modeling
    • Derivation of trajectory HMM
    • Relationship between trajectory HMM & HMM-based speech synthesis
  • Recent improvements and evaluation
    • STRAIGHT
    • Statistical mixed excitation
    • HSMM
    • MGE training
    • GV-based parameter generation algorithm
    • Blizzard Challenge
  • Flexibility of the approach
    • Speaker adaptation (mimicking voices)
      • MLLR, MAP, SAT, etc.
    • Speaker Interpolation (mixing voices)
    • Eigenvoices (producing voices)
    • Multiple-regression (controlling voices)
    • Multilingual speech synthesis
    • Singing voice synthesis
  • Applications
    • Audio-visual speech synthesis
    • Human motion synthesis and others
    • Hand-writing recognition
    • Small-footprint synthesizer for mobile devices
  • Software
    • SPTK, HTS, hts_engine, ARCTIC, etc.
  • Summary

Short Bios:
Keiichi Tokuda received the Dr.Eng. degree from Tokyo Institute of Technology in 1989. He is now the director of the Speech Processing Laboratory and a Professor in the Department of Computer Science and Engineering at Nagoya Institute of Technology. He has been an invited researcher at ATR Spoken Language Translation Research Laboratories and was a visiting researcher at Carnegie Mellon University from 2001 to 2002. He has been working on HMM-based speech synthesis since he proposed an algorithm for speech parameter generation from HMM in 1995. He is also the principal designer of the open-source software packages HTS (http://hts.sp.nitech.ac.jp/) and SPTK (http://sp-tk.sourceforge.net/). In 2005, Keiichi Tokuda and Dr. Alan Black (CMU) organized the largest ever evaluation of corpus-based speech synthesis techniques, the Blizzard Challenge, which has progressed to an annual event. He was a member of the Speech Technical Committee of the IEEE Signal Processing Society from 2000 to 2003, and acts as organizer and reviewer for many major speech conferences, workshops and journals. He has published over 60 journal papers and over 130 conference papers, and has received 5 paper awards.

Heiga Zen received the Dr.Eng. degree in computer science and engineering from Nagoya Institute of Technology in 2006. He is now a Research Engineer in the Speech Technology Group, Toshiba Research Europe Ltd., Cambridge Research Laboratory. He was an intern researcher at ATR Spoken Language Translation Research Laboratories in 2003 and an intern/co-op researcher at IBM T.J. Watson Research Center from 2004 to 2005. From April 2006 to July 2008, he was a postdoctoral research associate at Nagoya Institute of Technology. He has been working on HMM-based speech synthesis for 8 years, since joining Prof. Tokuda's research group in 2000. He was also the main maintainer of HTS, one of the principal authors of the Festival Speech Synthesis System, one of the main developers of SPTK, and one of the active contributors to the hidden Markov model toolkit (HTK). He has published 10 journal papers and over 30 conference papers, and has received 5 paper awards.


Last-modified: 2009-09-14 (Mon) 00:00:00