DeepMind researchers and I have developed a new generative model for audio signals called "WaveNet". The model generates a waveform one sample at a time. When conditioned on linguistic features derived from text, it can be used for text-to-speech. It already significantly outperforms existing concatenative and parametric TTS systems. You can find the results, speech samples, and paper in DeepMind's blog post.
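To give a rough feel for what sample-by-sample generation means, here is a toy sketch in Python. It is not the WaveNet network: `toy_next_sample_probs` is a hypothetical stand-in for a trained model, the 256 amplitude levels are an assumption made for illustration, and the integer `conditioning` value is a crude placeholder for real linguistic features. Only the loop structure is the point: each new sample is drawn from a distribution that depends on all the samples drawn so far.

```python
import math
import random

NUM_LEVELS = 256  # assumed number of quantized amplitude levels (illustrative only)


def toy_next_sample_probs(history, conditioning):
    """Hypothetical stand-in for a trained model.

    Returns a categorical distribution over the next quantized sample,
    given the samples generated so far and a conditioning value.
    """
    # Centre the distribution near the previous sample, shifted by the conditioning.
    prev = history[-1] if history else NUM_LEVELS // 2
    centre = min(NUM_LEVELS - 1, max(0, prev + conditioning))
    weights = [math.exp(-0.5 * ((k - centre) / 4.0) ** 2) for k in range(NUM_LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_waveform(num_samples, conditioning=0, seed=0):
    """Draw a waveform one sample at a time (autoregressive sampling)."""
    rng = random.Random(seed)
    waveform = []
    for _ in range(num_samples):
        # The distribution for the next sample depends on everything drawn so far.
        probs = toy_next_sample_probs(waveform, conditioning)
        next_sample = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        waveform.append(next_sample)
    return waveform


if __name__ == "__main__":
    print(sample_waveform(10))
```

In a real system the toy function would be replaced by a trained network, and the conditioning would be a sequence of features derived from the input text rather than a single number.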
I believe that WaveNet is a milestone in statistical parametric speech synthesis :-)
I'm looking forward to hearing your feedback.