
[hts-users:04204] questions about energy normalization



Hi,

I notice that the recent STRAIGHT versions of the HTS demo (2.3alpha and 2.3beta) include a simple form of energy normalization, effectively scaling the waveforms in the training corpus to have constant raw energy. This seems reasonable enough. However, surrounding this change there seem to be some extra scaling steps, about which I have a few questions.

A 16-bit waveform can either be viewed as a series of floats between -1.0 and 1.0 (as MATLAB does by default) or as a series of ints between -32768 and 32767 (as the format on disk typically does). I'll define the root mean squared value (RMSV) of a waveform x as sqrt(mean(x .* x)), where x is the sequence of sample values in either representation.
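For concreteness, that definition can be written as a small sketch (using NumPy; x can be an array in either the float or the int representation):

```python
import numpy as np

def rmsv(x):
    """Root mean squared value of a waveform: sqrt(mean(x .* x))."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x * x))
```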

In data/Makefile.in each training corpus waveform is effectively scaled to have RMSV equal to 2200 before speech parameters are extracted. Training is then performed, followed by synthesis. If no additional scaling were performed at synthesis time then the outgoing waveforms would also have RMSV somewhere around 2200 (and would clip like crazy). However an extra scaling factor of 1024 / (2200 * 32768) is included on the outgoing waveform (in gen_wave in scripts/Training.pl), which results in the final synthesized waveforms having RMSV somewhere around 1024 / 32768 = 0.031.
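As I understand it, the two steps amount to something like the following sketch (the function names are mine, not from the demo scripts; the constants are the ones described above):

```python
import numpy as np

TARGET_RMSV = 2200.0                         # training-time target, int16 units
OUTPUT_SCALE = 1024.0 / (2200.0 * 32768.0)   # applied to synthesized waveforms

def rmsv(x):
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x * x))

def normalize_for_training(x):
    # scale the corpus waveform so its RMSV is 2200 before parameter extraction
    return np.asarray(x, dtype=float) * (TARGET_RMSV / rmsv(x))

def scale_synthesized(y):
    # y comes out of synthesis with RMSV around 2200; rescaling by
    # 1024 / (2200 * 32768) leaves the output at RMSV ~ 1024/32768 = 0.031
    return np.asarray(y, dtype=float) * OUTPUT_SCALE
```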

This raises a number of questions. Firstly, why is the scale factor of 2200 applied before training and then unapplied before synthesis? Is there a reason to use this at all?

Secondly, I presume the scale factor of 1024 / 32768 on the outgoing waveform is intended to make the effective bit depth around 10 bits, leaving plenty of headroom so that parts of the utterance with momentarily much larger energy do not clip? By my reckoning this is somewhere around -30 dBov. This does seem quite quiet!
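The -30 dBov figure follows directly from the scale factor, taking a full-scale signal to be one with RMSV 1.0 in float units:

```python
import math

# RMSV of the synthesized output in float ([-1.0, 1.0]) units
rmsv_out = 1024.0 / 32768.0              # = 2**-5 = 0.03125

# level relative to full scale (RMSV = 1.0)
level_db = 20.0 * math.log10(rmsv_out)   # about -30.1 dB
```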

Finally, could I kindly suggest that the variable names and comments in data/Makefile.in are very misleading? "STRMAGIC" is just a scale factor, not a "magic number", and it does not "turn off normalization". "NONORMRATE" is not a "rate" in any sense I'm familiar with, and "NONORM" suggests it turns off normalization, whereas in fact it is involved in performing normalization. It would also be helpful if the comment in scripts/Training.pl mentioned what the scale factor of 1024 / 32768 is trying to achieve; initially I presumed 1024 was mistakenly intended to be something to do with the FFT length.

Thanks,

Matt