[hts-users:04204] questions about energy normalization
Hi,
I notice that the recent STRAIGHT versions of the HTS demo (2.3alpha and
2.3beta) include a simple form of energy normalization, effectively
scaling the waveforms in the training corpus to have constant raw
energy. This seems reasonable enough. However, some extra scaling steps
seem to have been added around this change, and I have a few questions
about them.
A 16-bit waveform can either be viewed as a series of floats between
-1.0 and 1.0 (as MATLAB does by default) or as a series of ints between
-32768 and 32767 (as the format on disk typically does). I'll define
the root mean squared value (RMSV) of a waveform x as sqrt(mean(x .* x))
where x is a sequence of floats.
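
To make the definition concrete, here is a minimal Python sketch of RMSV
under both views of a 16-bit waveform. The sample values are made up for
illustration and the function name rmsv is my own, not something from the
HTS demo.

```python
import math

def rmsv(x):
    """Root mean squared value, i.e. sqrt(mean(x .* x)) in MATLAB notation."""
    return math.sqrt(sum(v * v for v in x) / len(x))

# Hypothetical 16-bit samples as they would be stored on disk (integer view).
ints = [0, 16384, -16384, 8192, -8192]

# The same samples in the float view, obtained by dividing by 32768.
floats = [v / 32768 for v in ints]

rmsv_int = rmsv(ints)
rmsv_float = rmsv(floats)

# The two views differ only by the constant factor 32768.
ratio = rmsv_int / rmsv_float
print(rmsv_int, rmsv_float, ratio)
```

So an RMSV quoted on the integer scale can be converted to the float scale
by dividing by 32768, which is the conversion the figures below rely on.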
In data/Makefile.in each training corpus waveform is effectively scaled
to have RMSV equal to 2200 before speech parameters are extracted.
Training is then performed, followed by synthesis. If no additional
scaling were performed at synthesis time, the outgoing waveforms
would also have RMSV somewhere around 2200 (and would clip like crazy).
However, an extra scaling factor of 1024 / (2200 * 32768) is applied to
the outgoing waveform (in gen_wave in scripts/Training.pl), which
results in the final synthesized waveforms having RMSV somewhere around
1024 / 32768 = 0.031.
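
The arithmetic of the two scaling steps can be checked with a short Python
sketch. This is only an illustration of the numbers above, not the actual
code in data/Makefile.in or scripts/Training.pl: the test signal is made
up, the helper names are hypothetical, and it assumes (unrealistically)
that synthesis reproduces the training level exactly.

```python
import math

def rmsv(x):
    """Root mean squared value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def scale_to_rmsv(x, target):
    """Scale x by a single gain so that its RMSV equals target."""
    gain = target / rmsv(x)
    return [gain * v for v in x]

NORM_RMSV = 2200.0                          # training-time target from the text
SYNTH_SCALE = 1024.0 / (2200.0 * 32768.0)   # synthesis-time factor from the text

# Made-up input: a sine wave at an arbitrary level.
raw = [1000.0 * math.sin(2 * math.pi * k / 50) for k in range(500)]

# Step 1: normalization before parameter extraction.
normalized = scale_to_rmsv(raw, NORM_RMSV)

# Step 2: idealized synthesis output at the training level, then the extra scaling.
synthesized = [SYNTH_SCALE * v for v in normalized]

final_rmsv = rmsv(synthesized)
print(rmsv(normalized), final_rmsv)  # the 2200 cancels, leaving 1024 / 32768
```

The factor of 2200 cancels out, so in this idealized picture the final level
depends only on 1024 / 32768.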
This raises a number of questions. Firstly, why is the scale factor of
2200 applied before training and then unapplied before synthesis? Is
there a reason to use this at all?
Secondly, I presume the scale factor of 1024 / 32768 on the outgoing
waveform is intended to make the effective bit depth around 10 bits,
leaving plenty of headroom so that parts of the utterance with
momentarily much larger energy do not clip. By my reckoning this puts
the RMS level somewhere around -30 dBov, which does seem quite quiet!
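
For reference, this is how I arrive at the dBov figure, as a small Python
sketch. It assumes the float view of the waveform, with full scale at 1.0,
and takes the level in dBov to be 20 * log10 of the RMS value relative to
full scale. The variable names are my own.

```python
import math

# RMS level of the synthesized waveform on the float scale (full scale = 1.0).
rms_level = 1024 / 32768

# Level relative to the overload point, in dB.
level_dbov = 20 * math.log10(rms_level)

# How far a momentary peak may exceed the RMS level before clipping.
headroom_factor = 1.0 / rms_level
headroom_db = 20 * math.log10(headroom_factor)

print(round(level_dbov, 1), headroom_factor, round(headroom_db, 1))
```

In other words, a sample would have to be 32 times the RMS level before it
clipped.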
Finally, could I kindly suggest that the variable names and comments in
data/Makefile.in are very misleading? "STRMAGIC" is just a scale
factor, not a "magic number", and it does not "turn off normalization".
"NONORMRATE" is not a "rate" in any sense I'm familiar with and
"NONORM" suggests it is turning off normalization, whereas in fact it is
involved in performing normalization. It would also be helpful if the
comment in scripts/Training.pl mentioned what the scale factor of 1024 /
32768 was trying to achieve. Initially I presumed 1024 was mistakenly
intended to be something to do with the FFT length.
Thanks,
Matt