
[hts-users:04222] Re: questions about energy normalization



Hi Keiichiro,

Thanks very much for getting back to me.  The new code looks great.

I now understand that the original code was attempting to cancel out the effects of energy normalization performed by STRAIGHT (though this is unnecessary when calling exstraightspec) rather than trying to do its own energy normalization.

Thanks,

Matt


On 03/02/15 10:16, Keiichiro Oura wrote:
Hi,

Thank you for your report.
It seems that my understanding was not correct.
The patch for the bug is attached.

Regards,
Keiichiro Oura




2015-01-26 23:25 GMT+09:00 Matt Shannon <sms46@cam.ac.uk>:
Hi,

I notice that the recent STRAIGHT versions of the HTS demo (2.3alpha and
2.3beta) include a simple form of energy normalization, effectively scaling
the waveforms in the training corpus to have constant raw energy.  This
seems reasonable enough.  However, surrounding this change there seem to be
some extra scaling steps that have been added that I have a few questions
about.

A 16-bit waveform can either be viewed as a series of floats between -1.0
and 1.0 (as MATLAB does by default) or as a series of ints between -32768
and 32767 (as the format on disk typically does).  I'll define the root mean
squared value (RMSV) of a waveform x as sqrt(mean(x .* x)) where x is a
sequence of floats.
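
As a rough illustration of the definition above, here is a minimal Python sketch of RMSV applied to both views of the same samples.  The sample values and the divisor of 32768 used to convert ints to floats are assumptions chosen for illustration, not something taken from the HTS demo.

```python
import math

def rmsv(x):
    """Root mean squared value, i.e. sqrt(mean(x .* x))."""
    return math.sqrt(sum(v * v for v in x) / len(x))

# The same four (made-up) samples in the two views of a 16-bit waveform.
ints = [16384, -16384, 8192, -8192]
floats = [v / 32768.0 for v in ints]  # assumed int-to-float convention

rmsv_int = rmsv(ints)
rmsv_float = rmsv(floats)

# The two views differ only by the constant factor 32768.
print(rmsv_int, rmsv_float, rmsv_int / rmsv_float)
```

Under that convention an RMSV quoted in int units is simply 32768 times the RMSV in float units, which is why it matters which view a given number refers to.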

In data/Makefile.in each training corpus waveform is effectively scaled to
have RMSV equal to 2200 before speech parameters are extracted. Training is
then performed, followed by synthesis.  If no additional scaling was
performed at synthesis time then the outgoing waveforms would also have RMSV
somewhere around 2200 (and would clip like crazy). However an extra scaling
factor of 1024 / (2200 * 32768) is included on the outgoing waveform (in
gen_wave in scripts/Training.pl), which results in the final synthesized
waveforms having RMSV somewhere around 1024 / 32768 = 0.031.
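
The arithmetic in this paragraph can be checked with a short Python sketch.  The names normalize, TARGET_RMSV and OUTPUT_SCALE are hypothetical and do not appear in the HTS demo, the test signal is made up, and the sketch assumes (as the paragraph does) that synthesis roughly preserves the RMSV of the training data, so the normalized waveform stands in for the synthesized one.

```python
import math

def rmsv(x):
    """Root mean squared value, i.e. sqrt(mean(x .* x))."""
    return math.sqrt(sum(v * v for v in x) / len(x))

TARGET_RMSV = 2200.0                       # scaling applied before training
OUTPUT_SCALE = 1024.0 / (2200.0 * 32768.0)  # extra factor applied in gen_wave

def normalize(x, target=TARGET_RMSV):
    """Scale x so that its RMSV equals target (hypothetical helper)."""
    gain = target / rmsv(x)
    return [gain * v for v in x]

# A made-up test signal: a sine wave with arbitrary amplitude.
raw = [300.0 * math.sin(2.0 * math.pi * k / 50.0) for k in range(1000)]

train = normalize(raw)                    # RMSV is now 2200
synth = [OUTPUT_SCALE * v for v in train]  # stand-in for the synthesized output

print(rmsv(train))  # approximately 2200
print(rmsv(synth))  # approximately 1024 / 32768 = 0.03125
```

The factor of 2200 cancels exactly, so the only net effect on the output level is the 1024 / 32768 term.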

This raises a number of questions.  Firstly, why is the scale factor of 2200
applied before training and then unapplied before synthesis?  Is there a
reason to use this at all?

Secondly, I presume the scale factor of 1024 / 32768 on the outgoing
waveform is intended to make the effective bit depth around 10 bits to allow
plenty of head room so that parts of the utterance with momentarily much
larger energy do not clip?  By my reckoning this is somewhere around -30
dBov.  This does seem quite quiet!
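
For reference, the level estimate above can be reproduced in a few lines of Python.  This sketch assumes the usual dBov convention in which an RMS value of 1.0 in float units (a full-scale square wave) corresponds to 0 dBov; the variable names are illustrative only.

```python
import math

level = 1024.0 / 32768.0              # output RMSV in float units
level_dbov = 20.0 * math.log10(level)  # RMS level relative to full scale

effective_bits = math.log2(1024.0)          # bits spanned by an amplitude of 1024
headroom_bits = math.log2(32768.0 / 1024.0)  # bits left above that level

print(round(level_dbov, 1))  # -30.1
print(effective_bits, headroom_bits)  # 10.0 5.0
```

Each bit of headroom is worth about 6 dB, so the 5 spare bits account for the roughly 30 dB gap below full scale.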

Finally, could I kindly suggest that the variable name and comments in
data/Makefile.in are very misleading?  "STRMAGIC" is just a scale factor,
not a "magic number", and it does not "turn off normalization".
"NONORMRATE" is not a "rate" in any sense I'm familiar with and "NONORM"
suggests it is turning off normalization, whereas in fact it is involved in
performing normalization.  It would also be helpful if the comment in
scripts/Training.pl mentioned what the scale factor of 1024 / 32768 was
trying to achieve.  Initially I presumed 1024 was mistakenly intended to be
something to do with the FFT length.

Thanks,

Matt