
[hts-users:04220] Re: questions about energy normalization



Hi,

Thank you for your report.
It seems that my understanding was not correct.
The patch for the bug is attached.

Regards,
Keiichiro Oura




2015-01-26 23:25 GMT+09:00 Matt Shannon <sms46@cam.ac.uk>:
> Hi,
>
> I notice that the recent STRAIGHT versions of the HTS demo (2.3alpha and
> 2.3beta) include a simple form of energy normalization, effectively scaling
> the waveforms in the training corpus to have constant raw energy.  This
> seems reasonable enough.  However, surrounding this change there seem to be
> some extra scaling steps that I have a few questions about.
>
> A 16-bit waveform can either be viewed as a series of floats between -1.0
> and 1.0 (as MATLAB does by default) or as a series of ints between -32768
> and 32767 (as the format on disk typically does).  I'll define the root mean
> squared value (RMSV) of a waveform x as sqrt(mean(x .* x)) where x is a
> sequence of floats.
>
> In data/Makefile.in each training corpus waveform is effectively scaled to
> have RMSV equal to 2200 before speech parameters are extracted. Training is
> then performed, followed by synthesis.  If no additional scaling were
> performed at synthesis time then the outgoing waveforms would also have RMSV
> somewhere around 2200 (and would clip like crazy). However, an extra scaling
> factor of 1024 / (2200 * 32768) is included on the outgoing waveform (in
> gen_wave in scripts/Training.pl), which results in the final synthesized
> waveforms having RMSV somewhere around 1024 / 32768 = 0.031.
>
> This raises a number of questions.  Firstly, why is the scale factor of 2200
> applied before training and then unapplied before synthesis?  Is there a
> reason to use this at all?
>
> Secondly, I presume the scale factor of 1024 / 32768 on the outgoing
> waveform is intended to make the effective bit depth around 10 bits to allow
> plenty of head room so that parts of the utterance with momentarily much
> larger energy do not clip?  By my reckoning this is somewhere around -30
> dBov.  This does seem quite quiet!
>
> Finally, could I kindly suggest that the variable names and comments in
> data/Makefile.in are very misleading?  "STRMAGIC" is just a scale factor,
> not a "magic number", and it does not "turn off normalization".
> "NONORMRATE" is not a "rate" in any sense I'm familiar with and "NONORM"
> suggests it is turning off normalization, whereas in fact it is involved in
> performing normalization.  It would also be helpful if the comment in
> scripts/Training.pl mentioned what the scale factor of 1024 / 32768 was
> trying to achieve.  Initially I presumed 1024 was mistakenly intended to be
> something to do with the FFT length.
>
> Thanks,
>
> Matt
>
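
The scaling arithmetic described in the quoted message can be checked with a
small sketch. This is only an illustration under stated assumptions, not the
code from data/Makefile.in or scripts/Training.pl: the helper names rmsv and
scale_to_rmsv and the toy sine waveform are hypothetical, and only the
constants 2200, 1024 and 32768 are taken from the message.

```python
import math


def rmsv(x):
    """Root mean squared value, i.e. sqrt(mean(x .* x))."""
    return math.sqrt(sum(s * s for s in x) / len(x))


def scale_to_rmsv(x, target):
    """Scale a waveform so that its RMSV equals `target` (hypothetical helper)."""
    gain = target / rmsv(x)
    return [s * gain for s in x]


# Constants quoted in the message above.
TRAIN_RMSV = 2200.0                        # training-side target RMSV
OUT_SCALE = 1024.0 / (2200.0 * 32768.0)    # factor applied to synthesized output

# Toy waveform (assumption): five cycles of a sine in arbitrary sample units.
wave = [1000.0 * math.sin(2.0 * math.pi * 5.0 * n / 400.0) for n in range(400)]

train = scale_to_rmsv(wave, TRAIN_RMSV)    # what training would see
out = [s * OUT_SCALE for s in train]       # idealized synthesis output

out_rmsv = rmsv(out)                       # about 1024 / 32768
level_db = 20.0 * math.log10(out_rmsv)     # level relative to a full scale of 1.0

print(round(rmsv(train), 3), round(out_rmsv, 5), round(level_db, 1))
```

With these assumptions the output RMSV comes out at 1024 / 32768 and the level
at roughly -30 dB relative to full scale, which matches the figures in the
message. A real synthesized waveform would only approximate the training RMSV,
so the result there would be "somewhere around" these values rather than exact.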

Attachment: tmp.patch
Description: Binary data