Hi,

Thank you for your report.
It seems that my understanding was not correct.
The patch for the bug is attached.

Regards,
Keiichiro Oura

2015-01-26 23:25 GMT+09:00 Matt Shannon <sms46@cam.ac.uk>:
> Hi,
>
> I notice that the recent STRAIGHT versions of the HTS demo (2.3alpha and
> 2.3beta) include a simple form of energy normalization, effectively scaling
> the waveforms in the training corpus to have constant raw energy. This
> seems reasonable enough. However, surrounding this change there seem to be
> some extra scaling steps that have been added that I have a few questions
> about.
>
> A 16-bit waveform can either be viewed as a series of floats between -1.0
> and 1.0 (as MATLAB does by default) or as a series of ints between -32768
> and 32767 (as the format on disk typically does). I'll define the root mean
> squared value (RMSV) of a waveform x as sqrt(mean(x .* x)) where x is a
> sequence of floats.
>
> In data/Makefile.in each training corpus waveform is effectively scaled to
> have RMSV equal to 2200 before speech parameters are extracted. Training is
> then performed, followed by synthesis. If no additional scaling was
> performed at synthesis time then the outgoing waveforms would also have RMSV
> somewhere around 2200 (and would clip like crazy). However, an extra scaling
> factor of 1024 / (2200 * 32768) is included on the outgoing waveform (in
> gen_wave in scripts/Training.pl), which results in the final synthesized
> waveforms having RMSV somewhere around 1024 / 32768 = 0.031.
>
> This raises a number of questions. Firstly, why is the scale factor of 2200
> applied before training and then unapplied before synthesis? Is there a
> reason to use this at all?
>
> Secondly, I presume the scale factor of 1024 / 32768 on the outgoing
> waveform is intended to make the effective bit depth around 10 bits to allow
> plenty of head room so that parts of the utterance with momentarily much
> larger energy do not clip? By my reckoning this is somewhere around -30
> dBov. This does seem quite quiet!
>
> Finally, could I kindly suggest that the variable names and comments in
> data/Makefile.in are very misleading? "STRMAGIC" is just a scale factor,
> not a "magic number", and it does not "turn off normalization".
> "NONORMRATE" is not a "rate" in any sense I'm familiar with and "NONORM"
> suggests it is turning off normalization, whereas in fact it is involved in
> performing normalization. It would also be helpful if the comment in
> scripts/Training.pl mentioned what the scale factor of 1024 / 32768 was
> trying to achieve. Initially I presumed 1024 was mistakenly intended to be
> something to do with the FFT length.
>
> Thanks,
>
> Matt
Attachment: tmp.patch (binary data)
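
For readers who want to check the arithmetic in the quoted message, here is a minimal Python sketch of the two scaling steps it describes. It is only an illustration, not the code from data/Makefile.in or scripts/Training.pl. The helper names (rmsv, scale_to_rmsv) and the 440 Hz test tone are made up for the example, and the level in dB is computed on the assumption that full scale 1.0 is the 0 dB reference.

```python
import math


def rmsv(x):
    """Root mean squared value, i.e. sqrt(mean(x .* x))."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_to_rmsv(x, target):
    """Return a copy of x scaled so that its RMSV equals target."""
    gain = target / rmsv(x)
    return [v * gain for v in x]


# Hypothetical test signal: one second of a 440 Hz sine sampled at 16 kHz.
SAMPLE_RATE = 16000
wave = [0.1 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]

# Training-side step described in the message: RMSV becomes 2200.
train = scale_to_rmsv(wave, 2200.0)

# Synthesis-side factor quoted in the message for gen_wave.
out_gain = 1024.0 / (2200.0 * 32768.0)
out = [v * out_gain for v in train]

out_rmsv = rmsv(out)
# Level relative to full scale 1.0 (assumed reference for this sketch).
level_db = 20.0 * math.log10(out_rmsv)

print(round(out_rmsv, 5), round(level_db, 1))
```

The 2200 cancels between the two steps, so the output RMSV depends only on 1024 / 32768, which is where the roughly -30 dB figure in the message comes from.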