
[hts-users:04204] questions about energy normalization



Hi,

I notice that the recent STRAIGHT versions of the HTS demo (2.3alpha and 2.3beta) include a simple form of energy normalization, effectively scaling the waveforms in the training corpus to have constant raw energy. This seems reasonable enough. However, surrounding this change there seem to be some extra scaling steps, about which I have a few questions.

A 16-bit waveform can either be viewed as a series of floats between -1.0 and 1.0 (as MATLAB does by default) or as a series of ints between -32768 and 32767 (as the format on disk typically does). I'll define the root mean squared value (RMSV) of a waveform x as sqrt(mean(x .* x)), where x is the sequence of sample values in either representation.
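For concreteness, that definition can be written as a small sketch (using NumPy; x can be an array in either the float or the int representation):

```python
import numpy as np

def rmsv(x):
    """Root mean squared value of a waveform: sqrt(mean(x .* x))."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x * x))
```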

In data/Makefile.in each training corpus waveform is effectively scaled to have RMSV equal to 2200 before speech parameters are extracted. Training is then performed, followed by synthesis. If no additional scaling were performed at synthesis time then the outgoing waveforms would also have RMSV somewhere around 2200 (and would clip like crazy). However an extra scaling factor of 1024 / (2200 * 32768) is included on the outgoing waveform (in gen_wave in scripts/Training.pl), which results in the final synthesized waveforms having RMSV somewhere around 1024 / 32768 = 0.031.
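As I understand it, the two steps amount to something like the following sketch (the function names are mine, not from the demo scripts; the constants are the ones described above):

```python
import numpy as np

TARGET_RMSV = 2200.0                         # training-time target, int16 units
OUTPUT_SCALE = 1024.0 / (2200.0 * 32768.0)   # applied to synthesized waveforms

def rmsv(x):
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean(x * x))

def normalize_for_training(x):
    # scale the corpus waveform so its RMSV is 2200 before parameter extraction
    return np.asarray(x, dtype=float) * (TARGET_RMSV / rmsv(x))

def scale_synthesized(y):
    # y comes out of synthesis with RMSV around 2200; rescaling by
    # 1024 / (2200 * 32768) leaves the output at RMSV ~ 1024/32768 = 0.031
    return np.asarray(y, dtype=float) * OUTPUT_SCALE
```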

This raises a number of questions. Firstly, why is the scale factor of 2200 applied before training and then unapplied before synthesis? Is there a reason to use this at all?

Secondly, I presume the scale factor of 1024 / 32768 on the outgoing waveform is intended to make the effective bit depth around 10 bits, leaving plenty of headroom so that parts of the utterance with momentarily much larger energy do not clip? By my reckoning this is somewhere around -30 dBov. This does seem quite quiet!
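The -30 dBov figure follows directly from the scale factor, taking a full-scale signal to be one with RMSV 1.0 in float units:

```python
import math

# RMSV of the synthesized output in float ([-1.0, 1.0]) units
rmsv_out = 1024.0 / 32768.0              # = 2**-5 = 0.03125

# level relative to full scale (RMSV = 1.0)
level_db = 20.0 * math.log10(rmsv_out)   # about -30.1 dB
```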

Finally, could I kindly suggest that the variable names and comments in data/Makefile.in are very misleading? "STRMAGIC" is just a scale factor, not a "magic number", and it does not "turn off normalization". "NONORMRATE" is not a "rate" in any sense I'm familiar with, and "NONORM" suggests it turns off normalization, whereas in fact it is involved in performing normalization. It would also be helpful if the comment in scripts/Training.pl mentioned what the scale factor of 1024 / 32768 is trying to achieve; initially I presumed 1024 was mistakenly intended to be something to do with the FFT length.

Thanks,

Matt