
[hts-users:04220] Re: questions about energy normalization

To: hts-users <hts-users@sp.nitech.ac.jp>
Subject: [hts-users:04220] Re: questions about energy normalization
From: Keiichiro Oura <uratec@sp.nitech.ac.jp>
Date: Tue, 3 Feb 2015 19:16:04 +0900
Cc: Keiichiro Oura <uratec@nitech.ac.jp>
In-reply-to: <54C64E47.6010104@cam.ac.uk>
References: <54C64E47.6010104@cam.ac.uk>
Reply-to: hts-users@sp.nitech.ac.jp
Sender: ura228@gmail.com

Hi,

Thank you for your report.
It seems that my understanding was not correct.
The patch for the bug is attached.

Regards,
Keiichiro Oura




2015-01-26 23:25 GMT+09:00 Matt Shannon <sms46@cam.ac.uk>:
> Hi,
>
> I notice that the recent STRAIGHT versions of the HTS demo (2.3alpha and
> 2.3beta) include a simple form of energy normalization, effectively scaling
> the waveforms in the training corpus to have constant raw energy.  This
> seems reasonable enough.  However surrounding this change there seems to be
> some extra scaling steps that have been added that I have a few questions
> about.
>
> A 16-bit waveform can either be viewed as a series of floats between -1.0
> and 1.0 (as MATLAB does by default) or as a series of ints between -32768
> and 32767 (as the format on disk typically does).  I'll define the root mean
> squared value (RMSV) of a waveform x as sqrt(mean(x .* x)) where x is a
> sequence of floats.
>
> In data/Makefile.in each training corpus waveform is effectively scaled to
> have RMSV equal to 2200 before speech parameters are extracted. Training is
> then performed, followed by synthesis.  If no additional scaling was
> performed at synthesis time then the outgoing waveforms would also have RMSV
> somewhere around 2200 (and would clip like crazy). However an extra scaling
> factor of 1024 / (2200 * 32768) is included on the outgoing waveform (in
> gen_wave in scripts/Training.pl), which results in the final synthesized
> waveforms having RMSV somewhere around 1024 / 32768 = 0.031.
>
> This raises a number of questions.  Firstly, why is the scale factor of 2200
> applied before training and then unapplied before synthesis?  Is there a
> reason to use this at all?
>
> Secondly, I presume the scale factor of 1024 / 32768 on the outgoing
> waveform is intended to make the effective bit depth around 10 bits to allow
> plenty of head room so that parts of the utterance with momentarily much
> larger energy do not clip?  By my reckoning this is somewhere around -30
> dBov.  This does seem quite quiet!
>
> Finally, could I kindly suggest that the variable name and comments in
> data/Makefile.in are very misleading?  "STRMAGIC" is just a scale factor,
> not a "magic number", and it does not "turn off normalization".
> "NONORMRATE" is not a "rate" in any sense I'm familiar with and "NONORM"
> suggests it is turning off normalization, whereas in fact it is involved in
> performing normalization.  It would also be helpful if the comment in
> scripts/Training.pl mentioned what the scale factor of 1024 / 32768 was
> trying to achieve.  Initially I presumed 1024 was mistakenly intended to be
> something to do with the FFT length.
>
> Thanks,
>
> Matt
>
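
The arithmetic in the quoted message can be checked with a short Python sketch. This is only an illustration, not code from the HTS demo: the sine test signal, the sample rate, and the helper names rms and scale_to_rms are hypothetical, and it assumes the synthesized waveform comes out with the same RMSV as the normalized training data, which real synthesis will only approximate.

```python
import math

INT16_FULL_SCALE = 32768  # magnitude of the most negative 16-bit sample


def rms(samples):
    """Root mean squared value: sqrt(mean(x * x))."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_to_rms(samples, target_rms):
    """Scale a waveform so that its RMS value equals target_rms."""
    gain = target_rms / rms(samples)
    return [s * gain for s in samples]


# Hypothetical test signal: one second of a 440 Hz sine at 16 kHz,
# expressed on the integer scale.
sample_rate = 16000
wave = [10000.0 * math.sin(2 * math.pi * 440 * n / sample_rate)
        for n in range(sample_rate)]

# Training side: normalize to RMSV 2200 on the integer scale.
train_wave = scale_to_rms(wave, 2200.0)

# Synthesis side: apply the factor 1024 / (2200 * 32768) from the message.
synth_gain = 1024.0 / (2200.0 * INT16_FULL_SCALE)
out_wave = [s * synth_gain for s in train_wave]

out_rms = rms(out_wave)                # about 1024 / 32768 = 0.03125
level_db = 20.0 * math.log10(out_rms)  # level relative to a full scale of 1.0
print(f"{out_rms:.5f} {level_db:.1f}")
```

Under these assumptions the output RMSV is 1024 / 32768 = 0.03125 on the float scale, and 20 * log10(0.03125) is about -30.1 dB relative to full scale, which matches the "around -30 dBov" figure in the quoted message.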

Attachment: tmp.patch
Description: Binary data

Follow-Ups:
- [hts-users:04222] Re: questions about energy normalization
  - From: Matt Shannon <sms46@cam.ac.uk>

References:
- [hts-users:04204] questions about energy normalization
  - From: Matt Shannon <sms46@cam.ac.uk>
