The MDL criterion is an effective way to select
the optimal probabilistic model from among various candidate
models. When used for decision tree clustering in HMM-based TTS,
splits are still evaluated by the ML criterion, while the MDL
penalty term discourages overly large trees.
Suppose we are given a sequence of N data points
x = {x1, . . . , xN }. As an estimation problem, we could say
that we are looking for the model that generated this data. In
other words, we try to estimate a vector of parameters
θ = [θ1, . . . , θL] of a statistical model Pθ(x) for the data
x. To do so, the MDL criterion selects the statistical model
with the minimum description length for the given data. The
description length Dj(x) for data x under a probabilistic model
j is given by

    Dj(x) = −log Pθˆ(j)(x) + (Lj / 2) log N,

where:
• θˆ(j) is the ML estimate of the parameter vector θ for
model j.
• Lj is the number of free parameters in θˆ(j) for
probabilistic model j.
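As a minimal sketch (not from the source; the function name is illustrative), the description length above can be computed directly for a single Gaussian model, whose ML estimates are the sample mean and the biased sample variance, giving Lj = 2:

```python
import math

def gaussian_mdl(x):
    """Description length Dj(x) = -log P_theta_hat(x) + (Lj/2) log N
    for a single Gaussian whose mean and variance are ML estimates (Lj = 2)."""
    N = len(x)
    mu = sum(x) / N                          # ML estimate of the mean
    var = sum((v - mu) ** 2 for v in x) / N  # ML (biased) estimate of the variance
    # First term: negative log-likelihood at the ML estimate (natural log)
    nll = 0.5 * N * math.log(2 * math.pi * var) \
        + sum((v - mu) ** 2 for v in x) / (2 * var)
    # Second term: penalty (Lj / 2) log N with Lj = 2
    return nll + (2 / 2) * math.log(N)

d = gaussian_mdl([1.0, 1.2, 0.8, 1.1, 0.9, 1.0])
```

Note that for continuous densities the first term is a negative log of a density value, so the total description length can be negative.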
One of the advantages of the MDL criterion is that the
second term of the equation acts as a penalty imposed for
employing a large model. As a model becomes more complex, the
value of the first term (the negative log-likelihood) decreases
while that of the second term increases, so the criterion
balances goodness of fit against model size.
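This tradeoff can be illustrated with a small hypothetical sketch (not from the source): three nested Gaussian models of increasing size are scored on the same data, and the one with the minimum description length is selected. The first term can only decrease as parameters are added, while the penalty grows by (1/2) log N per parameter:

```python
import math

def nll_gauss(x, mu, var):
    """Negative log-likelihood of data x under a Gaussian N(mu, var)."""
    return 0.5 * len(x) * math.log(2 * math.pi * var) \
        + sum((v - mu) ** 2 for v in x) / (2 * var)

def description_length(x, num_params, nll):
    """Dj(x) = nll + (Lj / 2) log N."""
    return nll + (num_params / 2) * math.log(len(x))

x = [0.5, -1.2, 0.3, 0.9, -0.4, -0.7, 1.1, -0.2]
mu = sum(x) / len(x)                          # ML estimate of the mean
var = sum((v - mu) ** 2 for v in x) / len(x)  # ML estimate of the variance

# Three nested Gaussian models of increasing size (0, 1, and 2 free parameters).
candidates = {
    "N(0,1), 0 params": description_length(x, 0, nll_gauss(x, 0.0, 1.0)),
    "free mean, 1 param": description_length(x, 1, nll_gauss(x, mu, 1.0)),
    "free mean/var, 2 params": description_length(x, 2, nll_gauss(x, mu, var)),
}
best = min(candidates, key=candidates.get)  # model with minimum description length
```

For this data, which is already close to a standard normal, the likelihood gain from freeing the mean and variance is smaller than the added penalty, so MDL selects the simplest model.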
See:
Rissanen, J. (1984). Universal coding, information,
prediction, and estimation. IEEE Transactions on Information
Theory, 30:629–636.