Post under construction

Introduction

Kaldi “nnet3” is a robust framework for DNN acoustic modelling. In almost all the recipes, you can find examples of different configurations that can be adapted to your own task. However, understanding how to adapt the xconfig file to implement more sophisticated (and sometimes not so sophisticated) ideas is not a straightforward process.

nnet3 structure

An nnet3 neural network is constructed using a general graph structure consisting of:

  • A list of Components
  • A graph structure that specifies how the Components are connected

The network construction is based on a config file where the Components, nodes, inputs, and outputs are defined.
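To give an idea of what such a config file looks like, here is a small hand-written sketch in the final.config syntax (not generated by the tools; dimensions and names are illustrative). Note the separation between a component (the parameters) and a component-node (its place in the graph):

```
input-node name=input dim=40
component name=affine1 type=NaturalGradientAffineComponent input-dim=40 output-dim=512
component-node name=affine1 component=affine1 input=input
component name=relu1 type=RectifiedLinearComponent dim=512
component-node name=relu1 component=relu1 input=affine1
output-node name=output input=relu1 objective=linear
```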

TODO- add Index and Cindex description

xconfig to config

An xconfig is a simplified configuration file used to define the structure of the network. These files are parsed with the script steps/nnet3/xconfig_to_configs.py, passing the xconfig file and an output path, e.g.

config_dir=etc/chain/tdnn/configs
steps/nnet3/xconfig_to_configs.py --xconfig-file $config_dir/network.xconfig \
                                  --config-dir $config_dir/configs/ 
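After running this, the output directory contains the expanded configs. The listing below is an annotated sketch of what I typically see (file names may vary slightly between Kaldi versions, so check your own output):

```
ls $config_dir/configs/
# final.config        -> the full network definition used for training
# ref.config          -> a reduced version used to compute left/right context
# init.config         -> the initial layers (e.g. the LDA-like transform)
# vars                -> shell variables such as model_left_context
# network.xconfig, xconfig.expanded.1, xconfig.expanded.2
#                     -> the original xconfig and intermediate expansions
```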

Layers

This section explains how a line in an xconfig file is parsed into the final configuration of the network. Kaldi groups the layers into several kinds. I will list all of the layers, but only detail some of them.

basic_layers basic_layers.py
  • input
  • output (not real outputs, they just directly map to an output-node in nnet3)
  • output_layer (real output layer)
  • relu-layer
  • relu-renorm-layer
  • relu-batchnorm-dropout-layer
  • relu-dropout-layer
  • relu-batchnorm-layer
  • relu-batchnorm-so-layer
  • batchnorm-so-relu-layer
  • batchnorm-layer
  • sigmoid-layer
  • tanh-layer
  • fixed-affine-layer (an affine transform that is supplied at network initialization time and is not trainable)
  • affine-layer (fully connected layer)
  • idct-layer (to convert input MFCC-features to Filterbank features)
  • spec-augment-layer
convolution convolution.py
  • conv-batchnorm-layer
  • conv-renorm-layer
  • res-block (residual block as in ResNets)
  • res2-block (residual block with post-activations, with no support for downsampling)
  • SumBlockComponent (For channel averaging)
attention attention.py
  • attention-renorm-layer
  • attention-relu-renorm-layer
  • attention-relu-batchnorm-layer
  • relu-renorm-attention-layer
  • or any combination of relu, attention, sigmoid, tanh, renorm, batchnorm, dropout
lstm lstm.py
  • lstm-layer
  • lstmp-layer
  • lstmp-batchnorm-layer (followed by batchnorm)
  • fast-lstm-layer
  • fast-lstm-batchnorm-layer (followed by batchnorm)
  • lstmb-layer
  • fast-lstmp-layer
  • fast-lstmp-batchnorm-layer
gru gru.py
  • gru-layer (Gated recurrent unit)
  • pgru-layer (Personalized Gated Recurrent Unit)
  • norm-pgru-layer (batchnorm in the forward direction, renorm in the recurrence)
  • opgru-layer (Output-Gate Projected Gated Recurrent Unit) paper
  • norm-opgru-layer (batchnorm in the forward direction, renorm in the recurrence)
  • fast-gru-layer
  • fast-pgru-layer
  • fast-norm-pgru-layer (batchnorm in the forward direction, renorm in the recurrence)
  • fast-opgru-layer
  • fast-norm-opgru-layer
stats_layer stats_layer.py
  • stats-layer (adds statistics-pooling and statistics-extraction components)
trivial_layers trivial_layers.py
  • renorm-component
  • batchnorm-component
  • no-op-component
  • delta-layer
  • linear-component
  • combine-feature-maps-layer
  • affine-component
  • scale-component
  • offset-component
  • dim-range-component
composite_layers composite_layers.py
  • tdnnf-layer (factorized TDNN)
  • prefinal-layer
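As a concrete example of a composite layer, here is a tdnnf-layer line of the kind found in current chain recipes (the values are typical defaults I have seen; tune them for your own task):

```
# Factorized TDNN layer: "dim" is the layer width, "bottleneck-dim" the
# factorized linear bottleneck, "time-stride" the temporal splicing step
tdnnf-layer name=tdnnf2 l2-regularize=0.01 dropout-proportion=0.0 bypass-scale=0.66 dim=1024 bottleneck-dim=128 time-stride=1
```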

How to..

To construct a network definition in Kaldi, first define the network.xconfig file. Second, parse the xconfig into configs with the steps/nnet3/xconfig_to_configs.py script. Then run steps/nnet3/chain/train.py. This pipeline assumes that all the features and egs files already exist.
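The three steps can be sketched as follows (paths are placeholders for your own setup, and the train.py flags are abbreviated; see the recipe scripts for the full set):

```
dir=exp/chain/tdnn_sp

# 1. write the xconfig (see the next section)

# 2. expand it into the final configs
steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig \
                                  --config-dir $dir/configs/

# 3. train; assumes the features, tree and egs already exist
steps/nnet3/chain/train.py --feat-dir data/train_sp_hires ... --dir $dir
```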

The network.xconfig file construction

One way to construct the xconfig file is to write its lines into network.xconfig directly from the run_net.sh script, using a heredoc. This way you can set the different parameters with variables, keeping the script organized and easy to modify.

dir=exp/chain/tdnn_sp
mkdir -p $dir/configs

# Definition of the xconfig
cat <<EOF > $dir/configs/network.xconfig
  input ...
  ...
  output-layer ...
EOF
# Parse xconfig to final, init and ref configs
steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig \
                                  --config-dir $dir/configs/

Input layer

In the Kaldi recipes, the input layer commonly has dimension 40 (MFCC features). In most cases this is a hard-coded value for the dim parameter in the input layer definition. But sometimes you may want to experiment with vectors of a different size, so it is more convenient to have a dynamic value that automatically takes the vector size. You can get the vector size by calling the feat-to-dim binary as follows:

# Getting features vector dimension
feat_path=data/train_clean_sp_hires
feat_dim=`feat-to-dim scp:${feat_path}/feats.scp -`

# Using the feat_dim in xconfig
input dim=$feat_dim name=input

MFCC features

The most basic input layer in xconfig would be defined with the layer input. It is important to set the name of this layer to input.

 input dim=40 name=input

MFCC + iVectors features

If you want to concatenate iVectors with the MFCCs, you need to define another input layer called ivector and a fixed-affine-layer. In the following example, the notation inside the Append function assumes that an input layer named input exists, and the -1,0,1 notation will be expanded to input[-1], input[0], input[1].

 input dim=100 name=ivector
 input dim=40 name=input

 # please note that it is important to have input layer with the name=input
 # as the layer immediately preceding the fixed-affine-layer to enable
 # the use of short notation for the descriptor
 fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=foo/lda.mat
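After expansion, the short notation turns into explicit Descriptors in final.config. The line above becomes roughly the following (hand-expanded for illustration; check your own final.config for the exact form):

```
# Offset(input, -1) shifts the input one frame back in time; ReplaceIndex
# pins the ivector to t=0 so the same vector is appended to every frame
component-node name=lda component=lda input=Append(Offset(input, -1), input, Offset(input, 1), ReplaceIndex(ivector, t, 0))
```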

Using Filterbanks

Kaldi prefers to store MFCC features because they are more compact than filterbank features. So if, for example, you want to train a cnn-tdnn network, you need to transform the MFCCs into filterbanks to train the CNN part. To avoid storing both kinds of features, Kaldi provides the idct-layer, which converts MFCCs into filterbanks.

 input dim=40 name=input
 idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat

In the case of the CNN-TDNN example, the order of the layers is important. You can think of this ordering as a convention.

 input dim=100 name=ivector
 input dim=40 name=input
 
 # please note that it is important to have input layer with the name=input
 # as the layer immediately preceding the fixed-affine-layer to enable
 # the use of short notation for the descriptor
 fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
 idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
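From here, the CNN part would consume the idct output. The following is a hypothetical continuation (layer names and filter counts are illustrative, in the style of the Kaldi cnn-tdnn recipes): the 40 filterbank-like bins become the height dimension of the convolution.

```
# First conv layer: 3x3 kernel over (time, height), 48 output filter maps
conv-relu-batchnorm-layer name=cnn1 input=idct height-in=40 height-out=40 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=48
# Second conv layer: subsample the height dimension by 2 (40 -> 20)
conv-relu-batchnorm-layer name=cnn2 input=cnn1 height-in=40 height-out=20 height-subsample-out=2 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=64
```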

Multiview features

In some scenarios, you may want to add features at different levels, e.g. frame, utterance, speaker, recording party, and so on. To do this, you can concatenate the features as follows:

TODO add the example
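In the meantime, here is one possible sketch (all names and dimensions are hypothetical): a frame-level input plus an utterance-level embedding, appended the same way ivectors are.

```
# Hypothetical multiview setup: 40-dim frame-level MFCCs plus a 512-dim
# utterance-level embedding supplied as a second input
input dim=40 name=input
input dim=512 name=utt_embedding
# ReplaceIndex pins the utterance-level vector to t=0, so the same vector
# is appended to every frame (as is done with ivectors)
relu-batchnorm-layer name=tdnn1 dim=725 input=Append(input, ReplaceIndex(utt_embedding, t, 0))
```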

Multi-task

I will not explain here how to construct a multi-task setup, but Josh Meyer has a nice template you can follow: https://github.com/JRMeyer/multi-task-kaldi

TDNN layers

The following is an example of a common TDNN definition from the Librispeech recipe.

relu_dim=725
num_targets=$(tree-info $tree_dir/tree |grep num-pdfs|awk '{print $2}')
learning_rate_factor=$(echo "print (0.5/$xent_regularize)" | python)

cat <<EOF > $dir/configs/network.xconfig
  input dim=100 name=ivector
  input dim=40 name=input

  # please note that it is important to have input layer with the name=input
  # as the layer immediately preceding the fixed-affine-layer to enable
  # the use of short notation for the descriptor
  fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat

  # the first splicing is moved before the lda layer, so no splicing here
  relu-batchnorm-layer name=tdnn1 dim=$relu_dim
  relu-batchnorm-layer name=tdnn2 dim=$relu_dim input=Append(-1,0,1,2)
  relu-batchnorm-layer name=tdnn3 dim=$relu_dim input=Append(-3,0,3)
  relu-batchnorm-layer name=tdnn4 dim=$relu_dim input=Append(-3,0,3)
  relu-batchnorm-layer name=tdnn5 dim=$relu_dim input=Append(-3,0,3)
  relu-batchnorm-layer name=tdnn6 dim=$relu_dim input=Append(-6,-3,0)

  ## adding the layers for chain branch
  relu-batchnorm-layer name=prefinal-chain dim=$relu_dim target-rms=0.5
  output-layer name=output include-log-softmax=false dim=$num_targets max-change=1.5

  # adding the layers for xent branch
  # This block prints the configs for a separate output that will be
  # trained with a cross-entropy objective in the 'chain' models... this
  # has the effect of regularizing the hidden parts of the model.  we use
  # 0.5 / args.xent_regularize as the learning rate factor- the factor of
  # 0.5 / args.xent_regularize is suitable as it means the xent
  # final-layer learns at a rate independent of the regularization
  # constant; and the 0.5 was tuned so as to make the relative progress
  # similar in the xent and regular final layers.
  relu-batchnorm-layer name=prefinal-xent input=tdnn6 dim=$relu_dim target-rms=0.5
  output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor max-change=1.5
EOF
  steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/


Terminology

Some of the terms have a link to the definition on the deepai.org website.

Documentation of Components

If you check the final.config file after parsing the xconfig, you will see that several components have been inserted. Many of them are implicit in the definition of the network.

  1. The NaturalGradientAffineComponent component is the Natural Gradient for Stochastic Gradient Descent described in paper.

  2. The LinearComponent represents a linear (matrix) transformation of its input, with a matrix as its trainable parameters. It’s the same as NaturalGradientAffineComponent, but without the bias term.

  3. The TdnnComponent is a more memory-efficient alternative to manually splicing several frames of input.