A comparative analysis of Serbian phonemes: Linear and non-linear models

Danijela D. Protić, Centre for Applied Mathematics and Electronics

Keywords: AR model, Neural networks, Speech

Abstract


This paper presents the results of a comparative analysis of Serbian phonemes. Vowels are characterized by quasi-periodicity and clearly visible formants. Non-vowels are short-term quasi-periodic signals with a low-power excitation signal. For the purpose of this work, speech production systems were modelled with linear AR models and with corresponding non-linear models based on feed-forward neural networks with one hidden layer. Sum-squared-error minimization and the back-propagation algorithm were used to train the models. The selection of the optimal model was based on two stopping criteria: the normalized sum of squared errors on the test set and the final prediction error. The Levenberg-Marquardt method was used for the Hessian matrix calculation, and the Optimal Brain Surgeon method was used for pruning. The generalization properties, based on the time domain and the signal spectra of the outputs of hidden-layer neurons, are presented.

Introduction

Several years ago, neural networks (NNs) were applied to speech signal processing. Significant progress in this area is directed towards accelerating the convergence of the training algorithms. Besides the choice of NN structure, the choice of transfer function is also very important. Supervised learning, in which input data are paired with pre-defined outputs, requires an error (loss) function to evaluate the deviation between expected and actual data. Of the many algorithms available, this paper presents the back-propagation algorithm (BPA), with Optimal Brain Surgeon (OBS) pruning applied to the neural networks. Pruning is stopped when NSSETEST or FPE reaches its minimum. Vowels and non-vowels were analysed; women and men recited phonemes or spoke them in the context of words. The results presented here are prediction errors, the FPE gain, and cross-correlations of internal signals.
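As a minimal sketch of the error function mentioned above, the sum-squared-error (SSE) loss measures the deviation between the pre-defined (desired) outputs and the model's predictions. The sample values are illustrative placeholders, not data from the paper.

```python
# Sum-squared-error loss for supervised learning: the quantity minimized
# when training both the AR and FNN models. Values below are illustrative.

def sse(desired, predicted):
    """Sum of squared errors over one data set."""
    return sum((d - p) ** 2 for d, p in zip(desired, predicted))

desired = [1.0, 0.5, -0.5]
predicted = [0.8, 0.5, -0.1]
loss = sse(desired, predicted)  # approximately 0.2
```

Normalizing this sum by the number of samples gives the NSSE-type criteria used as stopping conditions.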

Models

When processing speech, if only the speech signal is known, an AR model with approximately (2n+1)·500 Hz poles, n = 0, 1, ..., can be used for prediction. However, if the glottal signal is also available, an AR model with an extra input (ARX) can be used. In addition, the AR moving average (ARMA) model is a potential choice if error correction is possible. There is also the problem of instability: a large error can destabilize the model. FNNs, however, are convenient non-linear models that can be pruned (i.e. have parameters rejected) to minimize the prediction error. The stopping criteria for pruning are reaching the minima of NSSETEST, NSSETRAIN or FPE. Non-linear models are, in general, more accurate, but their training lasts much longer than fitting AR models.
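The AR prediction described above can be sketched as follows: the next sample is a linear combination of the p previous samples. The coefficients here are arbitrary placeholders (an AR-3 for brevity; the paper uses AR-10), not values estimated in the paper.

```python
# One-step AR prediction: estimate the next sample from the p most
# recent samples. Coefficients are illustrative, not fitted values.

def ar_predict(samples, coeffs):
    """Predict the next sample; coeffs[0] weights the most recent sample."""
    p = len(coeffs)
    recent = samples[-p:][::-1]  # most recent sample first
    return sum(a * s for a, s in zip(coeffs, recent))

history = [0.1, 0.3, 0.2, 0.5]
coeffs = [0.6, 0.3, 0.1]       # hypothetical AR-3 coefficients
pred = ar_predict(history, coeffs)
```

The prediction error between `pred` and the actual next sample is what the SSE-based criteria accumulate over a data set.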

Model learning

The training of the FNN and AR models is carried out by the BPA, which adjusts the parameters. The Levenberg-Marquardt (LM) algorithm is used for the Hessian matrix calculation. The optimal step-by-step error change is approximated by a Taylor series; the second-order approximation does not indicate correlation between the inputs and the output error. MATLAB methods (nnarx and marq) are used for training. The AR-10 model is also trained: its ten inputs equal the number of the FNN's inputs, and the estimated output is obtained from the 10 previous samples. The initial parameter values are random. AR models are simple, stable, and do not require large computing resources; their prediction is based on the MSE. OBS pruning is used for the FNN's parameter rejection. Akaike's FPE is used to estimate Egen, the generalization error, for a given FNN (the number of parameters being known). To compare the AR and NNAR models, the FPE gain is introduced, i.e. the ratio of the MSE for AR-10 to the FPE for NNAR. Validation is performed for all vowels and all speakers, and the same process is carried out for non-vowels.
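A hedged sketch of Akaike's final prediction error in one common form, and of the FPE gain used to compare AR-10 with an NNAR model (gain = MSE of AR-10 divided by FPE of the NNAR). All numbers are illustrative placeholders, not results from the paper.

```python
# Akaike's FPE in one common form: the mean squared error inflated by a
# complexity factor that grows with the number of model parameters.

def fpe(sse, n_samples, n_params):
    """Final prediction error estimate of the generalization error."""
    mse = sse / n_samples
    return mse * (n_samples + n_params) / (n_samples - n_params)

mse_ar10 = 0.012                            # hypothetical MSE of AR-10
fpe_nnar = fpe(sse=1.5, n_samples=200, n_params=20)
gain = mse_ar10 / fpe_nnar                  # FPE gain; > 1 favours the NNAR
```

A gain above 1 would indicate that the pruned NNAR is expected to generalize better than the linear AR-10 baseline.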

Speech signals

The vocal and nasal tracts are parts of the speech-production system whose transfer functions are approximated by acoustic filters. The air, i.e. the lung excitation, passes through the vocal and nasal tracts and, depending on whether the vocal cords vibrate or not, produces vowels and non-vowels. The sound arises by radiation from the mouth and nose. Vowels are long-term quasi-periodic signals whose excitation is strong; the vocal cords vibrate during their production. For other phonemes, quasi-periodicity is negligible, and the excitation is a weak, low-power signal or a combination of such a signal with noise.

Results

For the training sets, the AR-10 and an FNN of structure 10-3-1 were trained. OBS pruning was performed with a maximum of 20 retraining iterations after each parameter rejection, using the nnprune algorithm. The results were the NSSE for the training and test sets, as well as the FPE. This paper presents structures whose pruning stopping criterion is based on NSSETEST and FPE reaching their minima. The NSSE for AR-10 was also calculated. Validation was carried out by the nnvalid function. For non-vowels, the FPE gain was calculated for both women and men. A new measure of distance between two signals was also presented. The spectra of the signals at the outputs of the hidden-layer neurons were compared, and a cross-correlation analysis, as well as the cumulative summation of the absolute values of the cross-correlations at small lags, was presented.
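The cross-correlation analysis above can be sketched as follows: normalized cross-correlation between two internal signals at small lags, followed by the cumulative sum of its absolute values. The signals here are short synthetic placeholders, not the paper's hidden-layer outputs.

```python
# Normalized cross-correlation at small lags and the cumulative sum of
# absolute values, as a rough analogue of the analysis described above.
import math

def xcorr(x, y, lag):
    """Normalized cross-correlation of x and y at a non-negative lag."""
    n = len(x) - lag
    num = sum(x[i + lag] * y[i] for i in range(n))
    den = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    return num / den

x = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]   # synthetic internal signals
y = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0]
lags = range(5)                        # small lags, up to the fifth order
cum_abs = sum(abs(xcorr(x, y, k)) for k in lags)
```

A small `cum_abs` over low lags would indicate the weak statistical dependence between internal signals reported in the conclusion.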

Conclusion

This paper presents a comparative analysis of Serbian phonemes (vowels and non-vowels). The FNN and AR-10 models were trained and tested. The characteristics of vowels are long-term quasi-periodicity and a power spectrum with clearly visible formants. Non-vowels are characterised by short-term quasi-periodicity and a low-power excitation signal. The generalization methodology enables the choice of network architectures with improved properties, based on pruning and a significant reduction in the number of model parameters. The limited architectures are characterized by a minimal number of parameters within the given margins of error. To review the discriminatory properties of the selected models, a new method for multi-dimensional scaling based on a measure of distance was developed. Analysis of discrimination loss suggests that the FNNs have a much higher discrimination power, which makes them usable in a wide class of speech-recognition tasks. Spectral analysis shows a good correlation between the signals at the outputs of the hidden-layer neurons and the input signal. The time-domain analysis indicates a weak statistical dependence of these signals for low orders of cross-correlation (up to the fifth order). The analyses indicate a slight advantage of the NSSETEST criterion over FPE; for a short training set, FPE is an acceptable criterion. The results indicate that the proposed FNN models and architectures with the best generalization properties provide highly accurate models. Their internally distributed structures correspond to the natural time-frequency content of the input signals and give high discrimination properties for the same number of parameters, as compared to traditional linear models.


Published: 2014/10/10
Section: Original Scientific Papers