A Study on the Redundancy Reduction in Speech Recognition

  • Chang-Young Lee (Dept. of Systems Management Engineering, Dongseo University)
  • Received : 2012.04.18
  • Accepted : 2012.06.07
  • Published : 2012.06.30

Abstract

The characteristic features of a speech signal do not vary significantly from frame to frame. It is therefore advisable to reduce the redundancy inherent in such similar feature vectors. The objective of this paper is to find the condition of minimum redundancy and maximum relevancy for the speech feature vectors used in speech recognition. To this end, we realize redundancy reduction by means of a vigilance parameter and investigate its effect on speaker-independent recognition of isolated words using FVQ/HMM (fuzzy vector quantization with hidden Markov models). Experimental results showed that the number of feature vectors could be reduced by about 30% without degrading recognition accuracy.

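The redundancy-reduction idea described in the abstract can be sketched as follows: retain a frame's feature vector only if it differs sufficiently from the last retained vector, with the threshold playing the role of the paper's vigilance parameter. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the Euclidean distance measure, and the threshold semantics are assumptions.

```python
import math


def reduce_redundancy(frames, vigilance):
    """Drop near-duplicate feature vectors from a frame sequence.

    A frame is kept only if its Euclidean distance from the last
    retained frame is at least `vigilance` (hypothetical stand-in
    for the paper's vigilance parameter).
    """
    if not frames:
        return []
    kept = [frames[0]]  # always keep the first frame
    for vec in frames[1:]:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(kept[-1], vec)))
        if dist >= vigilance:  # sufficiently novel -> retain
            kept.append(vec)
    return kept
```

With a larger vigilance value, more consecutive frames are judged redundant and discarded, so the retained-vector count (and hence recognition cost) decreases; the paper's experiments probe how far this pruning can go before accuracy suffers.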
