Improvement of Naturalness for a HMM-based Korean TTS using the prosodic boundary information

A Study on Improving the Naturalness of an HMM-based Korean TTS Using Prosodic Boundary Information

  • Lim, Gi-Jeong (School of Electrical Engineering, University of Ulsan)
  • Lee, Jung-Chul (School of Electrical Engineering, University of Ulsan)
  • Received : 2012.07.31
  • Accepted : 2012.08.30
  • Published : 2012.09.30

Abstract

HMM-based Text-to-Speech systems generally use context-dependent tri-phone units drawn from a large-corpus speech DB to enhance synthetic speech quality. To downsize such a DB, acoustically similar tri-phone units are clustered with a decision tree driven by context-dependent information. This context-dependent information includes not only the phoneme sequence but also prosodic information, because the naturalness of synthetic speech depends heavily on prosody such as pauses, intonation patterns, and segmental durations. However, if the prosodic information is too complicated, many context-dependent phonemes will have no examples in the training data, and clustering will produce over-smoothed features that yield unnatural synthetic speech. In this paper, instead of complicated prosodic information, we propose three simple prosodic boundary types and corresponding decision-tree questions based on rising, falling, and monotonic tones to improve naturalness. Experimental results show that the proposed method improves the naturalness of an HMM-based Korean TTS and achieves a high MOS in the perception test.

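The abstract describes replacing complicated prosodic labels with just three boundary-tone types (rising, falling, monotonic) and expressing them as decision-tree questions. A minimal sketch of that idea is below; the F0-slope threshold, the label tag names, and the classification function are all illustrative assumptions, not the paper's actual implementation, though the `QS` question syntax follows the HTS toolkit convention cited in the references.

```python
# Hypothetical sketch: map a phrase-final F0 contour to one of the three
# proposed prosodic boundary types, and define HTS-style decision-tree
# questions over context-dependent labels carrying that boundary tag.

def boundary_tone(f0_contour, eps=5.0):
    """Classify a final-syllable F0 contour (Hz) as rising/falling/monotonic.

    eps is an assumed slope threshold in Hz; the paper does not specify one.
    """
    slope = f0_contour[-1] - f0_contour[0]
    if slope > eps:
        return "rising"
    if slope < -eps:
        return "falling"
    return "monotonic"

# HTS-style question set: each question asks whether a context-dependent
# phoneme label contains a given boundary-tone tag (tag names illustrative).
QUESTIONS = [
    ('QS "Boundary-Rising"',    ["*|tone=rising|*"]),
    ('QS "Boundary-Falling"',   ["*|tone=falling|*"]),
    ('QS "Boundary-Monotonic"', ["*|tone=monotonic|*"]),
]

print(boundary_tone([180.0, 190.0, 210.0]))  # → rising
```

With only three mutually exclusive boundary questions, every context-dependent phoneme in the training data matches one of them, which avoids the sparse-context smoothing problem the abstract identifies.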

References

  1. K. Tokuda, H. Zen, and A.W. Black, "An HMM based approach to multilingual speech synthesis," Text to speech synthesis: New paradigms and advances, S. Narayanan, A. Alwan (Eds.), Prentice Hall, pp.135-153, Aug. 2004.
  2. A.W. Black, H. Zen, and K. Tokuda, "Statistical parametric speech synthesis," Proc. ICASSP 2007, vol. 4, pp. 1229-1232, Apr. 2007.
  3. H.C. Lee and J.M. Seo, "A Study of Implementing an Embedded System for Conversion from Text to Speech," Journal of the Korea Society of Computer and Information, vol. 13, no. 3, pp. 77-83, May 2008.
  4. S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X.-Y. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, "The Hidden Markov Model Toolkit (HTK)," http://htk.eng.cam.ac.uk/
  5. K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, A.W. Black, and T. Nose, "The HMM based speech synthesis system (HTS)," http://hts.sp.nitech.ac.jp/
  6. A.W. Black, P. Taylor, and R. Caley, "The festival speech synthesis system," http://www.festvox.org/festival/
  7. S. Kim, J. Kim, and M. Hahn, "HMM-Based Korean Speech Synthesis System for Hand-held Devices," IEEE Trans. Consumer Electronics, vol. 52, no. 4, pp.1384-1390, Nov. 2006. https://doi.org/10.1109/TCE.2006.273160
  8. J. Lee, "A Tree-based Reduction of Speech DB in a Large Corpus-based Korean TTS," Journal of the Korea Society of Computer and Information, vol. 15, no. 7, pp. 91-98, Jul. 2010. https://doi.org/10.9708/jksci.2010.15.7.091
  9. S. Imai, "Cepstral analysis synthesis on the mel-frequency scale," Proc. ICASSP, vol. 1, pp. 93-96, Apr. 1983.
  10. K. Tokuda, T. Masuko, T. Yamada, T. Kobayashi, and S. Imai, "An Algorithm for Speech Parameter Generation from Continuous Mixture HMMs with Dynamic Features," Proc. EUROSPEECH, vol. 1, pp. 757-760, Sep. 1995.
  11. J. Latorre et al., "Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?," Proc. ICASSP, pp. 4724-4727, May 2011.
  12. Q. Zhang, F. Soong, Y. Qian, Z. Yan, J. Pan, and Y. Yan, "Improved modeling for F0 generation and V/U decision in HMM-based TTS," Proc. ICASSP, pp. 4606-4609, Mar. 2010.
  13. K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Multi-space probability distribution HMM (Invited paper)," IEICE Trans. Inf. & Syst., vol. E85-D, no. 3, pp. 455-464, Mar. 2002.
  14. S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustic modeling," Proc. ARPA Human Language Technology Workshop, pp. 307-312, Mar. 1994.
  15. K. Shinoda and T. Watanabe, "MDL-based context-dependent subword modeling for speech recognition," J. Acoust. Soc. Jpn. (E), vol. 21, no. 2, pp. 79-86, Feb. 2000. https://doi.org/10.1250/ast.21.79
  16. K. Shinoda and T. Watanabe, "Acoustic modeling based on the MDL criterion for speech recognition," Proc. Eurospeech, vol. 1, pp. 99-102, Sep. 1997.

Cited by

  1. How to Express Emotion: Role of Prosody and Voice Quality Parameters vol.19, pp.11, 2012, https://doi.org/10.9708/jksci.2014.19.11.159
  2. A Korean Sentence Symbol Preprocessing System for Improving the Quality of a Speech Synthesis System vol.20, pp.2, 2012, https://doi.org/10.9708/jksci.2015.20.2.149