Analysis of Korean Language Parsing System and Speed Improvement of Machine Learning using Feature Module

Performance Improvement of Machine Learning Using Korean Dependency Analysis and Feature Set Partitioning

  • Received : 2014.07.09
  • Accepted : 2014.07.31
  • Published : 2014.08.25

Abstract

Korean parsing systems have recently been studied extensively by software engineers and linguists. Such systems are mainly built on either machine learning or the symbol-processing paradigm. Parsers based on machine learning, however, require long training times because Korean sentence data is very large, and their recognition rate is limited by errors in the data itself. In this study we design a system that uses feature modules to reduce training time, and we analyze the recognition rate as a function of the number of training sentences and the number of training iterations. The designed system uses separate modules and sorted tables for binary search. We use 36,090 refined sentences extracted from the Sejong Corpus. Training time was reduced to about three hours. The highest recognition rate, 84.54%, was obtained when 10,000 sentences were trained for 50 iterations; when all 32,481 training sentences were trained for 10 iterations, the recognition rate was 82.99%. We conclude that it is more efficient to use refined data and to repeat training until the system reaches a steady state.

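Research on parsing systems for Korean dependency relations has recently been pursued in various ways by software engineers and linguists, and such systems are mainly implemented using machine learning or symbolism. Because Korean sentence data is very large, machine-learning methods inherently require very long training times, and errors in the data itself limit the recognition rate. In this study, we propose a method for machine-learning-based systems that partitions the features into feature-set modules to reduce training time, and we analyze the recognition rate according to the number of sentences and the number of iterations. The designed system uses separated modules and a sorting technique for binary search. The data consist of 36,090 refined sentences extracted from the Sejong Corpus. Training time was reduced to about 3 hours, and the recognition rate was highest, at 84.54%, when 10,000 sentences were trained 50 times. When all training sentences (32,481) were trained 10 times, the recognition rate was 82.99%. In conclusion, using refined data and repeating training until the system stabilizes was more efficient.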
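
The abstract describes splitting features into separate modules and using sorted tables for binary search. The following Python sketch is one way to picture that idea, not the authors' implementation. The class `FeatureModule`, the module names `head_pos` and `pos_pair`, the feature keys, and the perceptron-style weight update are all hypothetical choices made for illustration; the abstract does not state which learning rule or feature templates the system uses.

```python
from bisect import bisect_left


class FeatureModule:
    """One feature group whose keys are kept sorted so lookups can use binary search."""

    def __init__(self, keys):
        # Sort and de-duplicate once, up front; weights live in a parallel list.
        self.keys = sorted(set(keys))
        self.weights = [0.0] * len(self.keys)

    def index_of(self, key):
        # bisect_left returns an insertion point, so confirm the key is really there.
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return -1

    def score(self, key):
        # Unknown features contribute nothing to the score.
        i = self.index_of(key)
        return self.weights[i] if i >= 0 else 0.0

    def update(self, key, delta):
        # Adjust the weight of a known feature; unknown features are ignored.
        i = self.index_of(key)
        if i >= 0:
            self.weights[i] += delta


# Hypothetical modules, one per feature template (names and keys are illustrative).
modules = {
    "head_pos": FeatureModule(["JKO", "NNG", "VV"]),
    "pos_pair": FeatureModule(["NNG|VV", "JKO|VV", "NNG|NNG"]),
}


def score_arc(features):
    """Sum the weights of (module_name, key) features for a candidate dependency arc."""
    return sum(modules[m].score(k) for m, k in features)


# Assumed perceptron-style update: reward the gold arc, penalize the wrong prediction.
gold = [("head_pos", "VV"), ("pos_pair", "NNG|VV")]
pred = [("head_pos", "NNG"), ("pos_pair", "NNG|NNG")]
for m, k in gold:
    modules[m].update(k, +1.0)
for m, k in pred:
    modules[m].update(k, -1.0)

print(score_arc(gold), score_arc(pred))
```

Each lookup touches only the module that owns the feature, and binary search over that module's sorted keys takes time logarithmic in the module size, which is one plausible reason such a layout could shorten training.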
