Korean Language Clustering using Word2Vec

Korean title (translated): A Korean Word Clustering Technique Using Word2Vec

  • Heu, Jee-Uk (Dept. of Computer Engineering, Hanyang University)
  • Received : 2018.08.21
  • Accepted : 2018.10.05
  • Published : 2018.10.31

Abstract

With the recent growth of the Internet, research areas such as information retrieval and data extraction have become increasingly important, because users expect efficient search results that let them find the information they want quickly. Newly coined Korean words, compound words, and buzzwords are hard to interpret, however, so techniques are needed that find and analyze words semantically similar to a given word. Word clustering is one such technique: it finds the words in a document that are semantically similar to a given word and groups them together. In this paper, we propose a Korean word clustering technique that embeds the words of the given Korean documents with Word2Vec and automatically clusters similar Korean words.
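The abstract does not spell out how the embedded words are grouped, so the following is only a minimal sketch of the general idea, assuming word vectors have already been produced by Word2Vec. It gathers the words whose cosine similarity to a query word exceeds a threshold. The three-dimensional vectors, the threshold value, and the names `EMBEDDINGS`, `cosine`, and `similar_words` are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math

# Hypothetical toy vectors standing in for trained Word2Vec embeddings.
# Real embeddings would come from a model trained on a tokenized Korean corpus
# and would typically have around a hundred dimensions or more.
EMBEDDINGS = {
    "학교": [0.90, 0.10, 0.00],    # school
    "학생": [0.80, 0.20, 0.10],    # student
    "선생님": [0.85, 0.15, 0.05],  # teacher
    "사과": [0.10, 0.90, 0.10],    # apple
    "바나나": [0.05, 0.95, 0.20],  # banana
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_words(query, embeddings, threshold=0.9):
    """Return (word, similarity) pairs at or above the threshold, most similar first."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Group the words most similar to "학교" (school).
cluster = similar_words("학교", EMBEDDINGS)
for word, score in cluster:
    print(f"{word}\t{score:.3f}")
```

With these toy vectors, the school-related words are grouped with the query while the fruit words fall below the threshold. A real pipeline would first tokenize the Korean text with a morphological analyzer and train the embedding model on the result, which this sketch leaves out.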
