Related Documents Classification System by Similarity between Documents

문서 유사도를 통한 관련 문서 분류 시스템 연구

  • Jeong, Jisoo (Department of Software Convergence, Sejong University) ;
  • Jee, Minkyu (Department of Software Convergence, Sejong University) ;
  • Go, Myunghyun (Department of Digital Contents, Sejong University) ;
  • Kim, Hakdong (Department of Digital Contents, Sejong University) ;
  • Lim, Heonyeong (Department of Digital Contents, Sejong University) ;
  • Lee, Yurim (Department of Artificial Intelligence and Linguistic Engineering, Sejong University) ;
  • Kim, Wonil (Department of Software, Sejong University)
  • 정지수 (세종대학교 소프트웨어융합학과) ;
  • 지민규 (세종대학교 소프트웨어융합학과) ;
  • 고명현 (세종대학교 디지털콘텐츠학과) ;
  • 김학동 (세종대학교 디지털콘텐츠학과) ;
  • 임헌영 (세종대학교 디지털콘텐츠학과) ;
  • 이유림 (세종대학교 인공지능언어공학과) ;
  • 김원일 (세종대학교 소프트웨어학과)
  • Received : 2018.11.16
  • Accepted : 2019.01.10
  • Published : 2019.01.30

Abstract

This paper proposes a method that uses machine learning to analyze previously collected documents and to classify documents on that basis. Data are collected using keywords associated with a specific domain, and stopwords such as special characters are removed. A Korean morphological analyzer then tags each word of the collected documents with its part of speech, such as noun, verb, or adjective. The documents are embedded with a Doc2Vec model, which converts documents into vectors. The similarity between documents is measured through the embedding model, and document classifiers are trained with machine learning algorithms. In a comparison of the trained classification models, the support vector machine performed best, with an F1-score of 0.83.
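As a rough illustration of the similarity step described in the abstract, the sketch below ranks documents by cosine similarity between their vectors. Cosine similarity is an assumption here (it is a common choice for Doc2Vec embeddings, but the abstract does not name the measure), and the three-dimensional vectors and document IDs are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def most_similar(query, doc_vectors):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; a real Doc2Vec model would produce far more dimensions.
doc_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
ranking = most_similar([1.0, 0.0, 0.0], doc_vectors)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because the score depends only on the angle between vectors, documents of different lengths can be compared once they share the same embedding space.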

Keywords

Fig. 1. Understanding the meaning between words through vector distance

Fig. 2. Overall design of the document classifier

Fig. 3. Data preprocessing for the document classifier
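A minimal sketch of the special-character removal mentioned in the abstract, using only the standard library. The exact character set the authors removed is not given, so the rule below (keep Hangul syllables, ASCII letters, digits, and whitespace) is an assumption, and the sample sentence is hypothetical.

```python
import re

# Assumption: anything that is not a Hangul syllable, an ASCII letter,
# a digit, or whitespace counts as noise.
_NOISE = re.compile(r"[^0-9A-Za-z\uac00-\ud7a3\s]")
_SPACES = re.compile(r"\s+")


def clean_text(text):
    """Replace noise characters with spaces, then collapse repeated whitespace."""
    without_noise = _NOISE.sub(" ", text)
    return _SPACES.sub(" ", without_noise).strip()


# Hypothetical headline of the kind a keyword-based crawler might collect.
sample = "식품 안전!! 리콜(recall) 발표 #뉴스 2018"
cleaned = clean_text(sample)
print(cleaned)
```

Replacing noise with a space rather than deleting it keeps words on either side of a symbol from being fused, which matters for the morphological analysis that follows.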

Fig. 4. Design of the web crawler that collects documents based on keywords

Fig. 5. Doc2Vec model structure (PV-DM)

Fig. 6. Machine learning classifier experiment

Table 1. Examples of labeled data for hazard-related news and other news

Table 2. Doc2Vec parameters

Table 3. Comparative performance analysis of the classification models
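For readers unfamiliar with the metric used to compare the models, the sketch below computes precision, recall, and F1-score for a binary labeling. It assumes a single positive class (for example, hazard-related news labeled 1); whether the reported 0.83 is a binary or an averaged score is not stated here, and the label lists are hypothetical rather than taken from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions (1 = hazard-related, 0 = other).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

The harmonic mean penalizes a model that trades one of precision or recall for the other, which is why F1 is often preferred over accuracy when the classes are imbalanced.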
