Research on Text Classification of Research Reports using Korea National Science and Technology Standards Classification Codes


  • Choi, Jong-Yun (Department of Computer Engineering, Kumoh National Institute of Technology) ;
  • Hahn, Hyuk (Korea Institute of Science and Technology Information) ;
  • Jung, Yuchul (Department of Computer Engineering, Kumoh National Institute of Technology)
  • Received : 2019.09.05
  • Accepted : 2020.01.03
  • Published : 2020.01.31

Abstract

In South Korea, the results of science and technology R&D are submitted to the National Science and Technology Information Service (NTIS) as research reports carrying Korea National Science and Technology Standard Classification (K-NSCC) codes. However, with more than 2,000 sub-categories, choosing the correct classification code is non-trivial without a clear understanding of the K-NSCC. In addition, there has been little research on automatic document classification based on the K-NSCC, and no training data are publicly available. To the best of our knowledge, this study is the first attempt to build a high-performing K-NSCC classification system from NTIS report meta-information covering the last five years (2013-2017). To this end, about 210 mid-level categories were selected, and preprocessing was carried out that takes the characteristics of research report metadata into account. More specifically, we propose a convolutional neural network (CNN) technique that uses only task names and keywords, the most influential fields. The proposed model is compared with several machine learning methods that perform well in text classification (e.g., the linear support vector classifier, CNN, and gated recurrent unit) and shows a performance advantage of 1% to 7% in terms of the top-3 F1 score.

The results of research and development in science and technology are submitted to the National Science and Technology Information Service (NTIS) in the form of research reports. Each report carries a classification code under the Korea National Science and Technology Standard Classification (K-NSCC), which the report author enters manually at submission time. With more than 2,000 sub-categories, however, it is easy to choose an incorrect code without a precise understanding of the classification scheme. Given the volume and diversity of newly collected research reports, classifying them automatically and more accurately would not only reduce the burden on report submitters but also make it easier to link the reports with other value-added analysis services. Nevertheless, there have been almost no domestic studies on automatic document classification based on the science and technology standard classification, and no public training data exist. As the first attempt to exploit the NTIS research report meta-information held by KISTI for the last five years (2013-2017), this study set out to derive an automatic document classification technique that performs well on Korean research reports under the extensive science and technology standard classification. To this end, about 210 mid-level categories suitable for classifying science and technology research reports were selected from the K-NSCC, and preprocessing was carried out that takes the characteristics of research report metadata into account. In particular, we propose a TK_CNN-based deep learning technique that uses only the task name (title) and keywords, the most influential fields. The proposed model was compared with machine learning methods that perform well in text classification (e.g., Linear SVC, CNN, and GRU) and showed a performance advantage of 1% to 7% in terms of the top-3 F1 score.
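The abstract describes the proposed classifier only at a high level: a CNN that consumes just the task name (title) and keyword fields and predicts one of roughly 210 mid-level K-NSCC categories. As a rough illustration of what such a two-input, Kim-style CNN could look like, the following Keras sketch is offered; the vocabulary size, sequence lengths, embedding dimension, and filter settings are placeholder assumptions, and the authors' actual TK_CNN architecture may differ.

```python
# Illustrative sketch only (not the authors' exact TK_CNN): a Kim-style CNN
# that encodes the task name and the keyword list as two separate inputs,
# then concatenates the pooled features before a softmax over the ~210
# mid-level K-NSCC categories. All sizes below are assumed placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 50_000              # assumed vocabulary size after tokenization
TITLE_LEN, KEYWORD_LEN = 30, 20  # assumed padded sequence lengths
EMBED_DIM = 200                  # e.g., a pre-trained word2vec dimension
NUM_CLASSES = 210                # mid-level K-NSCC categories used in the paper

def conv_block(x, kernel_sizes=(2, 3, 4), filters=128):
    """Parallel 1-D convolutions with global max pooling, as in Kim (2014)."""
    pooled = []
    for k in kernel_sizes:
        c = layers.Conv1D(filters, k, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    return layers.concatenate(pooled)

title_in = layers.Input(shape=(TITLE_LEN,), name="task_name")
kw_in = layers.Input(shape=(KEYWORD_LEN,), name="keywords")

embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM)  # could be initialized from word2vec
title_feat = conv_block(embed(title_in))
kw_feat = conv_block(embed(kw_in))

merged = layers.Dropout(0.5)(layers.concatenate([title_feat, kw_feat]))
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = Model(inputs=[title_in, kw_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Sharing one embedding layer between the title and keyword branches keeps the parameter count small; separate embeddings per field would be an equally plausible design choice given only what the abstract states.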
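The comparison against the Linear SVC, CNN, and GRU baselines is reported in terms of a top-3 F1 score. The abstract does not spell out how that metric is computed, so the snippet below shows one common interpretation, in which a prediction is credited to the gold class whenever the gold class appears among the model's three highest-scoring classes; the function name and the dummy data are illustrative only.

```python
# One possible reading of "top-3 F1": if the gold label is among the three
# highest-scoring classes, the prediction is counted as correct (credited to
# the gold class); otherwise the top-1 prediction is kept. The authors' exact
# definition may differ.
import numpy as np
from sklearn.metrics import f1_score

def top_k_f1(y_true, y_prob, k=3, average="micro"):
    """y_true: (n,) integer labels; y_prob: (n, n_classes) predicted scores."""
    top_k = np.argsort(y_prob, axis=1)[:, -k:]             # indices of the k best classes
    hit = (top_k == y_true[:, None]).any(axis=1)           # is the gold label within top-k?
    y_pred = np.where(hit, y_true, y_prob.argmax(axis=1))  # credit the hit, else keep top-1
    return f1_score(y_true, y_pred, average=average)

# Example with dummy scores for a 210-class problem:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 210, size=1000)
y_prob = rng.random((1000, 210))
print(top_k_f1(y_true, y_prob, k=3))
```

Under this interpretation, the top-3 score can never be lower than the ordinary top-1 F1 score, which is why relaxed top-k metrics are often preferred for large, fine-grained label sets such as the K-NSCC mid-level categories.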
