Comparison of term weighting schemes for document classification

  • Jeong, Ho Young (Department of Statistics, Pusan National University) ;
  • Shin, Sang Min (Department of Management Information Systems, Dong-A University) ;
  • Choi, Yong-Seok (Department of Statistics, Pusan National University)
  • Received : 2019.01.03
  • Accepted : 2019.02.12
  • Published : 2019.04.30

Abstract

The document-term frequency matrix is a common data format in text mining. In this study, we introduce TF-IDF (term frequency-inverse document frequency), a traditional term weighting scheme that is applied to the document-term frequency matrix and used for text classification. We also introduce and compare two more recently proposed schemes, TF-IDF-ICSDF and TF-IGM. In addition, we present a keyword extraction method intended to improve the quality of text classification. Based on the extracted keywords, we applied a support vector machine for text classification. To compare the performance of the term weighting schemes, we used performance metrics such as precision, recall, and F1-score. The results show that the TF-IGM scheme scored high on these metrics and was the most suitable scheme for text classification.

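The document-term frequency matrix is a common data format in text mining that holds information on the objects to be analyzed. In this study, we introduce TF-IDF, the established term weight applied to the document-term frequency matrix for document classification. We additionally introduce and compare the definitions, strengths, and weaknesses of TF-IDF-ICSDF and TF-IGM, two more recently proposed term weights. We also present a method for extracting keywords in order to improve the quality of document classification analysis. Based on the extracted keywords, we used the support vector machine, one of the machine learning algorithms most widely used for document classification. To compare the performance of the term weights introduced in this study, we used performance metrics such as precision, recall, and F1-score. As a result, the TF-IGM method showed high values on all performance metrics and proved to be the best-suited method for classifying text.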
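
As a rough illustration of the TF-IDF weighting mentioned in the abstract, the sketch below turns a small document-term frequency matrix into a weighted matrix. It assumes the plain textbook form, raw term frequency times log(N / df); the exact variant used in the paper is not stated here, and the function name `tf_idf` and the toy counts are hypothetical.

```python
import math


def tf_idf(counts):
    """Weight a document-term frequency matrix (rows = documents, columns = terms).

    Assumes the plain form tf * log(N / df); other TF-IDF variants exist.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Terms that occur in no document get an IDF of 0 to avoid division by zero.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


# Hypothetical toy matrix: 3 documents, 3 terms.
counts = [
    [2, 1, 0],
    [0, 1, 0],
    [0, 1, 3],
]
weights = tf_idf(counts)
```

In this toy matrix the second term occurs in every document, so its weight is zero throughout, while the first and third terms each occur in a single document and keep a positive weight.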

Keywords

Figure 2.1. The process of finding an elbow point.
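
One simple way to locate an elbow point in a decreasing curve of scores can be sketched as follows. This is a generic geometric heuristic (the point farthest from the straight line joining the first and last points), offered only as an assumption about how such a cutoff might be found; it is not the procedure shown in the figure. The function name `elbow_index` and the sample scores are hypothetical.

```python
def elbow_index(values):
    """Return the index of the point farthest from the chord joining the endpoints.

    The index is used as the x coordinate and the value as the y coordinate,
    without any rescaling, so the result depends on the scale of the values.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three points")
    x0, y0 = 0.0, float(values[0])
    x1, y1 = float(n - 1), float(values[-1])
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5
    best_i, best_d = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance from the point (i, y) to the chord.
        d = abs(dy * (i - x0) - dx * (y - y0)) / norm
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Hypothetical term scores sorted in decreasing order.
scores = [10.0, 6.0, 3.0, 2.0, 1.5, 1.2, 1.0]
k = elbow_index(scores)  # k == 2
```

With these sample scores the elbow falls at index 2, so the first three terms would be kept as candidate keywords under this heuristic.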

Figure 3.1. Performance comparison of methods M1–M6 across all categories.
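
For readers unfamiliar with the metrics behind such a comparison, the sketch below computes precision, recall, and F1-score for a single category from true and predicted labels. The labels are made up for illustration and are not results from the paper; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1-score for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Undefined ratios (zero denominators) are reported as 0.0 by convention here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels for six documents in three categories.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "A", "C"]
p, r, f = precision_recall_f1(y_true, y_pred, positive="A")
```

For category "A" in this toy example there are two true positives, one false positive, and one false negative, so precision, recall, and F1-score all equal 2/3.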

Table 2.1. IGM calculation example
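
Since the table itself is not reproduced here, the following sketch shows how an inverse gravity moment (IGM) is commonly computed, assuming the usual published definition: a term's per-class frequencies are sorted in decreasing order, and the largest frequency is divided by the sum of each frequency times its rank. The TF-IGM weight is assumed to be tf * (1 + lambda * igm); the default `lam=7.0` and the example frequencies are assumptions for illustration, not values taken from the table.

```python
def igm(class_freqs):
    """Inverse gravity moment of one term from its per-class frequencies."""
    ranked = sorted(class_freqs, reverse=True)
    if ranked[0] == 0:
        # The term does not occur in any class.
        return 0.0
    # Gravity moment: each frequency weighted by its rank (1 = most frequent class).
    moment = sum(f * rank for rank, f in enumerate(ranked, start=1))
    return ranked[0] / moment


def tf_igm(tf, class_freqs, lam=7.0):
    """TF-IGM weight, assuming the form tf * (1 + lam * igm)."""
    return tf * (1.0 + lam * igm(class_freqs))


# Hypothetical per-class frequencies for two terms over three classes.
concentrated = igm([10, 0, 0])  # 10 / 10 = 1.0
uniform = igm([5, 5, 5])        # 5 / (5 + 10 + 15) = 1/6
```

A term concentrated in a single class reaches the maximum IGM of 1, whereas a term spread evenly over the classes gets a much smaller value, which is why the weight favors class-discriminating terms.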

Table 2.2. Generation schemes M1–M6 for the weighted document-term matrix

Table 3.1. PDF files and terms of periodical publications by institute

Table 3.2. Performance comparison of methods M1–M6 for individual categories
