Comparison of term weighting schemes for document classification

Jeong, Ho Young; Shin, Sang Min; Choi, Yong-Seok

doi:10.5351/KJAS.2019.32.2.265

The Korean Journal of Applied Statistics (응용통계연구)

Volume 32, Issue 2, Pages 265-276, 2019
pISSN 1225-066X, eISSN 2383-5818

The Korean Statistical Society (한국통계학회)

Jeong, Ho Young (Department of Statistics, Pusan National University) ;
Shin, Sang Min (Department of Management Information Systems, Dong-A University) ;
Choi, Yong-Seok (Department of Statistics, Pusan National University)

Received : 2019.01.03
Accepted : 2019.02.12
Published : 2019.04.30

Abstract

The document-term frequency matrix is the standard data structure for representing documents in text mining. In this study, we review TF-IDF (term frequency-inverse document frequency), the traditional term weighting scheme applied to the document-term frequency matrix for text classification. We also introduce two recently proposed schemes, TF-IDF-ICSDF and TF-IGM, describe their definitions, strengths, and weaknesses, and compare them with TF-IDF. In addition, we present a keyword extraction method intended to improve the quality of text classification. Using the extracted keywords, we classified documents with a support vector machine, one of the most widely used machine learning algorithms for document classification. To compare the term weighting schemes, we used precision, recall, and F1-score as performance metrics. The TF-IGM scheme scored high on all of these metrics and was the best-performing scheme for text classification in this study.
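As a rough illustration of the weighting schemes named in the abstract, the sketch below computes TF-IDF and TF-IGM weights for a tokenized corpus. It is not the authors' implementation. The TF-IGM form used here, tf × (1 + λ · igm) with igm = f_1 / Σ_r (f_r · r) over class-specific document frequencies sorted in descending order, follows the definition as commonly attributed to Chen et al. (2016); the default λ = 7.0, the unsmoothed log(N / df) form of IDF, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    `docs` is a list of token lists. The unsmoothed IDF is an assumption;
    other variants (smoothing, log-scaled tf) are also common.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def igm(class_freqs):
    """Inverse gravity moment of a term's per-class frequencies.

    Frequencies are ranked in descending order (rank r = 1, 2, ...), and
    igm = f_1 / sum_r (f_r * r). A term concentrated in one class gives 1.0.
    """
    ranked = sorted(class_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment > 0 else 0.0


def tf_igm(docs, labels, lam=7.0):
    """Return one {term: weight} dict per document, using tf * (1 + lam * igm).

    `labels` gives the class of each training document. Class-specific
    document frequencies are used as the per-class term frequencies
    (an assumption of this sketch).
    """
    classes = sorted(set(labels))
    class_df = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        class_df[label].update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: tf[t] * (1.0 + lam * igm([class_df[c][t] for c in classes]))
            for t in tf
        })
    return weights
```

Unlike TF-IDF, the IGM factor uses the class labels of the training documents, so a term that occurs in every class equally often receives a small weight even if it is rare in the corpus as a whole.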

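The performance metrics mentioned in the abstract can be computed per category from true and predicted labels. The following is a minimal sketch under the usual one-vs-rest definitions; the function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example, not details taken from the paper.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) for one category treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Calling the function once per category gives per-category scores; how those are aggregated over all categories (for example by macro- or micro-averaging) is a separate choice that this sketch does not make.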
Keywords

Figure 2.1. The process of finding an elbow point.

Figure 3.1. Performance comparison of methods M1–M6 across all categories.

Table 2.1. IGM calculation example.

Table 2.2. Document-term weighted matrix generation schemes M1–M6.

Table 3.1. PDF files and terms of periodical publications by institute.

Table 3.2. Performance comparison of methods M1–M6 for individual categories.

References

Chen, K. and Zong, C. (2003). A new weighting algorithm for linear classifier. In Proceedings of 2003 International Conference on Natural Language Processing and Knowledge Engineering, 650-655.
Chen, K., Zhang, Z., Long, J., and Zhang, H. (2016). Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Systems with Applications, 66, 245-260. https://doi.org/10.1016/j.eswa.2016.09.009
Cho, S. G., Cho, J. H., and Kim, S. B. (2015). Discovering meaningful trends in the inaugural addresses of United States Presidents via text mining, Journal of Korean Institute of Industrial Engineers, 41, 453-460. https://doi.org/10.7232/JKIIE.2015.41.5.453
Dumais, S. (1991). Improving the retrieval of information from external sources, Behavior Research Methods, Instruments & Computers, 23, 229-236. https://doi.org/10.3758/BF03203370
Hornik, K., Meyer, D., and Karatzoglou, A. (2006). Support vector machines in R, Journal of Statistical Software, 15, 1-28.
Jung, M. J. (2017). A study on clustering methods for proximity data in text mining (Master thesis), Pusan National University.
Lee, M. R. and Bae, H. K. (2002). Design of keyword extraction system using TFIDF, The Korean Society for Cognitive Science, 13, 1-11.
Miner, G., Elder, J., and Hill, T. (2012). Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Academic Press, Seoul.
Nakov, P., Popova, A., and Mateev, P. (2001). Weight functions impact on LSA performance. In Proceedings of the Recent Advances in Natural Language Processing, Bulgaria, 187-193.
Ren, F. and Sohrab, M. G. (2013). Class-indexing-based term weighting for automatic text classification, Information Sciences, 236, 109-125. https://doi.org/10.1016/j.ins.2013.02.029
Satopaa, V., Albrecht, J., Irwin, D., and Raghavan, B. (2011). Finding a "kneedle" in a haystack: detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops (ICDCSW), IEEE, 166-171.
Yang, Y. and Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 42-49.
