An Empirical Study on Improving the Performance of Text Categorization Considering the Relationships between Feature Selection Criteria and Weighting Methods

자질 선정 기준과 가중치 할당 방식간의 관계를 고려한 문서 자동분류의 개선에 대한 연구

  • Published : 2005.06.01

Abstract

This study aims to find consistent strategies for feature selection and feature weighting that can improve the effectiveness and efficiency of the kNN text classifier. Feature selection criteria and feature weighting methods are as important as classification algorithms for achieving good performance in text categorization systems. Most earlier studies chose conflicting strategies for feature selection criteria and weighting methods. In this study, the performance of several feature selection criteria is measured in terms of the storage space for inverted index records and the classification time. The classification experiments examine the performance of IDF as a feature selection criterion and the performance of conventional feature selection criteria, e.g. mutual information, as feature weighting methods. The results suggest that by using measures that prefer low-frequency features both as the feature selection criterion and as the feature weighting method, we can increase the classification speed by three to five times without losing classification accuracy.

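This study sought ways to improve the performance of the kNN classifier by adopting a consistent strategy for feature selection and feature weighting in automatic text classification. In automatic text classification, the feature selection method and the feature weighting method are, together with the classification algorithm, important factors that determine classification performance. Previous studies have used conflicting strategies when deciding on these two methods. In this study, feature selection results were evaluated on the basis of classification performance relative to index file storage space and execution time, yielding results that differ from those of previous studies. Selecting features with criteria that prefer low-frequency features, such as mutual information, or even with inverse document frequency, proved desirable for both the effectiveness and the efficiency of the kNN classifier. Consistently applying a measure that prefers low-frequency features to both feature selection and feature weighting increased the processing speed of the kNN classifier by about three to five times with no loss of classification performance.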
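The consistent strategy described in the abstract can be illustrated with a minimal sketch in which one low-frequency-preferring measure (IDF) both selects the features and weights them before a cosine-similarity kNN vote. This is not the authors' implementation: the function names, the top-k cutoff, the tie-breaking rule, and the similarity-weighted vote are all assumptions made for illustration, and the abstract does not state the thresholds or index structures actually used.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def select_features(idf, k):
    """Keep the k terms with the highest IDF (assumed cutoff rule).

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(idf, key=lambda term: (-idf[term], term))
    return set(ranked[:k])


def vectorize(doc, idf, features):
    """Weight each selected term by tf * idf, the same measure used for selection."""
    tf = Counter(term for term in doc if term in features)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query_vec, train_vecs, labels, k=3):
    """Label a query by a similarity-weighted vote among its k nearest neighbours."""
    ranked = sorted(
        zip(train_vecs, labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    votes = Counter()
    for vec, label in ranked[:k]:
        votes[label] += cosine(query_vec, vec)
    return votes.most_common(1)[0][0]
```

Shrinking the feature set this way shortens every document vector, which is where the storage and classification-time savings discussed in the abstract would come from. The sketch compares dict vectors directly and does not build the inverted index the study measures.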

