Improving the Performance of SVM Text Categorization with Inter-document Similarities

A Study on Improving the Document Classification Performance of SVM Classifiers Using Inter-document Similarities

  • Published : 2005.09.30

Abstract

The purpose of this paper is to explore ways to improve the performance of SVM (Support Vector Machine) text classifiers using inter-document similarities. SVMs are powerful machine learning systems and are considered the state-of-the-art technique for automatic document classification. This paper proposes an SVM text categorization approach based on feature representation with document vectors. In this approach, document vectors are used as features instead of index terms, and vector similarities are used as feature values instead of term weights. Experiments show that an SVM classifier with document vector features can improve document classification performance. For run-time efficiency, two methods are developed: one selects a subset of document vector features, and the other uses category centroid vectors as features instead. Experiments on these two methods show that a small set of vector features yields better performance than conventional methods with index term features.

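The purpose of this paper is to improve the performance of SVM (Support Vector Machine) classifiers using inter-document similarities. SVMs are effective machine learning systems, recognized as a top-level technique for automatic document classification. This study proposes SVM-based automatic document classification built on a document vector feature representation. The proposed method uses document vectors instead of index terms as classification features, and vector similarities instead of term weights as feature values. In experiments, the proposed method improved the performance of the SVM classifier. To improve run-time efficiency, a method for selecting document vector features and a method that uses category centroid vectors are proposed. The experiments show that even a small set of vector features achieves better performance than the conventional approach using index term features.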
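
The feature representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the term weighting scheme, the similarity measure, or the SVM settings, so raw term frequency, cosine similarity, the toy documents, and all function names below are assumptions chosen for illustration. The sketch only builds the feature vectors; training an SVM on them is left out.

```python
import math
from collections import Counter


def term_vector(text):
    """Sparse term-frequency vector (assumed weighting; the paper's may differ)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds the counts together
    return {t: w / len(vectors) for t, w in total.items()}


def similarity_features(doc, reference_vectors):
    """One feature per reference vector; the feature value is the similarity."""
    return [cosine(doc, r) for r in reference_vectors]


# Hypothetical toy training set.
train_texts = [
    "stock market shares fall",
    "market investors buy shares",
    "team wins football match",
    "football players score goal",
]
train_labels = ["finance", "finance", "sports", "sports"]
train_vecs = [term_vector(t) for t in train_texts]

# Variant 1: every training document is a feature.
doc_features = [similarity_features(v, train_vecs) for v in train_vecs]

# Variant 2: one feature per category centroid (a much smaller feature set).
categories = sorted(set(train_labels))
centroids = [
    centroid([v for v, y in zip(train_vecs, train_labels) if y == c])
    for c in categories
]
centroid_features = [similarity_features(v, centroids) for v in train_vecs]

# An unseen document is mapped into the same feature space before classification.
new_features = similarity_features(term_vector("investors sell shares"), centroids)
```

With document vector features, the feature count equals the number of reference documents, which is why the abstract turns to feature selection or category centroids for run-time efficiency. In the centroid variant each document is described by only as many values as there are categories.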
