An Active Learning-based Method for Composing Training Document Set in Bayesian Text Classification Systems

베이지언 문서분류시스템을 위한 능동적 학습 기반의 학습문서집합 구성방법

  • 김제욱 (Technology Research Institute, Daewoo Information Systems) ;
  • 김한준 (School of Computer Science and Engineering, College of Engineering, Seoul National University) ;
  • 이상구 (School of Computer Science and Engineering, College of Engineering, Seoul National University)
  • Published: 2002.12.01

Abstract

There are two important problems in improving text classification systems based on a machine learning approach. The first, called the "selection problem", is how to select a minimum number of informative documents from a given document collection. The second, called the "composition problem", is how to reorganize the selected training documents so that they fit the adopted learning method. The former problem is addressed by "active learning" algorithms, and the latter by "boosting" algorithms. This paper proposes a new learning method, called AdaBUS, which proactively solves both problems in the context of Naive Bayes classification systems. The proposed method constructs a more accurate classification hypothesis by increasing the variance among the "weak" hypotheses that determine the final classification hypothesis. Consequently, the proposed algorithm yields a perturbation effect that makes the boosting algorithm work properly. Through empirical experiments using the Reuters-21578 document collection, we show that the AdaBUS algorithm improves a Naive Bayes-based classification system more significantly than other conventional learning methods.
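The "selection problem" above is commonly attacked with uncertainty sampling, the active-learning strategy of Lewis and Gale (1994): have a human label the pool documents the current classifier is least sure about. A minimal illustrative sketch of that selection step — not the paper's implementation; the `posteriors` callback and the pool representation are assumptions:

```python
def uncertainty_sample(pool, posteriors, k=1):
    """Pick the k pool documents the current model is least certain about.

    pool       -- iterable of documents (any representation)
    posteriors -- callable mapping a document to {class: probability}
    """
    def margin(doc):
        # Difference between the top two class probabilities;
        # a small margin means the classifier is uncertain about this document.
        ps = sorted(posteriors(doc).values(), reverse=True)
        return ps[0] - (ps[1] if len(ps) > 1 else 0.0)

    return sorted(pool, key=margin)[:k]
```

Documents chosen this way are labeled and added to the training set, so each round of labeling effort buys the most informative examples.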

The most important factors determining the accuracy of a text classification system based on machine learning are the selection of the training document set and the method by which it is composed. The selection problem is to pick, from an arbitrary document space, a small set of documents carrying high information content and adopt them as training documents. The composition problem is to reorganize the selected training set so as to build a more accurate classification hypothesis. The representative algorithm for the former is active learning; for the latter, boosting. In this paper, we apply both algorithms to the Naive Bayes classification algorithm, analyze the characteristics that arise, and propose AdaBUS, a new method for composing the training document set. Borrowing the idea of active learning, the algorithm raises the variance among the temporary weak hypotheses from which the final classification hypothesis is built. This realizes the perturbation effect, the key condition for boosting to operate effectively, and thereby improves classification accuracy. Through empirical experiments on the Reuters-21578 document collection, we show that AdaBUS improves the accuracy of a Naive Bayes-based text classification system more than existing algorithms.
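The "composition problem" is handled by boosting: repeatedly retrain the weak learner on a reweighted training set, raising the weight of misclassified documents so that successive weak hypotheses differ (the perturbation the abstract refers to). A minimal sketch of this loop — plain AdaBoost.M1 over a weighted Bernoulli Naive Bayes on binary feature vectors, not the actual AdaBUS algorithm; all names are illustrative:

```python
import math

def train_weighted_nb(X, y, w):
    """Bernoulli Naive Bayes on binary feature vectors, with per-document weights."""
    classes = sorted(set(y))
    total = sum(w)
    prior, cond = {}, {}
    for c in classes:
        wc = sum(wi for wi, yi in zip(w, y) if yi == c)
        prior[c] = wc / total
        # Laplace-smoothed weighted probability that feature j is present in class c.
        cond[c] = [(sum(wi for xi, yi, wi in zip(X, y, w) if yi == c and xi[j]) + 1.0)
                   / (wc + 2.0)
                   for j in range(len(X[0]))]

    def predict(x):
        def log_post(c):
            lp = math.log(prior[c] + 1e-12)
            for p, xj in zip(cond[c], x):
                lp += math.log(p if xj else 1.0 - p)
            return lp
        return max(classes, key=log_post)
    return predict

def adaboost_nb(X, y, rounds=5):
    """AdaBoost.M1 using the weighted Naive Bayes above as the weak learner."""
    n = len(X)
    w = [1.0 / n] * n
    hyps, alphas = [], []
    for _ in range(rounds):
        h = train_weighted_nb(X, y, w)
        err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # clamp away from 0 and 1
        if err >= 0.5 and hyps:  # weak learner no better than chance: stop
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        hyps.append(h)
        alphas.append(alpha)
        # Reweight: misclassified documents gain weight (the perturbation step).
        w = [wi * math.exp(alpha if h(xi) != yi else -alpha)
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]

    def classify(x):
        # Final hypothesis: weighted vote of the weak hypotheses.
        votes = {}
        for h, a in zip(hyps, alphas):
            votes[h(x)] = votes.get(h(x), 0.0) + a
        return max(votes, key=votes.get)
    return classify
```

Per the abstract, AdaBUS augments this skeleton with active-learning ideas to increase the variance among the weak hypotheses; the loop above shows only the standard boosting machinery.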

References

  1. Tom M. Mitchell. Machine Learning. McGraw-Hill International Editions, chapter 6, 1997
  2. R. Agrawal, R. Bayardo, and R. Srikant. Athena: Mining-based Interactive Management of Text Databases. In Proceedings of the 7th International Conference on Extending Database Technology, pages 365-379, 2000
  3. Pedro Domingos and Michael Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proceedings of the 13th International Conference on Machine Learning, pages 105-112, 1996
  4. 김제욱, 김한준, and 이상구. A Study on an Incremental Learning Model for the Naive Bayes Document Classifier. Journal of Information Technology and Database, 8(1), pages 95-104, 2001
  5. David D. Lewis and William A. Gale. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, 1994
  6. Yoav Freund and Robert E. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148-156, 1996
  7. David D. Lewis and Jason Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning. In Proceedings of the 11th International Conference on Machine Learning, pages 148-156, 1994
  8. M. Tresch, N. Palmer, and A. Luniewski. Type Classification of Semi-structured Documents. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), 1995
  9. Yoav Freund and Robert E. Schapire. A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1), pages 119-139, 1997 https://doi.org/10.1006/jcss.1997.1504
  10. J. R. Quinlan. Bagging, Boosting, and C4.5. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 725-730, 1996
  11. Robert E. Schapire. The Strength of Weak Learnability. Machine Learning, 5(2), pages 197-227, 1990 https://doi.org/10.1023/A:1022648800760
  12. Robert E. Schapire and Yoram Singer. BoosTexter: A Boosting-based System for Text Categorization. Machine Learning, 39(2), pages 135-168, 2000 https://doi.org/10.1023/A:1007649029923
  13. Robert E. Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3), pages 297-336, 1999 https://doi.org/10.1023/A:1007614523901
  14. Leo Breiman. Arcing Classifiers. The Annals of Statistics, 26(3), pages 801-849, 1998 https://doi.org/10.1214/aos/1024691079
  15. Kai Ming Ting and Zijian Zheng. Improving the Performance of Boosting for Naive Bayesian Classification. In Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1999
  16. Zijian Zheng. Naive Bayesian Classifier Committees. In Proceedings of European Conference on Machine Learning, pages 196-207, 1998 https://doi.org/10.1007/BFb0026690
  17. Ron Kohavi, David H. Wolpert. Bias Plus Variance Decomposition for Zero-One Loss Functions. In Proceedings of the 13th International Conference on Machine Learning, pages 275-283, 1996
  18. Yiming Yang. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, 1(1), pages 69-90, 1999 https://doi.org/10.1023/A:1009982220290
  19. Yiming Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning, pages 412-420, 1997