The Effect of the Quality of Pre-Assigned Subject Categories on the Text Categorization Performance

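Korean title: 학습문헌집합에 기 부여된 범주의 정확성과 문헌 범주화 성능 (Correctness of Categories Pre-Assigned to the Training Document Set and Text Categorization Performance)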

  • Published : 2006.06.01

Abstract

Text categorization typically assumes that the labels assigned to training documents are correct to a certain degree, yet this assumption is made without solid knowledge of label quality in real-world collections. This study examines the quality of pre-assigned subject categories in a real-world collection and identifies the relationship between the quality of category assignment in the training set and text categorization performance. In particular, we ask to what extent performance can be improved by enhancing the quality (i.e., correctness) of category assignment in the training documents. A collection of 1,150 computer science abstracts was re-classified by an expert group and, after 15 duplicates were removed, divided into 907 training documents and 227 test documents. Categorization performance before and after re-classification (the Initial set versus the Recat-1 and Recat-2 sets) was compared using a kNN classifier. The average correctness of subject categories in the Initial set is 16%, and categorization with the Initial set yields an $F_1$ value of 17%. In contrast, the Recat-1 set achieves an $F_1$ value of 61%, about 3.6 times that of the Initial set.

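Text categorization assumes that the subject categories assigned to the training document set are accurate to a certain level. However, this assumption is made without knowledge of real-world document collections. This study attempted to determine the level of accuracy of pre-assigned subject categories in a real-world collection and to confirm the relationship between the accuracy of the categories pre-assigned to the training set and text categorization performance. In particular, it sought to determine how far categorization performance can be improved by raising the quality of the subject categories assigned to the training set through manual re-indexing. To this end, 1,150 abstract records in science and technology were re-indexed by an expert group; after 15 duplicate documents were removed, the records were divided into a training set of 907 documents and a test set of 227 documents. The categorization performance of the collections before and after re-indexing (the initial collection, Recat-1, and Recat-2) was compared using a kNN classifier. The average accuracy of category assignment in the initial collection was 16%, and its categorization performance was 17% in $F_1$ value. In contrast, the Recat-1 collection, whose subject-category accuracy had been improved, reached an $F_1$ value of 61%, about 3.6 times the performance of the initial collection.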
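The mechanism the abstract describes, a kNN classifier inheriting whatever errors sit in its training labels, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy documents, the labels "AI" and "DB", the bag-of-words cosine similarity, and the function names `cosine`, `knn_predict`, and `f1_score` are hypothetical and are not the study's data, features, or implementation. The last lines only re-check the arithmetic behind the reported "3.6 times" figure.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3):
    """Return the majority label among the k training texts most similar to query.

    train is a list of (text, label) pairs; similarity is plain term-count
    cosine (an assumption, the paper does not state its weighting scheme here).
    """
    q = Counter(query.lower().split())
    ranked = sorted(
        train,
        key=lambda doc: cosine(Counter(doc[0].lower().split()), q),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def f1_score(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy training set with correct labels.
clean_train = [
    ("neural network training gradient", "AI"),
    ("neural network learning model", "AI"),
    ("machine learning model classifier", "AI"),
    ("database query index transaction", "DB"),
    ("database transaction index storage", "DB"),
    ("sql query storage engine", "DB"),
]

# Same texts, but the first two labels are wrong (simulated mis-assignment).
noisy_train = [(text, "DB") if i < 2 else (text, label)
               for i, (text, label) in enumerate(clean_train)]

query = "neural network model"
clean_prediction = knn_predict(clean_train, query)   # "AI"
noisy_prediction = knn_predict(noisy_train, query)   # "DB"

# Ratio of the two F1 values reported in the abstract (61% vs. 17%).
reported_ratio = 0.61 / 0.17  # about 3.59, reported as 3.6

if __name__ == "__main__":
    print(clean_prediction, noisy_prediction, round(reported_ratio, 1))
```

In this toy case the two nearest neighbours of the query carry the wrong label in `noisy_train`, so the vote flips even though the texts are unchanged. That is the sense in which label correctness in the training set bounds what a kNN classifier can achieve.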
