Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities

Kim, Pan-Jun; Lee, Jae-Yun

doi:10.3743/KOSIM.2007.24.1.251

Journal of the Korean Society for Information Management (정보관리학회지)

Volume 24 Issue 1 Serial No. 63 / Pages 251-271 / 2007 / 1013-0799 (pISSN) / 2586-2073 (eISSN)

Korean Society for Information Management (한국정보관리학회)

문헌간 유사도를 이용한 자동분류에서 미분류 문헌의 활용에 관한 연구

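Kim, Pan-Jun (Department of Library and Information Science, Yonsei University);
Lee, Jae-Yun (Department of Library and Information Science, Kyonggi University)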

Published: 2007.03.30

https://doi.org/10.3743/KOSIM.2007.24.1.251

Abstract

This paper studies the problem of classifying documents using both labeled and unlabeled training data, with particular attention to document similarity features. Using unlabeled data is practically important because obtaining training labels is expensive in many information systems, while large quantities of unlabeled documents are readily available. A semi-supervised learning algorithm generally has two steps. First, it trains a classifier on the available labeled documents and uses it to classify the unlabeled documents. Then, it trains a new classifier on all the training documents, whether they were labeled manually or automatically. We propose two types of semi-supervised learning algorithm that use document similarity features. The first is one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features. The second is two-step semi-supervised learning, which uses unlabeled documents as training examples as well as for generating similarity features. Experimental results obtained with support vector machines and a naive Bayes classifier show that training with a small set of labeled documents and a large set of unlabeled documents yields better performance than supervised learning that uses labeled data only. When the efficiency of a classifier system is taken into account, the one-step semi-supervised learning algorithm proposed in this study could be a good solution for improving classification performance with unlabeled documents.

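We explored ways to improve classification performance by using unlabeled documents to train a classifier that takes inter-document similarities as features. Because manually obtaining a large number of training documents for automatic classification is costly, using unlabeled documents is important in practice. Most semi-supervised learning algorithms that use unlabeled documents consist of a first step, in which the manually classified documents serve as training data for classifying the unlabeled documents, and a second step, in which a classifier is trained on both the manually and the automatically classified documents. In this paper we examined two semi-supervised learning algorithms for the setting in which inter-document similarity features are applied. Of these, the one-step semi-supervised method is simple because it uses unlabeled documents only to generate document similarity features, whereas the two-step semi-supervised method uses unlabeled documents as training examples in addition to generating document similarity features. Experiments with support vector machines and a naive Bayes classifier showed that both semi-supervised methods outperform supervised learning that does not use unlabeled documents. In particular, we concluded that, when execution efficiency is considered, the proposed one-step semi-supervised method is a good way to improve classification performance with unlabeled documents.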
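
The two variants described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a nearest-centroid classifier stands in for the support vector machine and naive Bayes classifiers, the documents are toy strings, and every function name (`similarity_features`, `one_step`, `two_step`, `classify`) is a hypothetical choice made for this example. Each document is represented by its cosine similarity to every document in a pool; in both variants the pool contains the unlabeled documents, and only the two-step variant also adds them to the training set with automatically assigned labels.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_features(doc, pool):
    """Represent a document by its similarity to every document in the pool."""
    vec = Counter(doc.split())
    return [cosine(vec, Counter(p.split())) for p in pool]

def train_centroids(X, y):
    """Nearest-centroid stand-in (an assumption) for SVM / naive Bayes."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], x))
    return min(sorted(centroids), key=dist)

def one_step(labeled, labels, unlabeled):
    """Unlabeled documents only widen the similarity-feature pool."""
    pool = labeled + unlabeled
    X = [similarity_features(d, pool) for d in labeled]
    return pool, train_centroids(X, labels)

def two_step(labeled, labels, unlabeled):
    """Unlabeled documents also become training examples with predicted labels."""
    pool, first = one_step(labeled, labels, unlabeled)
    guessed = [predict(first, similarity_features(d, pool)) for d in unlabeled]
    X = [similarity_features(d, pool) for d in pool]
    return pool, train_centroids(X, labels + guessed)

def classify(model, doc):
    """Classify a new document with a (pool, centroids) model."""
    pool, centroids = model
    return predict(centroids, similarity_features(doc, pool))

# Toy data (invented for illustration only).
labeled = ["svm kernel margin", "library catalog index"]
labels = ["ml", "lis"]
unlabeled = [
    "kernel margin classifier",
    "classifier training data",
    "catalog index thesaurus",
    "thesaurus subject heading",
]

print(classify(one_step(labeled, labels, unlabeled), "training data classifier"))
print(classify(two_step(labeled, labels, unlabeled), "subject heading thesaurus"))
```

In this toy setup the test documents share no terms with the labeled documents, so they can only be related to a class through their similarity to unlabeled documents in the pool. The one-step variant trains once on the labeled documents, while the two-step variant needs an extra classification pass over the unlabeled documents and a second training run on the enlarged set, which is the efficiency trade-off the abstract refers to.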
